Dautom

AI Infrastructure Engineer

Sharjah, Sharjah Emirate, United Arab Emirates
Full-time
Posted 2 Jul 2026
IT Services and IT Consulting

Job Description

The AI Infrastructure Engineer is a platform specialist responsible for architecting, building, and operating high-performance AI infrastructure to support advanced AI workloads, including LLMs, GenAI, Computer Vision, and MLOps. The role focuses on managing GPU clusters (NVIDIA A100/H100), deploying and maintaining Red Hat OpenShift AI (RHODS), and ensuring secure, scalable, and cost-efficient AI platforms across SDD’s Sovereign Cloud and hybrid/multi-cloud environments. The engineer will enable enterprise-grade AI adoption for 200+ government entities.

Key Responsibilities & Deliverables

GPU & AI Platform Architecture
Responsibilities: Design and implement GPU-based compute clusters. Define reference architectures for LLM hosting, Vector Databases, MLOps, and high-performance storage/networking.
Deliverables and KPIs: Fully operational GPU-based AI infrastructure. GPU Cluster Uptime and Performance Utilization. Reduction in Cost per Training/Inference Workload.

GPU Cluster Operations
Responsibilities: Install, configure, and optimize core components: CUDA, cuDNN, NCCL, NVIDIA Drivers, and GPU Operators. Implement GPU partitioning, scheduling, and performance tuning for high-end GPUs (e.g., A100/H100).
Deliverables and KPIs: High-availability architecture for all AI workloads. Complete documentation and runbooks.

OpenShift AI (RHODS) Management
Responsibilities: Deploy, configure, and maintain the Red Hat OpenShift AI (RHODS) platform for multi-tenant use. Manage the integration of the NVIDIA GPU Operator for efficient GPU scheduling, and support Data Scientists with Notebooks, Training, and Inference Endpoints.
Deliverables and KPIs: Production-ready OpenShift AI (RHODS) platform. AI Project Onboarding Speed.

LLM & Model Serving
Responsibilities: Build and manage infrastructure for hosting and serving open-source LLMs (Llama, Falcon, Mistral), with support for RAG pipelines, LoRA adapters, and Vector Databases (Milvus, pgvector).
Deliverables and KPIs: Multi-model LLM serving environment for entities. MLOps Pipeline Success Rate and Deployment Frequency.

MLOps & Automation
Responsibilities: Implement IaC (Terraform, Ansible) and GitOps for automated lifecycle management of the AI platform (node onboarding, scaling, model rollout/rollback). Build robust MLOps pipelines for data prep, training, evaluation, and monitoring (using tools like MLflow/Kubeflow).
Deliverables and KPIs: Infrastructure automation via Terraform & Ansible. Automation Coverage for AI Infrastructure.

Required Qualifications & Experience

Experience: 7–12 years in Cloud Infrastructure, DevOps, ML Infrastructure, or Platform Engineering.

Deep Hands-On Expertise:
GPU Systems (NVIDIA A100/H100), Linux, Containers, and Kubernetes.
OpenShift AI (RHODS) or equivalent Kubernetes GPU orchestration.
LLM Hosting (Llama, Mistral, Falcon, etc.) and supporting Vector Databases/RAG systems.

Strong Experience In: TensorFlow, PyTorch, Hugging Face, Distributed Training (DDP, DeepSpeed), and MLOps stacks (MLflow, Kubeflow).

Essential Skills & Competencies

Technical: Deep understanding of GPU compute, HPC architectures, and ML performance profiling. Strong skills in IaC (Terraform/Ansible), CI/CD, and OpenShift/Kubernetes operators.

Soft Skills: Strong troubleshooting, optimization, and performance engineering mindset. Excellent cross-functional collaboration and documentation skills.

Preferred Certifications

NVIDIA Deep Learning / AI Infrastructure Certification
Red Hat OpenShift AI specialization
Kubernetes CKA/CKAD
Azure AI or Oracle Cloud AI certifications
Terraform & Ansible certifications

About Dautom

Industry

IT Services and IT Consulting
