Job Purpose
As a Senior HPC Engineer at MBZUAI, the individual will be the driving force behind the
university’s computational transformation, empowering world-class researchers to push the boundaries of deep
learning and artificial intelligence. Leveraging expertise in GPU cluster management and distributed computing, the
engineer will architect and optimize a state-of-the-art HPC infrastructure that enables the efficient training of
large-scale neural networks.
By seamlessly integrating cutting-edge parallel processing capabilities into the research workflows, the engineer
will accelerate the development of groundbreaking AI solutions that have the potential to reshape entire industries.
Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for
high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI
pioneers.
Key Responsibilities:
1. Cloud & On-Prem HPC Infrastructure Management and Optimization
- Design, deploy, and maintain state-of-the-art GPU clusters to support the computational demands of
MBZUAI’s research initiatives.
- Continuously monitor and optimize the performance, reliability, and scalability of the HPC
infrastructure to ensure maximum uptime and efficiency.
- Implement advanced resource management strategies to efficiently allocate computing resources based on
project priorities and emerging research needs.
- Ensure the HPC environment adheres to the highest standards of security, compliance, and energy
efficiency to safeguard sensitive data and minimize environmental impact.
- Develop and oversee comprehensive system monitoring and alerting mechanisms to proactively identify and
resolve issues before they impact research workflows.
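As an illustration of the monitoring and alerting responsibilities above (not a prescribed implementation), the sketch below shows a minimal GPU health check built on NVIDIA's pynvml bindings; the thresholds and the alert() hook are hypothetical placeholders that a real deployment would replace with its own alerting integration.

```python
# Minimal GPU health-check sketch using NVIDIA's pynvml bindings.
# Thresholds and the alert() hook are illustrative placeholders.
import pynvml

UTIL_ALERT_PCT = 5   # hypothetical: a near-idle GPU may indicate a stalled job
TEMP_ALERT_C = 85    # hypothetical thermal threshold

def alert(message: str) -> None:
    # Placeholder for a real alerting integration (email, chat, pager, etc.).
    print(f"[ALERT] {message}")

def check_gpus() -> None:
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            if temp >= TEMP_ALERT_C:
                alert(f"GPU {i} temperature {temp} C exceeds {TEMP_ALERT_C} C")
            if util.gpu <= UTIL_ALERT_PCT:
                alert(f"GPU {i} utilization {util.gpu}% - possible idle allocation")
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    check_gpus()
```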
2. Distributed and Parallel Deep Learning Capabilities
- Spearhead the implementation of distributed computing techniques to enable parallel training of large-scale deep learning models across multiple GPUs and nodes (see the illustrative sketch after this list).
- Collaborate closely with data scientists and machine learning engineers to seamlessly integrate
distributed training capabilities into existing deep learning frameworks (e.g., TensorFlow, PyTorch,
MXNet).
- Develop and optimize data distribution and synchronization strategies to achieve faster model
convergence and reduced training times, empowering researchers to explore more ambitious AI projects.
- Continuously research and implement the latest advancements in distributed deep learning to maintain
MBZUAI’s competitive edge and ensure the institution remains at the forefront of cutting-edge AI
research.
- Provide comprehensive training and support to research teams on the effective utilization of distributed
deep-learning workflows.
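As an illustration of the distributed-training integration referenced in the first item of this list, the sketch below shows a minimal PyTorch DistributedDataParallel training loop launched with torchrun; the model, dataset, and hyperparameters are illustrative placeholders rather than a prescribed MBZUAI workflow.

```python
# Minimal sketch of multi-GPU data-parallel training with PyTorch DDP.
# The model, dataset, and hyperparameters are illustrative placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 10).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Placeholder dataset; DistributedSampler shards it across ranks.
    dataset = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)                   # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                        # gradients all-reduced by DDP
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A run such as `torchrun --nproc_per_node=4 train.py` would start one process per GPU on a single node; multi-node launches add `--nnodes` and rendezvous options.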
3. Performance Engineering and Profiling
- Utilize advanced profiling tools and techniques to identify and resolve performance bottlenecks within the HPC environment, ensuring researchers can maximize the productivity of their computational resources (see the illustrative sketch after this list).
- Fine-tune GPU clusters and deep learning frameworks to deliver optimal performance for specific research
workloads and use cases, driving breakthroughs in areas such as computer vision, natural language
processing, and scientific computing.
- Implement innovative strategies to maximize the utilization and efficiency of MBZUAI’s computing
resources, including techniques like dynamic resource allocation and workload prioritization.
- Develop and maintain a comprehensive knowledge base of performance optimization best practices,
promoting cross-functional collaboration and knowledge sharing among HPC, data science, and machine
learning teams.
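As an illustration of the profiling work referenced in the first item of this list, the sketch below uses torch.profiler to rank operators in a toy training loop by GPU time; the model and inputs are placeholders standing in for a real research workload.

```python
# Minimal sketch: profile a few training steps with torch.profiler to
# surface GPU-time hotspots. The model and inputs are placeholders.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 10)
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(256, 1024, device="cuda")
y = torch.randint(0, 10, (256,), device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    for _ in range(10):                        # profile a handful of steps
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()

# Rank operators by total GPU time to locate the main bottlenecks.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```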
4. Thought Leadership and Collaboration
- Proactively engage with the global HPC and deep learning research communities to stay abreast of the latest trends and innovations, and to identify opportunities for MBZUAI to contribute to and influence the direction of the field.
- Contribute to the development of MBZUAI’s strategic roadmap for HPC and deep learning infrastructure,
ensuring the institution maintains a cutting-edge technological advantage and supports the evolving
needs of its research programs.
- Mentor and train junior HPC engineers, fostering a culture of continuous learning and knowledge sharing
to build a sustainable and highly skilled HPC team.
- Represent MBZUAI at industry conferences and events, showcasing the institution’s cutting-edge HPC and
deep learning capabilities and establishing MBZUAI as a global leader in AI research and innovation.
5. Technical Expertise and Innovation
- Proficiency in at least one public cloud HPC service, preferably AWS SageMaker HyperPod.
- Infrastructure-as-code and DevOps skills such as Terraform, CloudFormation, and Ansible.
- Solid system administration skills, including VMs, VPCs, object and block storage, VPNs, firewalls, IAM, etc.
- Maintain in-depth knowledge of the latest advancements in GPU hardware, deep learning frameworks, and
parallel computing technologies.
- Explore and evaluate emerging HPC and deep learning trends, and champion the adoption of innovative
solutions that can drive MBZUAI’s research agenda forward.
- Collaborate with cross-functional teams to develop novel approaches and methodologies for optimizing the
performance and utilization of MBZUAI’s HPC resources.
- Contribute to the development of intellectual property and thought leadership in the field of
high-performance computing for deep learning.
6. Other Duties
- Perform all other duties, commensurate with these functional objectives, as reasonably directed by the line manager.
Academic Qualification
- Bachelor’s degree in Computer Science, Electrical Engineering, or a related technical field, with a focus on High-Performance Computing, Parallel Processing, Distributed Systems, or Deep Learning.
Professional Experience
- 3+ years of proven experience in the design, deployment, and management of GPU clusters, including installation, configuration, monitoring, and performance optimization.
- Proven expertise in implementing distributed computing techniques and parallel training strategies for deep
learning models.
- Strong proficiency in popular deep learning frameworks (e.g., TensorFlow, PyTorch, MXNet) and GPU-accelerated
programming (e.g., CUDA, cuDNN).
- Hands-on experience with performance profiling and optimization tools for HPC and deep learning.
- Knowledge of resource management and scheduling systems (e.g., SLURM, Kubernetes).
- Excellent problem-solving and troubleshooting skills to address complex technical challenges.
- Effective communication and collaboration skills to work with cross-functional teams.