Senior High Performance Computing Engineer

Home / LLMVacancy / Senior High Performance Computing Engineer

Senior High Performance Computing Engineer

Application open Full Time

Job Purpose

As an HPC Expert, you will play a pivotal role in advancing our organization’s computational capabilities and accelerating the development of cutting-edge deep learning solutions. You will be responsible for managing and optimizing the performance of GPU clusters and implementing distributed computing strategies to efficiently train large-scale deep learning models.

Location

Abu Dhabi Only

Affiliation

Successful applicants may choose to work at MBZUAI or Inception (a G42 company) as per mutual agreements.

Key Responsibilities

Functional

GPU cluster management

Design, deploy, and maintain high-performance GPU clusters, ensuring their stability, reliability, and scalability.
Monitor and manage cluster resources to maximize utilization and efficiency.

Distributed/parallel training

Implement distributed computing techniques to enable parallel training of large deep learning models across multiple GPUs and nodes.
Optimize data distribution and synchronization to achieve faster convergence and reduced training times.

Performance optimization

Fine-tune GPU clusters and deep learning frameworks to achieve optimal performance for specific workloads.
Identify and resolve performance bottlenecks through profiling and system analysis.

Deep learning framework integration

Collaborate with data scientists and machine learning engineers to integrate distributed training capabilities into existing deep learning frameworks (e.g., TensorFlow, PyTorch, MXNet).

Scalability and resource management

Ensure GPU clusters can scale effectively to handle increasing computational demands.
Develop resource management strategies to prioritize and allocate computing resources based on project requirements.

Security and compliance

Implement security measures to protect GPU clusters and data while adhering to industry best practices and compliance standards.

Troubleshooting and support

Troubleshoot and resolve issues related to GPU clusters, distributed training, and performance anomalies.
Provide technical support to users and resolve technical challenges efficiently.

Documentation

- Create and maintain documentation related to GPU cluster configuration, distributed training workflows, and best practices to ensure knowledge sharing and seamless onboarding of new team members.

Job Specifications

Academic Qualification

A Master’s degree or Ph.D. in Computer Science or a related field with a focus on high-performance computing, distributed systems, or deep learning.

Professional Experience

Essential

At least three years experience successfully managing GPU clusters, including installation, configuration, and optimization.
Demonstrably strong expertise in distributed deep learning and parallel training techniques.
Proven proficiency in popular deep learning frameworks like TensorFlow, PyTorch, or MXNet.
Programming skills in Python and experience with GPU-accelerated libraries (e.g., CUDA, cuDNN).
Knowledge of performance profiling and optimization tools for HPC and deep learning.
Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).