Job Purpose
The Distributed ML Engineer will work at the forefront of optimizing performance for machine learning
software stacks, particularly for training and inference, and will support the team in developing new, cutting-edge
systems. The ideal candidate will have a strong background in parallel computing and hands-on experience with
system-level coding, debugging methodologies, and large-scale machine learning.
Key Responsibilities:
- Understand, analyze, profile, and optimize deep learning workloads on state-of-the-art hardware and software
platforms, and guide the team in improving workload efficiency at different levels of optimization
- Design and implement performance benchmarks and testing methodologies to evaluate application performance
- Build tools to automate workload analysis, workload optimization, and other critical workflows
- Triage system issues and identify bottlenecks and inefficiencies by analyzing their root causes and their
impact on hardware and network, and propose solutions to enhance GPU utilization
- Support the team in developing appropriate kernels and systems for new model architectures and algorithms
- Participate in, or lead, design reviews with peers and stakeholders to decide among available technologies.
- Review code developed by other developers and provide feedback to ensure adherence to best practices (e.g.,
style guidelines, code check-in procedures, accuracy, testability, and efficiency).
- Contribute to existing documentation or educational content and adapt content based on product/program updates
and user feedback.
- Represent MBZUAI at industry conferences and events, showcasing the institution’s cutting-edge HPC and deep
learning capabilities and establishing MBZUAI as a global leader in AI research and innovation.
- Perform all other duties as reasonably directed by the line manager that are commensurate with these
functional objectives.
Academic Qualifications
- Ph.D. in CS, EE, or CSEE with 1+ years of work experience, or a Master's in CS, EE, or CSEE (or equivalent)
with 2+ years of work experience
Minimum Professional Experience
- Background in deep learning model architectures and experience with PyTorch and large-scale distributed
training.
- Proficiency in Python and C/C++ for analyzing and optimizing code.
- Excellent problem-solving and troubleshooting skills to address complex technical challenges.
- Effective communication and collaboration skills for working with cross-functional teams.
- Experience using multi-node GPU infrastructure.
Preferred Professional Experience
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems.
- Deep understanding of GPU, CPU, or other AI accelerator architectures.
- Experience writing and optimizing compute kernels in CUDA, Triton, or similar languages.
- Familiarity with LLM architectures and training infrastructure.
- Experience driving ML accuracy with low-precision formats.
- 3+ years of relevant industry experience.
- Experience in performance optimization of large-scale distributed systems.
- Systematic problem-solving approach, coupled with effective verbal and written communication skills.