Job Purpose
The Distributed ML Engineer will work at the forefront of optimizing performance for machine learning
software stacks, particularly for training and inference, and will support the team in developing new, cutting-edge
systems. The ideal candidate will have a strong background in parallel computing and hands-on experience with
system-level coding, debugging methodologies, and large-scale machine learning.
Key Responsibilities:
- Understand, analyze, profile, and optimize deep learning workloads on state-of-the-art hardware and software
platforms, and guide the team in improving workload efficiency at different levels of optimization
- Design and implement performance benchmarks and testing methodologies to evaluate application performance
- Build tools to automate workload analysis, workload optimization, and other critical workflows
- Triage system issues and identify bottlenecks and inefficiencies by analyzing their root causes and their
impact on hardware and network, and propose solutions to enhance GPU utilization
- Support the team in developing appropriate kernels and systems for new model architectures and algorithms
- Participate in, or lead, design reviews with peers and stakeholders to decide among available technologies.
- Review code developed by other developers and provide feedback to ensure adherence to best practices (e.g.,
style guidelines, code check-in procedures, accuracy, testability, and efficiency).
- Contribute to existing documentation or educational content and adapt content based on product/program updates
and user feedback.
- Represent MBZUAI at industry conferences and events, showcasing the institution’s cutting-edge HPC and deep
learning capabilities and establishing MBZUAI as a global leader in AI research and innovation.
- Perform all other duties as reasonably directed by the line manager that are commensurate with these
functional objectives.
Academic Qualifications
- Ph.D. in CS, EE, or CSEE with 1+ years of work experience, or a Master's in CS, EE, or CSEE (or equivalent)
with 2+ years of work experience
Minimum Professional Experience
- Background in deep learning model architectures and experience with PyTorch and large-scale distributed
training.
- Proficiency in Python and C/C++ for analyzing and optimizing code.
- Excellent problem-solving and troubleshooting skills to address complex technical challenges.
- Effective communication and collaboration skills for working with cross-functional teams.
- Experience using multi-node GPU infrastructure.
Preferred Professional Experience
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems.
- Deep understanding of GPU, CPU, or other AI accelerator architectures.
- Experience writing and optimizing compute kernels in CUDA, Triton, or similar languages.
- Familiarity with LLM architectures and training infrastructure.
- Experience driving ML accuracy with low-precision formats.
- 3+ years of relevant industry experience.
- Experience in performance optimization of large-scale distributed systems.
- Systematic problem-solving approach, coupled with effective verbal and written communication skills.