Distributed ML Engineer

Applications open · Full Time

Job Purpose

The Distributed ML Engineer will work at the forefront of performance optimization for machine learning
software stacks, particularly for training and inference, and will support the team in developing new,
cutting-edge systems. The ideal candidate will have a strong background in parallel computing and hands-on
experience with system-level coding, debugging methodologies, and large-scale machine learning.

Key Responsibilities:

  • Understand, analyze, profile, and optimize deep learning workloads on state-of-the-art hardware and
    software platforms, and guide the team in improving their efficiency at different levels of
    optimization
  • Design and implement performance benchmarks and testing methodologies to evaluate application performance
  • Build tools to automate workload analysis, workload optimization, and other critical workflows
  • Triage system issues, identify bottlenecks and inefficiencies by analyzing their root causes and their
    impact on hardware and network, and propose solutions to enhance GPU utilization
  • Support the team in developing appropriate kernels and systems for new model architectures and algorithms
  • Participate in or lead design reviews with peers and stakeholders to decide among available technologies.
  • Review code developed by other developers and provide feedback to ensure best practices (e.g., style
    guidelines, code check-in, accuracy, testability, and efficiency).
  • Contribute to existing documentation or educational content and adapt content based on product/program updates
    and user feedback.
  • Represent MBZUAI at industry conferences and events, showcasing the institution’s cutting-edge HPC and deep
    learning capabilities and establishing MBZUAI as a global leader in AI research and innovation.
  • Perform all other duties as reasonably directed by the line manager that are commensurate with these
    functional objectives.

Academic Qualifications

  • Ph.D. in CS, EE, or CSEE with 1+ years of work experience, or a Master's in CS, EE, or CSEE (or
    equivalent experience) with 2+ years of work experience

Minimum Professional Experience

  • Background in deep learning model architectures and experience with PyTorch and large-scale distributed
    training.
  • Proficiency in Python and C/C++ for analyzing and optimizing code.
  • Excellent problem-solving and troubleshooting skills to address complex technical challenges.
  • Effective communication and collaboration skills to work with cross-functional teams.
  • Experience using multi-node GPU infrastructure.

Preferred Professional Experience

  • Expertise in designing, analyzing, and troubleshooting large-scale distributed systems.
  • Deep understanding of GPU, CPU, or other AI accelerator architectures.
  • Experience writing and optimizing compute kernels with CUDA/Triton or similar languages.
  • Familiarity with LLM architectures and training infrastructure.
  • Experience driving ML accuracy with low-precision formats.
  • 3+ years of relevant industry experience.
  • Experience in performance optimization of large-scale distributed systems.
  • Systematic problem-solving approach, coupled with effective verbal and written communication skills.

Apply to vacancy
