Job Purpose
As a Deep Learning Data Engineer, you will be instrumental in building and maintaining our data infrastructure, with a focus on handling large-scale data for training and inference of deep learning models. You will be responsible for crawling, cleaning, and transforming raw data into formats suitable for training complex deep learning models. Your expertise in big data and cluster platforms such as MapReduce, Hadoop, Spark, and Kubernetes will be crucial to processing and managing data efficiently for large deep learning tasks.
Location
Paris, Abu Dhabi, or Silicon Valley
Affiliation
Successful applicants may choose to work at MBZUAI or Inception (a G42 company), by mutual agreement.
Key Responsibilities
Data crawling and collection
- Develop and implement advanced data crawling strategies to acquire vast amounts of structured and unstructured data from diverse sources, including websites, APIs, and databases.
Data cleaning and preprocessing
- Apply sophisticated data cleaning techniques to handle missing or inconsistent data, ensuring high-quality data for training large deep learning models.
Data transformation
- Design and implement data transformation pipelines optimized for processing and preparing data for training complex deep learning models.
Big data processing
- Use big data platforms such as MapReduce, Hadoop, and Spark to efficiently process and analyze the large-scale datasets required for training large deep learning models.
Database management
- Establish and manage databases tailored to store and access large volumes of processed data, ensuring data security, reliability, and efficient data retrieval.
Extract, Transform, Load (ETL)
- Develop and maintain ETL workflows that effectively extract data from diverse sources, transform it to meet deep learning model requirements, and load it into data warehouses or databases.
Performance optimization
- Optimize data processing workflows and algorithms to achieve superior performance for training and inference of large deep learning models.
Data modeling for deep learning
- Collaborate closely with data scientists and deep learning researchers to understand data requirements and design appropriate data models that cater to the needs of large and complex deep learning tasks.
Data governance
- Implement robust data governance practices to ensure data accuracy, security, and compliance with data regulations, especially when working with sensitive data.
Big data platform management
- Manage and configure big data platforms to ensure their stability, scalability, and seamless integration with deep learning workflows.
Documentation
- Document data engineering processes, data flows, and data models specific to large deep learning tasks, enabling knowledge sharing and future reference.
Job Specifications
Academic Qualification
Completion of a Bachelor’s or Master’s degree in a relevant field such as Computer Science, Software Engineering, Data Science, or a closely related discipline, with a focus on data management, processing, and engineering.
Professional Experience
Essential
- At least six years’ programming experience with solid coding skills in Python, Shell, and Java.
- At least four years of relevant experience as a data engineer within the data and analytics domain.
- Demonstrable expertise in big data platforms such as MapReduce, Hadoop, and Spark.
- Experience with solution architecture, data ingestion, query optimization, data segregation, ETL and ELT, AWS services (EC2, S3, SQS, Lambda, Redshift), Elasticsearch, and CI/CD frameworks and workflows.
- Working knowledge of data platform concepts: data lakes, data warehouses, ETL, big data processing (designing for and supporting variety, velocity, and volume), real-time processing architectures for data platforms, and the scheduling and monitoring of ETL/ELT jobs.
- Experience with RDS databases such as PostgreSQL and with programming (preferably Java or Python), plus proficiency in understanding data, entity relationships, structured and unstructured data, and SQL and NoSQL databases.
- Knowledge of best practices in optimizing columnar and distributed data processing systems and infrastructure.
- Experience in designing and implementing dimensional modeling.
- Knowledge of machine learning and data mining techniques in one or more areas of statistical modeling, text mining, and information retrieval.
- Strong collaboration and communication skills.