Research, Pre-Training Data

$350k - $475k • San Francisco

Posted 1mo ago

Job Location

San Francisco

Tech Stack

OpenAI, Mistral, Python, PyTorch, TensorFlow, Deep Learning, JAX, Distributed Training, Machine Learning, Data Engineering, Multimodal Data

Remote Work Policy

On-site

About the job

Thinking Machines Lab is seeking pre-training researchers to join its mission of advancing collaborative general intelligence. The role blends research with large-scale data engineering and is central to developing the next generation of AI models. You will assemble pre-training datasets and data systems, and design and implement methods for sourcing, curating, and analyzing data for quality and performance. The work spans automated pipelines and human-in-the-loop processes, and you will contribute both scientific insights and production-grade code. It is an ideal opportunity for people who are passionate about the intersection of data, machine learning, and systems and eager to shape the future of AI.

Responsibilities

Design and implement techniques for curating, sourcing, and filtering large-scale text, code, and multimodal data.
Develop data quality metrics and analysis to measure coverage, diversity, and representativeness across sources.
Collaborate with research and infrastructure teams to scale data processing systems efficiently and reproducibly.
Investigate and mitigate data risks, including privacy, safety, and licensing concerns, to ensure responsible and ethical data use.
Continuously evaluate dataset improvements by analyzing their downstream effects on model learning and behavior.
Publish and present research that moves the entire community forward, sharing code, datasets, and insights.

Requirements

Proficiency in Python and familiarity with at least one deep learning framework (e.g., PyTorch, TensorFlow, or JAX).
Comfortable with debugging distributed training and writing code that scales.
Bachelor’s degree or equivalent experience in Computer Science, Machine Learning, Physics, Mathematics, or a related discipline with strong theoretical and empirical grounding.
Clarity in communication and ability to explain complex technical concepts in writing.
Strong grasp of probability, statistics, and ML fundamentals.
Experience with curation, preprocessing, and analysis of large-scale text, code, or multimodal datasets.
Prior experience in data engineering, dataset construction, or large-scale web data processing for machine learning models.
Experience evaluating or improving training data quality and knowledge of data ethics, safety, and licensing frameworks relevant to AI dataset creation.
Contributions to open datasets, research publications, or data tooling.
PhD in Computer Science, Machine Learning, Physics, Mathematics, or a related discipline with strong theoretical and empirical grounding; or equivalent industry research experience.

Benefits

Generous health, dental, and vision benefits
Unlimited PTO
Paid parental leave
Relocation support
