Research, Pre-Training Data
$350k - $475k • San Francisco
Posted 1mo ago
Job Location
San Francisco
Remote Work Policy
On-site
Categories
AI Research Engineer
About the job
Thinking Machines Lab is seeking pre-training researchers to join its mission of advancing collaborative general intelligence. This role is central to developing the next generation of AI models and blends research with large-scale data engineering. You will assemble pre-training datasets and data systems, and design and implement methods for sourcing, curating, and analyzing data for quality and performance. The position involves working with automated pipelines and human-in-the-loop processes, and contributing both scientific insights and production-grade code. It is an ideal opportunity for people who are passionate about the intersection of data, machine learning, and systems and eager to shape the future of AI.
Responsibilities
- Design and implement techniques for curating, sourcing, and filtering large-scale text, code, and multimodal data.
- Develop data quality metrics and analysis to measure coverage, diversity, and representativeness across sources.
- Collaborate with research and infrastructure teams to scale data processing systems efficiently and reproducibly.
- Investigate and mitigate data risks, including privacy, safety, and licensing concerns, to ensure responsible and ethical data use.
- Continuously evaluate dataset improvements by analyzing their downstream effects on model learning and behavior.
- Publish and present research that moves the entire community forward, sharing code, datasets, and insights.
Requirements
- Proficiency in Python and familiarity with at least one deep learning framework (e.g., PyTorch, TensorFlow, or JAX).
- Comfort with debugging distributed training and writing code that scales.
- Bachelor’s degree or equivalent experience in Computer Science, Machine Learning, Physics, Mathematics, or a related discipline with strong theoretical and empirical grounding.
- Clarity in communication and ability to explain complex technical concepts in writing.
- Strong grasp of probability, statistics, and ML fundamentals.
- Experience with curation, preprocessing, and analysis of large-scale text, code, or multimodal datasets.
- Prior experience in data engineering, dataset construction, or large-scale web data processing for machine learning models.
- Experience evaluating or improving training data quality and knowledge of data ethics, safety, and licensing frameworks relevant to AI dataset creation.
- Contributions to open datasets, research publications, or data tooling.
- PhD in Computer Science, Machine Learning, Physics, Mathematics, or a related discipline with strong theoretical and empirical grounding; or equivalent industry research experience.
Benefits
- Generous health, dental, and vision benefits
- Unlimited PTO
- Paid parental leave
- Relocation support