Research, Pre-Training Data

$350k - $475k San Francisco

Posted 1mo ago

Remote Work Policy

On-site

Categories

AI Research Engineer

About the job

Thinking Machines Lab is seeking pre-training researchers to join its mission of advancing collaborative general intelligence. This role is central to developing the next generation of AI models and blends research with large-scale data engineering. You will assemble pre-training datasets and data systems, designing and implementing methods for sourcing, curating, and analyzing data for quality and performance. The position involves working with automated pipelines and human-in-the-loop processes, contributing both scientific insights and production-grade code. It is an ideal opportunity for people who are passionate about the intersection of data, machine learning, and systems and eager to shape the future of AI.

Responsibilities

  • Design and implement techniques for curating, sourcing, and filtering large-scale text, code, and multimodal data.
  • Develop data quality metrics and analysis to measure coverage, diversity, and representativeness across sources.
  • Collaborate with research and infrastructure teams to scale data processing systems efficiently and reproducibly.
  • Investigate and mitigate data risks, including privacy, safety, and licensing concerns, to ensure responsible and ethical data use.
  • Continuously evaluate dataset improvements by analyzing their downstream effects on model learning and behavior.
  • Publish and present research that moves the entire community forward, sharing code, datasets, and insights.

Requirements

  • Proficiency in Python and familiarity with at least one deep learning framework (e.g., PyTorch, TensorFlow, or JAX).
  • Comfort with debugging distributed training and writing code that scales.
  • Bachelor’s degree or equivalent experience in Computer Science, Machine Learning, Physics, Mathematics, or a related discipline with strong theoretical and empirical grounding.
  • Clarity in communication and ability to explain complex technical concepts in writing.
  • Strong grasp of probability, statistics, and ML fundamentals.
  • Experience with curation, preprocessing, and analysis of large-scale text, code, or multimodal datasets.
  • Prior experience in data engineering, dataset construction, or large-scale web data processing for machine learning models.
  • Experience evaluating or improving training data quality and knowledge of data ethics, safety, and licensing frameworks relevant to AI dataset creation.
  • Contributions to open datasets, research publications, or data tooling.
  • Preferred: PhD in Computer Science, Machine Learning, Physics, Mathematics, or a related discipline with strong theoretical and empirical grounding, or equivalent industry research experience.

Benefits

  • Generous health, dental, and vision benefits
  • Unlimited PTO
  • Paid parental leave
  • Relocation support

© 2026 AI Job Board. All rights reserved.