Research Engineer, Infrastructure, Numerics

$350k - $475k San Francisco

Posted 1mo ago

Remote Work Policy

On-site

Categories

AI Infrastructure Engineer

About the job

Thinking Machines Lab is seeking an infrastructure research engineer to design and build core systems for efficient large-scale model training, with a specific focus on numerics. The role involves strengthening the numerical foundations of the company's distributed training stack: optimizing precision formats, kernels, and communication frameworks to ensure stable, scalable, and fast training of trillion-parameter models. The ideal candidate will bridge research and systems engineering, with a strong understanding of both optimization mathematics and the realities of distributed compute.

Responsibilities

  • Design and optimize distributed training infrastructure for large-scale LLMs, focusing on performance, stability, and reproducibility across multi-GPU and multi-node setups.
  • Implement and evaluate low-precision numerics (e.g., BF16, MXFP8, NVFP4) to improve efficiency without sacrificing model quality.
  • Develop kernels and communication primitives leveraging hardware-level support for mixed and low-precision arithmetic.
  • Collaborate with research teams to co-design model architectures and training recipes aligned with emerging numeric formats and stability constraints.
  • Prototype and benchmark scaling strategies such as data, tensor, and pipeline parallelism, integrating precision-adaptive computation and quantized communication.
  • Contribute to the design of internal orchestration and monitoring systems for efficient and reproducible distributed experiments.
  • Publish and share learnings through internal documentation, open-source libraries, or technical reports to advance scalable AI infrastructure.

Requirements

  • Bachelor’s degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, physics, robotics, or a similar field.
  • Understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.
  • Ability to thrive in a highly collaborative environment with cross-functional partners and subject matter experts.
  • Proactive and initiative-driven mindset to work across different stacks and teams.
  • Strong engineering skills, including the ability to contribute performant, maintainable code and debug complex codebases in areas like floating-point numerics, low-precision arithmetic, and distributed systems.
  • Familiarity with distributed frameworks such as PyTorch/XLA, DeepSpeed, Megatron-LM.
  • Experience implementing FP8, INT8, or block floating-point (MX) formats and understanding their numerical trade-offs.
  • Prior contributions to open-source deep learning infrastructure (e.g., PyTorch, DeepSpeed, XLA).
  • Publications, patents, or projects related to numerical optimization, communication-efficient training, or systems for large models.
  • Experience training and supporting large-scale AI models.
  • Track record of improving research productivity through infrastructure design or process improvements.

Benefits

  • Generous health, dental, and vision benefits
  • Unlimited PTO
  • Paid parental leave
  • Relocation support
