Research Engineer, Infrastructure, Numerics

$350k - $475k • San Francisco

Posted 1mo ago

Job Location

San Francisco

Tech Stack

OpenAI, Mistral, PyTorch, Deep Learning, JAX, Distributed Training, BF16, MXFP8, NVFP4, FP8, INT8, PyTorch/XLA, DeepSpeed, Megatron-LM

Remote Work Policy

On-site

About the job

Thinking Machines Lab is seeking an infrastructure research engineer to design and build core systems for efficient large-scale model training, with a specific focus on numerics. The role involves strengthening the numerical foundations of the company's distributed training stack by optimizing precision formats, kernels, and communication frameworks so that trillion-parameter models train stably, scalably, and quickly. The ideal candidate will bridge research and systems engineering, with a strong understanding of both optimization mathematics and the practical realities of distributed compute.

Responsibilities

Design and optimize distributed training infrastructure for large-scale LLMs, focusing on performance, stability, and reproducibility across multi-GPU and multi-node setups.
Implement and evaluate low-precision numerics (e.g., BF16, MXFP8, NVFP4) to improve efficiency without sacrificing model quality.
Develop kernels and communication primitives leveraging hardware-level support for mixed and low-precision arithmetic.
Collaborate with research teams to co-design model architectures and training recipes aligned with emerging numeric formats and stability constraints.
Prototype and benchmark scaling strategies such as data, tensor, and pipeline parallelism, integrating precision-adaptive computation and quantized communication.
Contribute to the design of internal orchestration and monitoring systems for efficient and reproducible distributed experiments.
Publish and share learnings through internal documentation, open-source libraries, or technical reports to advance scalable AI infrastructure.

Requirements

Bachelor’s degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, physics, robotics, or a similar field.
Understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.
Ability to thrive in a highly collaborative environment with cross-functional partners and subject matter experts.
Proactive and initiative-driven mindset to work across different stacks and teams.
Strong engineering skills, including the ability to contribute performant, maintainable code and debug complex codebases in areas like floating-point numerics, low-precision arithmetic, and distributed systems.
Familiarity with distributed frameworks such as PyTorch/XLA, DeepSpeed, Megatron-LM.
Experience implementing FP8, INT8, or block floating-point (MX) formats and understanding their numerical trade-offs.
Prior contributions to open-source deep learning infrastructure (e.g., PyTorch, DeepSpeed, XLA).
Publications, patents, or projects related to numerical optimization, communication-efficient training, or systems for large models.
Experience training and supporting large-scale AI models.
Track record of improving research productivity through infrastructure design or process improvements.

Benefits

Generous health, dental, and vision benefits
Unlimited PTO
Paid parental leave
Relocation support
