Research Engineer, Infrastructure, Numerics
$350k - $475k • San Francisco
Posted 1mo ago
Job Location
San Francisco
Remote Work Policy
On-site
Categories
AI Infrastructure Engineer
About the job
Thinking Machines Lab is seeking an infrastructure research engineer to design and build core systems for efficient large-scale model training, with a specific focus on numerics. This role involves strengthening the numerical foundations of its distributed training stack and optimizing precision formats, kernels, and communication frameworks to ensure stable, scalable, and fast training of trillion-parameter models. The ideal candidate will bridge research and systems engineering, with a strong understanding of both optimization mathematics and the realities of distributed compute.
Responsibilities
- Design and optimize distributed training infrastructure for large-scale LLMs, focusing on performance, stability, and reproducibility across multi-GPU and multi-node setups.
- Implement and evaluate low-precision numerics (e.g., BF16, MXFP8, NVFP4) to improve efficiency without sacrificing model quality.
- Develop kernels and communication primitives leveraging hardware-level support for mixed and low-precision arithmetic.
- Collaborate with research teams to co-design model architectures and training recipes aligned with emerging numeric formats and stability constraints.
- Prototype and benchmark scaling strategies such as data, tensor, and pipeline parallelism, integrating precision-adaptive computation and quantized communication.
- Contribute to the design of internal orchestration and monitoring systems for efficient and reproducible distributed experiments.
- Publish and share learnings through internal documentation, open-source libraries, or technical reports to advance scalable AI infrastructure.
Requirements
- Bachelor’s degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, physics, robotics, or a similar field.
- Understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.
- Ability to thrive in a highly collaborative environment with cross-functional partners and subject matter experts.
- A proactive, self-directed mindset and willingness to work across different stacks and teams.
- Strong engineering skills, including the ability to contribute performant, maintainable code and debug complex codebases in areas like floating-point numerics, low-precision arithmetic, and distributed systems.
- Familiarity with distributed frameworks such as PyTorch/XLA, DeepSpeed, Megatron-LM.
- Experience implementing FP8, INT8, or block floating-point (MX) formats and understanding their numerical trade-offs.
- Prior contributions to open-source deep learning infrastructure (e.g., PyTorch, DeepSpeed, XLA).
- Publications, patents, or projects related to numerical optimization, communication-efficient training, or systems for large models.
- Experience training and supporting large-scale AI models.
- Track record of improving research productivity through infrastructure design or process improvements.
Benefits
- Generous health, dental, and vision benefits
- Unlimited PTO
- Paid parental leave
- Relocation support