Research Engineer, Infrastructure, Inference

$350k - $475k • San Francisco

Posted 1mo ago

Job Location

San Francisco

Tech Stack

OpenAI, Mistral, Kubernetes, PyTorch, Deep Learning, JAX, GPU, SGLang, vLLM, Ray, SLURM, Inference

Remote Work Policy

On-site

About the job

Thinking Machines Lab is seeking an infrastructure research engineer to design, optimize, and scale the systems that power large AI models. The goal is to make inference faster, more cost-effective, more reliable, and more reproducible, enabling research teams to focus on advancing model capabilities. This role is crucial for ensuring that every experiment, evaluation, and deployment runs smoothly at scale, with a focus on performant and efficient model inference for both real-world applications and research acceleration.

Responsibilities

Bring cutting-edge AI models into production in collaboration with researchers and engineers.
Enable high-performance inference for novel architectures by collaborating with research teams.
Design and implement new techniques, tools, and architectures to reduce latency and improve throughput and efficiency.
Optimize the codebase and compute fleet (e.g., GPUs) to maximize utilization of hardware FLOPs, bandwidth, and memory.
Extend orchestration frameworks (e.g., Kubernetes, Ray, SLURM) for distributed inference, evaluation, and large-batch serving.
Establish standards for reliability, observability, and reproducibility across the inference stack.
Publish and share learnings through internal documentation, open-source libraries, or technical reports to advance scalable AI infrastructure.

Requirements

Bachelor's degree or equivalent experience in computer science, engineering, or a related field.
Understanding of deep learning frameworks (e.g., PyTorch, JAX) and their system architectures.
Experience with inference serving systems optimized for throughput and latency (e.g., SGLang, vLLM).
Ability to thrive in a highly collaborative environment with cross-functional partners.
Proactive mindset and the initiative to work across different stacks and teams.
Strong engineering skills with the ability to contribute performant, maintainable code and debug complex codebases.
Experience training or supporting large-scale language models (preferred).
Understanding of distributed compute systems, GPU parallelism, and hardware-aware optimizations (preferred).
Contributions to open-source ML or systems infrastructure projects (e.g., SGLang, vLLM, PyTorch, Triton, DeepSpeed, XLA) (preferred).
Track record of improving research productivity through infrastructure design or process improvements (preferred).

Benefits

Generous health, dental, and vision benefits
Unlimited PTO
Paid parental leave
Relocation support
