Site Reliability Engineer (SRE)

$350k - $475k • San Francisco

Posted 1mo ago

Job Location

San Francisco

Tech Stack

OpenAI, Mistral, Kubernetes, Fine-Tuning, PyTorch, Distributed Training, CI/CD, GPU, Cloud Infrastructure, Distributed Systems, Observability, Incident Response, LoRA

Remote Work Policy

On-site

About the job

Thinking Machines Lab is seeking a Site Reliability Engineer (SRE) to ensure the end-to-end reliability of its Tinker platform. The role involves working closely with engineering and research teams to make every layer of the system more robust and resilient. The SRE will maintain and improve the infrastructure that supports custom AI model fine-tuning, so that researchers and developers have a seamless experience.

Responsibilities

Define and own end-to-end reliability, including CI/CD, production observability, and incident response.
Develop Service Level Objectives for distributed training systems, balancing reliability and latency with development speed.
Design and implement monitoring and observability across the entire training pipeline.
Lead incident response for platform issues, ensuring quick recovery, thorough reviews, and preventative improvements.
Harden multi-tenant isolation and resource scheduling for efficient co-scheduling of LoRA-based workloads without compromising reliability or data separation.
Collaborate with security teams to address production vulnerabilities.

Requirements

Bachelor's degree or equivalent experience in computer science, engineering, or a related field.
Experience in distributed systems, cloud infrastructure, or site reliability engineering.
Proficiency in writing software for reliability, including tooling and automation.
Experience with production incident response, postmortems, and systematic reliability improvement.
Strong communication and coordination skills across engineering and research teams.
Deep experience operating production cloud services at scale (preferred).
Background in distributed training frameworks and their infrastructure failure modes (preferred).
Track record of building checkpoint and recovery systems for long-running distributed jobs (preferred).
Expertise in operating Kubernetes at scale, including deploying, managing, debugging, and tuning clusters for heterogeneous GPU workloads (preferred).

Benefits

Generous health, dental, and vision benefits
Unlimited PTO
Paid parental leave
Relocation support
