Site Reliability Engineer (SRE)

$350k - $475k, San Francisco

Posted 1mo ago

About the job

Thinking Machines Lab is seeking a Site Reliability Engineer (SRE) to ensure the end-to-end reliability of its Tinker platform. The role involves working closely with engineers and research teams to make every layer of the system more robust and resilient. The SRE will maintain and improve the infrastructure that supports custom AI model fine-tuning, ensuring a seamless experience for researchers and developers.

Responsibilities

  • Define and own end-to-end reliability, including CI/CD, production observability, and incident response.
  • Develop Service Level Objectives for distributed training systems, balancing reliability and latency with development speed.
  • Design and implement monitoring and observability across the entire training pipeline.
  • Lead incident response for platform issues, ensuring quick recovery, thorough reviews, and preventative improvements.
  • Harden multi-tenant isolation and resource scheduling for efficient co-scheduling of LoRA-based workloads without compromising reliability or data separation.
  • Collaborate with security teams to address production vulnerabilities.

Requirements

  • Bachelor's degree or equivalent experience in computer science, engineering, or a related field.
  • Experience in distributed systems, cloud infrastructure, or site reliability engineering.
  • Proficiency in writing software for reliability, including tooling and automation.
  • Experience with production incident response, postmortems, and systematic reliability improvement.
  • Strong communication and coordination skills across engineering and research teams.
  • Deep experience operating production cloud services at scale (preferred).
  • Background in distributed training frameworks and their infrastructure failure modes (preferred).
  • Track record of building checkpoint and recovery systems for long-running distributed jobs (preferred).
  • Expertise in operating Kubernetes at scale, including deploying, managing, debugging, and tuning clusters for heterogeneous GPU workloads (preferred).

Benefits

  • Generous health, dental, and vision benefits
  • Unlimited PTO
  • Paid parental leave
  • Relocation support

© 2026 AI Job Board. All rights reserved.