Software Engineer, Systems Generalist

$350k - $475k San Francisco

Posted 2mo ago

Remote Work Policy

On-site

Categories

AI Infrastructure Engineer

About the job

Thinking Machines Lab is seeking generalist infrastructure and systems engineers to build the core systems powering its foundation models and to support internal research and product development teams. This high-impact role involves architecting and scaling critical infrastructure across the full technical stack, solving complex distributed systems problems, and building robust, scalable platforms. You will work directly with researchers to accelerate experiments, improve infrastructure efficiency, and enable key insights across models, products, and data assets.

Responsibilities

  • Architect and scale core infrastructure for foundation models.
  • Build and maintain data systems, including designing and optimizing data pipelines using tools like Spark.
  • Develop tooling, systems, and frameworks to enhance research and engineering productivity.
  • Support teams training, researching, and serving AI models.
  • Build infrastructure for large-scale GPU clusters and Kubernetes environments.
  • Embed governance best practices into scalable, reliable data infrastructure.

Requirements

  • Bachelor's degree or equivalent experience in computer science, engineering, or a related field.
  • Proficiency in at least one backend language, such as Python or Rust.
  • Experience operating large-scale clusters with orchestration or scheduling systems (e.g., Kubernetes or Slurm).
  • Comfort operating across the full technology stack and owning projects end-to-end.
  • Ability to thrive in a highly collaborative environment with cross-functional partners.
  • A proactive approach to identifying and addressing opportunities for improvement.
  • Strong debugging skills across application, OS, and network layers.
  • Proficiency with containers and modern CI/CD practices.
  • Experience with Kubernetes, controllers/operators, or performance profiling.
  • Familiarity with GPU/ML workflows or large-scale data/eval pipelines.

Benefits

  • Generous health, dental, and vision benefits
  • Unlimited PTO
  • Paid parental leave
  • Relocation support
