Software Engineer, Systems Generalist
$350k - $475k • San Francisco
Posted 2mo ago
Job Location
San Francisco
Remote Work Policy
On-site
Categories
AI Infrastructure Engineer
About the job
Thinking Machines Lab is seeking generalist infrastructure and systems engineers to build the core systems that power its foundation models and to support internal research and product development teams. This high-impact role involves architecting and scaling critical infrastructure across the full technical stack, solving complex distributed systems problems, and building robust, scalable platforms. You will work directly with researchers to accelerate experiments, improve infrastructure efficiency, and enable key insights across models, products, and data assets.
Responsibilities
- Architect and scale core infrastructure for foundation models.
- Build and maintain data systems, including designing and optimizing data pipelines using tools like Spark.
- Develop tooling, systems, and frameworks to enhance research and engineering productivity.
- Support the teams that train, research, and serve AI models.
- Build infrastructure for large-scale GPU clusters and Kubernetes environments.
- Embed governance best practices into scalable, reliable data infrastructure.
Requirements
- Bachelor's degree or equivalent experience in computer science, engineering, or a related field.
- Proficiency in at least one backend language, such as Python or Rust.
- Experience operating large-scale clusters and container orchestration systems (e.g., Kubernetes or Slurm).
- Comfort operating across the full technology stack and owning projects end-to-end.
- Ability to thrive in a highly collaborative environment with cross-functional partners.
- A proactive approach to identifying and addressing opportunities for improvement.
- Strong debugging skills across application, OS, and network layers.
- Proficiency in containers and modern CI/CD practices.
- Experience with Kubernetes, controllers/operators, or performance profiling.
- Familiarity with GPU/ML workflows or large-scale data/eval pipelines.
Benefits
- Generous health, dental, and vision benefits
- Unlimited PTO
- Paid parental leave
- Relocation support