Infrastructure Engineer, Security

$200k - $475k • San Francisco

Posted 1mo ago

Job Location

San Francisco

Tech Stack

OpenAI, Mistral, Kubernetes, Python, Rust, PyTorch, Terraform, CI/CD, GPU, IAM, mTLS, TPU, VPC

Remote Work Policy

On-site

About the job

Thinking Machines Lab is seeking an infrastructure engineer to lead and strengthen the security infrastructure for its foundation models. The role spans compute, storage, networking, and data platforms, with the goal of keeping systems secure, reliable, and scalable. The engineer will define security controls, architecture, and tooling so that security is built into the platform by default. Close collaboration with research and product teams is key to enabling rapid progress while maintaining robust protection for models, data, and environments.

Responsibilities

Architect security patterns for platforms and services, including network segmentation, service-to-service authentication, RBAC, and policy enforcement in Kubernetes and cloud environments.
Manage identity, access, and secrets for humans and services, covering workload and cross-cloud identity, least-privilege IAM, and secrets management.
Build secure platforms for data ingestion, processing, and curation, implementing classification, encryption, access controls, and safe sharing patterns.
Develop threat models and review designs with researchers and engineers so that features and experiments ship safely and at scale.
Automate security checks and establish guardrails through policy-as-code, secure infrastructure baselines, CI/CD validation, and user-friendly security tools.

Requirements

Bachelor’s degree or equivalent experience in engineering or a related field.
Strong background in containers and orchestration (e.g., Kubernetes) and their security (namespaces, network policies, pod security, admission controls).
Practical experience with Infrastructure as Code (Terraform or similar) for provisioning networks, IAM, and shared services.
Solid understanding of cloud networking and security concepts (VPCs, load balancers, service discovery, mTLS, firewalls, zero-trust architectures).
Proficiency in a systems language such as Rust, plus Python scripting, for building platform components and tools.
Demonstrated experience owning complex, production-critical systems and debugging cross-layer issues.
Experience with ML infrastructure, GPU clusters, or large-scale training environments is preferred.
Background in AI labs, HPC environments, or ML-heavy organizations is preferred.
Experience profiling and tuning high-throughput systems is preferred.
Familiarity with securing specialized hardware (GPUs, TPUs) and their integration into pipelines is preferred.

Benefits

Generous health, dental, and vision benefits
Unlimited PTO
Paid parental leave
Relocation support
