Software Engineer, Data Infrastructure

$350k - $475k • San Francisco

Posted 1mo ago

Job Location

San Francisco

Tech Stack

OpenAI, Mistral, Python, Rust, PyTorch, Terraform, Airflow, Spark, Kafka, Ray, dbt, Parquet, Delta Lake

Remote Work Policy

On-site

About the job

Thinking Machines Lab is seeking an engineer to join a high-impact team focused on data infrastructure. This role is crucial for architecting and scaling the core systems that power distributed training pipelines, multimodal data catalogs, and intelligent processing of petabytes of data. You will work directly with researchers to accelerate experiments, develop new datasets, enhance infrastructure efficiency, and derive key insights from our data assets. If you are passionate about distributed systems, large-scale data mining, and building foundational tools from the ground up, we encourage you to apply.

Responsibilities

Design, build, and operate scalable, fault-tolerant infrastructure for LLM research, including distributed compute, data orchestration, and storage across modalities.
Develop high-throughput systems for data ingestion, processing, and transformation, covering training data catalogs, deduplication, quality checks, and search.
Build systems for traceability, reproducibility, and robust quality control throughout the data lifecycle.
Implement and maintain monitoring and alerting systems to ensure platform reliability and performance.
Collaborate with research teams to enable new features, improve data quality, and expedite training cycles.

Requirements

Bachelor's degree or equivalent experience in computer science, engineering, or a related field.
Proficiency in at least one backend language, such as Python or Rust.
Fluency in distributed compute frameworks like Apache Spark or Ray.
Deep familiarity with cloud infrastructure, data lake architectures, and batch/streaming pipelines.
Comfort operating across the full stack and owning projects end-to-end.
Ability to thrive in a highly collaborative environment with cross-functional partners and subject matter experts.
A proactive approach with a bias for action to drive initiatives across different stacks and teams.
Hands-on experience with Kafka, dbt, Terraform, and Airflow is preferred.
Experience building a web crawler is a plus.
Extensive experience in scaling deduplication, data mining, and search is beneficial.
Strong knowledge of file formats and storage systems (e.g., Parquet, Delta Lake) and their impact on performance and scalability.
Proactive about documentation, testing, and empowering teammates with good tooling.

Benefits

Generous health, dental, and vision benefits
Unlimited PTO
Paid parental leave
Relocation support
