Together AI

AI Infrastructure Engineer

Together AI · San Francisco

As an AI Infrastructure Engineer at Together, you are responsible for keeping all user-facing services and production systems running smoothly. You are a blend of pragmatic operator and software engineer who applies sound engineering principles, operational discipline, and mature automation to our operating environments and codebase. You specialize in systems (operating systems, storage subsystems, networking) while implementing best practices for availability, reliability, and scalability, with varied interests in algorithms and distributed systems.

Responsibilities
- Participate in the on-call rotation (PagerDuty) to respond to production incidents
- Build and run our infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users
- Build monitoring systems to ensure the highest quality service for our customers
- Design and implement operational processes (such as deployments and upgrades)
- Debug production issues across all services and levels of the stack
- Identify improvements for the product architecture from the reliability, performance, and availability perspectives
- Plan the growth of Together AI's infrastructure

Requirements
- 5+ years of professional AI Infra or related experience
- Bachelor's degree in Computer Science or a related field, or equivalent work experience
- Knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes
- Proficiency in programming/scripting languages
- Direct experience in monitoring and observability practices
- Knowledge of cloud services
- Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts

About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey to build the next generation of AI infrastructure.

Compensation
We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $190,000 - $270,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

18h ago

Systems Research Engineer Intern - GPU Programming (Fall 2026)

Together AI · San Francisco

About the Role
As a Systems Research Engineer Intern specializing in GPU Programming, you will play a crucial role in developing and optimizing GPU-accelerated kernels and algorithms for ML/AI applications. Working closely with the modeling and algorithm team, you will co-design GPU kernels and model architecture to enhance the performance and efficiency of our AI systems. Collaborating with the hardware and software teams, you will contribute to the co-design of efficient GPU architectures and programming models, leveraging your expertise in GPU programming and parallel computing. Your research skills will be vital in staying up-to-date with the latest advancements in GPU programming techniques, ensuring that our AI infrastructure remains at the forefront of innovation.

Responsibilities
- Optimize and fine-tune GPU code to achieve better performance and scalability
- Collaborate with cross-functional teams to integrate GPU-accelerated solutions into existing software systems
- Stay up-to-date with the latest advancements in GPU programming techniques and technologies

Requirements
- Strong background in GPU programming and parallel computing, such as CUDA and/or Triton
- Knowledge of ML/AI applications and models
- Knowledge of performance profiling and optimization tools for GPU programming
- Excellent problem-solving and analytical skills

Internship Program Details
Our fall internship program runs 12 to 16 weeks, during which you’ll have the opportunity to work with industry-leading engineers building a cloud from the ground up and possibly contribute to influential open source projects. Our internship dates are September 14th to December 18th.

About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Mamba, FlexGen, Petals, Mixture of Agents, and RedPajama.

Compensation
We offer competitive compensation, housing stipends, and other competitive benefits. The estimated US hourly rate for this role is $58 to $63. Our hourly rates are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

Systems Research Engineer, GPU Programming

Together AI · San Francisco

About the Role
As a Systems Research Engineer specializing in GPU Programming, you will play a crucial role in developing and optimizing GPU-accelerated kernels and algorithms for ML/AI applications. Working closely with the modeling and algorithm team, you will co-design GPU kernels and model architecture to enhance the performance and efficiency of our AI systems. Collaborating with the hardware and software teams, you will contribute to the co-design of efficient GPU architectures and programming models, leveraging your expertise in GPU programming and parallel computing. Your research skills will be vital in staying up-to-date with the latest advancements in GPU programming techniques, ensuring that our AI infrastructure remains at the forefront of innovation.

Requirements
- Strong background in GPU programming and parallel computing, such as CUDA and/or Triton
- Knowledge of ML/AI applications and models
- Knowledge of performance profiling and optimization tools for GPU programming
- Excellent problem-solving and analytical skills
- Bachelor's, Master's, or Ph.D. degree in Computer Science, Electrical Engineering, or equivalent practical experience

Responsibilities
- Optimize and fine-tune GPU code to achieve better performance and scalability
- Collaborate with cross-functional teams to integrate GPU-accelerated solutions into existing software systems
- Stay up-to-date with the latest advancements in GPU programming techniques and technologies

About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers in our journey to build the next generation of AI infrastructure.

Compensation
We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

Staff Machine Learning Engineer, Voice AI

Together AI · San Francisco

About the Role
Together AI is building the best inference infrastructure for voice applications. Our Voice AI platform powers production-grade, real-time voice agents and applications, serving speech-to-text and text-to-speech models with best-in-class latency and reliability.

We're looking for a Staff ML Engineer to drive the model serving layer for voice workloads. You'll work hands-on with inference engines like TRT-LLM and SGLang to optimize how we serve models like Whisper, Parakeet, Orpheus, and Kokoro, pushing latency and throughput to the frontier. You'll profile GPU utilization, design batching strategies for streaming audio, and ensure new model architectures can go from research to production quickly.

This is a foundational hire on a small, high-impact team. Voice inference has unique challenges (streaming audio, tokenization, real-time latency budgets) that require dedicated ML engineering focus. You'll shape how Together serves voice models as the industry moves from pipeline architectures (ASR → LLM → TTS) toward end-to-end speech-to-speech.

- Own the model serving stack that powers Together's voice platform across STT, TTS, and speech-to-speech.
- Work directly with state-of-the-art accelerators (H100s, H200s, B200s) to optimize voice model inference.
- Collaborate with model partners (Cartesia, Deepgram, Rime, and others) to bring their models to production on Together's infrastructure.
- Build quality evaluation frameworks that guide model selection for customers and inform the roadmap.
- Join a small, early-stage team with outsized impact on a fast-growing product area.

Responsibilities
- Own the voice inference roadmap end-to-end: define and execute the technical strategy for optimizing STT, TTS, and speech-to-speech models across Together's infrastructure, with a clear-eyed view of where the field is heading and how to position the platform ahead of it.
- Drive best-in-class inference performance: architect and implement systems targeting leading TTFB, throughput, and GPU utilization for voice workloads; set the performance bar others in the industry measure against, not just catch up to.
- Lead productionization of voice models at scale: design the serving architecture for serverless and dedicated endpoints, including batching strategies, streaming inference pipelines, and memory management tailored to real-time audio; own reliability and latency SLAs.
- Build the voice evaluation platform: design a rigorous, extensible evaluation framework covering WER across accents, languages, and noise conditions for STT, and naturalness, latency, and pronunciation fidelity for TTS; establish the internal benchmark methodology that informs model selection and roadmap decisions.
- Shape the architecture for next-generation model support: anticipate and enable emerging model paradigms (audio-native LLMs, codec-based architectures such as SNAC and Encodec, and end-to-end speech-to-speech systems) before they're mainstream, not after.
- Serve as the technical DRI for model partner integrations: lead deep collaboration with partners such as Cartesia, Deepgram, and Rime; own the full lifecycle from integration to optimization to ongoing performance accountability.
- Diagnose and resolve the hardest performance problems in the stack: conduct systematic profiling and root-cause analysis from GPU kernel behavior to framework-level bottlenecks; drive shipped improvements with documented, measurable impact.
- Influence platform architecture across the organization: partner with platform engineering leadership to ensure the serving layer is built for the latency and reliability demands of real-time voice APIs; your technical decisions should raise the ceiling for the whole team.
- Define and scale voice fine-tuning capabilities: lead the technical direction for enabling customers to fine-tune STT and TTS models on Together's infrastructure, establishing the primitives for differentiated voice experiences.
- Lay technical foundations for a category-defining product surface: architect systems with enough foresight that they support multiple new voice products with minimal rework; think in terms of platforms, not point solutions.

Requirements
- 8+ years of ML engineering experience, with a demonstrated focus on model serving, inference optimization, or ML infrastructure at production scale, including systems you've owned from design through live traffic.
- Deep, practical expertise in LLM serving engines (vLLM, SGLang, TensorRT-LLM, or equivalent): you've modified engine internals, debugged edge cases under load, and contributed improvements back; you don't stop at the API surface.
- Expert-level Python and PyTorch proficiency, with a strong command of GPU optimization (CUDA kernels, memory hierarchies, profiling toolchains) and a track record of turning that knowledge into shipped latency or throughput wins.
- Proven system design judgment: you've made architectural decisions that held up at scale and influenced how a team or platform evolved; you can articulate the tradeoffs you made and why.
- Strong technical leadership: you operate with high autonomy, define the right problems before solving them, and raise the bar for engineering quality around you without requiring process overhead.
- Sharp product intuition for developer tooling: you understand what voice application developers actually need to ship great products, and you let that shape your technical priorities, not just the other way around.
- Proven ability to move fast in ambiguous environments: you've thrived on early-stage or platform teams where scope is wide, ownership is deep, and the roadmap you build is the one you execute.
- Strong foundation in speech and audio ML (ASR/TTS architectures, audio signal processing): directly relevant experience is strongly preferred; exceptional ML engineering fundamentals with genuine curiosity about the domain are also considered.
- Familiarity with audio codec and tokenization schemes (SNAC, Encodec, DAC) is a meaningful plus at this level.
- Experience training or fine-tuning speech models at scale is a significant advantage.
- Bachelor's or Master's in Computer Science, Electrical Engineering, or related field, or equivalent depth demonstrated through your work.

About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey to build the next generation of AI infrastructure.

Compensation
We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $220,000 - $280,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

Staff Engineer, Distributed Storage and HPC & AI Infrastructure

Together AI · San Francisco

About the Role
In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world’s largest AI training and inference workloads. You’ll architect high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization, routinely achieving 30-50% savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing. You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day-to-day, you’ll optimize end-to-end data paths for 10-50 GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel filesystems for AI workloads.

Responsibilities
- Design multi-petabyte AI/ML storage systems; integrate WekaFS, Ceph, etc.; lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing).
- Design/optimize RDMA, InfiniBand, 400GbE networks; tune for max throughput/min latency; implement NVMe-oF/iSCSI; troubleshoot bottlenecks; optimize TCP/IP for storage.
- Build Kubernetes storage operators/controllers; enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas; create reusable Helm/Terraform patterns.
- Deliver 10-50 GB/s per GPU node; optimize caching (weights/datasets/checkpoints), parallel filesystems, and data paths; troubleshoot with profiling tools; scale to thousands of nodes.
- Build multi-tier caches (local NVMe, distributed, object); optimize data locality and model-weight distribution; implement smart prefetching/eviction.
- Implement monitoring, alerting, SLOs; design DR/backups with runbooks; run chaos engineering; ensure 99.9%+ uptime via proactive/automated remediation.
- Partner with ML/SRE teams; mentor on storage best practices; contribute to open-source; write docs, postmortems, and public learnings.

Requirements
- 8+ years in storage engineering, with 3+ years managing distributed storage at multi-petabyte scale
- Proven track record deploying and operating high-performance storage for GPU/HPC clusters
- Deep Kubernetes and cloud-native storage experience in production environments
- Strong coding skills in Go and Python with demonstrated ability to build production-grade tools
- BS/MS in Computer Science, Engineering, or equivalent practical experience
- History of technical leadership: designing systems that significantly improved performance (>3x), reliability (99.9%+ uptime), or cost efficiency
- Distributed Storage Systems: Deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel filesystems at multi-petabyte scale
- Object Storage: Production experience with S3, MinIO, Ceph, or R2, including performance optimization and cost management
- Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers
- Storage optimization for GPU workloads, RDMA/InfiniBand networking, parallel filesystem optimization (100+ GB/s aggregate cluster throughput)
- Programming: Go and Python for automation, operators, and tooling
- Infrastructure as Code: Terraform, Ansible, Helm, GitOps (ArgoCD)
- Linux Storage Stack: Advanced knowledge of filesystems (ext4, xfs), LVM, NVMe optimization, RAID configurations
- Observability: Prometheus, Grafana, Thanos architecture and operations

Nice to Have Skills
- GPU Direct Storage (GDS), NVMe-oF, storage networking (100GbE/400GbE)
- ML/AI storage patterns (model weights, checkpointing, dataset caching)
- Kubernetes operator development (controller-runtime, kubebuilder)
- Storage snapshots, cloning, and thin provisioning
- Backup and disaster recovery (Velero, Restic, cross-region replication)
- Storage encryption (at-rest and in-transit), security and compliance
- Storage benchmarking and profiling tools (fio, iperf3, iostat, blktrace)

About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers in our journey to build the next generation of AI infrastructure.

Compensation
We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is $250,000 - $300,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

Senior Machine Learning Engineer, Voice AI

Together AI · San Francisco

About the Role
Together AI is building the best inference infrastructure for voice applications. Our Voice AI platform powers production-grade, real-time voice agents and applications, serving speech-to-text and text-to-speech models with best-in-class latency and reliability.

We're looking for a Senior ML Engineer to drive the model serving layer for voice workloads. You'll work hands-on with inference engines like TRT-LLM and SGLang to optimize how we serve models like Whisper, Parakeet, Orpheus, and Kokoro, pushing latency and throughput to the frontier. You'll profile GPU utilization, design batching strategies for streaming audio, and ensure new model architectures can go from research to production quickly.

This is a foundational hire on a small, high-impact team. Voice inference has unique challenges (streaming audio, tokenization, real-time latency budgets) that require dedicated ML engineering focus. You'll shape how Together serves voice models as the industry moves from pipeline architectures (ASR → LLM → TTS) toward end-to-end speech-to-speech.

- Own the model serving stack that powers Together's voice platform across STT, TTS, and speech-to-speech.
- Work directly with state-of-the-art accelerators (H100s, H200s, B200s) to optimize voice model inference.
- Collaborate with model partners (Cartesia, Deepgram, Rime, and others) to bring their models to production on Together's infrastructure.
- Build quality evaluation frameworks that guide model selection for customers and inform the roadmap.
- Join a small, early-stage team with outsized impact on a fast-growing product area.

Responsibilities
- Optimize inference performance for voice models (STT, TTS, speech-to-speech), targeting best-in-class TTFB, throughput, and GPU utilization across our curated model set.
- Productionize voice models on serverless and dedicated endpoints, including batching strategies, streaming inference, and memory management tailored to audio workloads.
- Build and maintain a voice model evaluation framework, measuring WER across accents, languages, and noise conditions for STT, and naturalness, latency, and pronunciation accuracy for TTS.
- Enable new model architectures in our serving stack as the field evolves, including audio-native LLMs, codec-based models (SNAC), and speech-to-speech systems.
- Collaborate with model partners (Cartesia, Deepgram, Rime, and others) to integrate and optimize their models running on Together's infrastructure.
- Profile and debug performance across the full inference stack, from GPU kernels to framework-level bottlenecks, and ship measurable improvements.
- Work with the platform engineering side of the team to ensure the serving layer meets the latency and reliability requirements of real-time voice APIs.
- Contribute to voice model fine-tuning capabilities (STT and TTS) as we enable customers to build differentiated voice experiences on Together.
- Lay the groundwork for multiple new products down the line.

Requirements
- 5+ years of experience in ML engineering, with a focus on model serving, inference optimization, or ML infrastructure.
- Hands-on experience with LLM serving engines (vLLM, SGLang, TensorRT-LLM, or similar): comfortable reading and modifying engine internals, not just using APIs.
- Strong proficiency in Python and PyTorch; experience with GPU profiling and optimization (CUDA, memory management, kernel-level debugging).
- Track record of shipping ML systems to production with measurable performance improvements.
- Strong product sense: you think about what developers building voice apps actually need, not just what's technically interesting.
- Comfort working on a small, early-stage team where you'll wear multiple hats and move fast.
- Experience with speech and audio ML (ASR, TTS architectures, audio signal processing) is a strong plus but not required; you can learn this quickly if you have strong ML engineering fundamentals.
- Familiarity with audio codecs and tokenization schemes (SNAC, Encodec, DAC) is a plus.
- Experience training or fine-tuning speech models is a plus.
- Bachelor's or Master's degree in Computer Science, Electrical Engineering, or related field, or equivalent practical experience.

About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey to build the next generation of AI infrastructure.

Compensation
We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $200,000 - $260,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

Research Engineer, Frontier Speculative Decoding

Together AI · San Francisco, New York City

About the Role
Together AI is building the Inference Platform that powers the world's most advanced generative AI models. Your role will be a critical bridge between cutting-edge research and real-world applications, focusing on translating our internal model-training research into production-ready deployments for our customers. This involves a deep commitment to data-centric development, meticulous hyperparameter tuning, and rigorous checkpoint evaluation before models ever hit production.

This role will involve understanding customer-specific needs and fine-tuning models on our internal data recipe and their proprietary data. The goal is to transform general-purpose models into highly performant, specialized tools that solve real business problems. You will not be training foundation models from scratch but rather focusing on creating highly efficient, specialized models by working with dedicated GPU clusters.

Responsibilities
- Design and iterate on novel speculator algorithms, combining architectural innovations with carefully curated data to push the frontier of accuracy-efficiency tradeoffs.
- Be the critical link between raw data and a production-ready model, seeing your work directly impact our customers' success.
- Work in a fast-paced, high-impact role at the cutting edge of generative AI.
- Collaborate with a team of experts dedicated to solving real-world, high-performance challenges.
- Collaborate directly with customers to understand their needs, and work closely with our core inference and Applied ML research teams to integrate your work into the production platform.
- Take part in a culture of deep technical ownership where you are empowered to take on and solve challenging problems.

Requirements
- A genuine love for data curation and processing, with meticulous attention to detail. You believe that great models start with great data.
- Demonstrated ability to perform effective hyperparameter searches and understand the trade-offs involved in tuning models for specific tasks.
- Experience working with and building on top of existing training codebases. You are comfortable navigating complex code and contributing to its improvement.
- Strong attention to detail in evaluating model checkpoints to ensure they meet strict quality, performance, and reliability standards.
- Experience with Python and PyTorch.
- Familiarity with SLURM and/or Kubernetes clusters and experience submitting and managing jobs in a high-performance computing environment.
- Familiarity with modern LLMs and generative models.
- Basic understanding of distributed training frameworks (e.g., FSDP, DeepSpeed).
- Bachelor’s degree, Master’s degree, or Ph.D. in Computer Science, Computer Engineering, or a related field, or equivalent practical experience.

About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, ATLAS, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey to build the next generation of AI infrastructure.

Compensation
We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $190,000 - $270,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

Research Engineer, Core ML

Unknown company· San Francisco

About the Role This is a research engineering role with direct production impact. You won’t be publishing ideas in isolation; you will translate new RL algorithms, scheduling methods, and inference optimizations into production-grade systems that power Together’s API. Success in this role means shipping measurable improvements in latency, throughput, cost, and model quality at scale. We are looking for researchers who enjoy owning systems end-to-end and turning frontier ideas into robust infrastructure. The Core ML (Turbo) team at Together AI sits at the intersection of efficient inference (algorithms, architectures, engines) and post‑training / RL systems. We build and operate the systems behind Together’s API, including high‑performance inference and RL/post‑training engines that can run at production scale. Our mandate is to push the frontier of efficient inference and RL‑driven training: making models dramatically faster and cheaper to run, while improving their capabilities through RL‑based post‑training (e.g., GRPO‑style objectives). This work lives at the interface of algorithms and systems: asynchronous RL, rollout collection, scheduling, and batching all interact with engine design, creating many knobs to tune across the RL algorithm, training loop, and inference stack. Much of the job is modifying production inference systems (for example, SGLang‑ or vLLM‑style serving stacks and speculative decoding systems such as ATLAS), grounded in a strong understanding of post‑training and inference theory, rather than purely theoretical algorithm design. You’ll work across the stack, from RL algorithms and training engines to kernels and serving systems, to build and improve frontier models via RL pipelines. People on this team are often spiky: some are more RL‑first, some are more systems‑first. Depth in one of these areas plus appetite to collaborate across (and grow toward more full‑stack ownership over time) is ideal. 
Responsibilities Advance inference efficiency end‑to‑end Design and prototype algorithms, architectures, and scheduling strategies for low‑latency, high‑throughput inference. Implement and maintain changes in high‑performance inference engines (e.g., SGLang‑ or vLLM‑style systems and Together’s inference stack), including kernel backends, speculative decoding (e.g., ATLAS), quantization, etc. Profile and optimize performance across GPU, networking, and memory layers to improve latency, throughput, and cost. Unify inference with RL / post‑training Design and operate RL and post‑training pipelines (e.g., RLHF, RLAIF, GRPO, DPO‑style methods, reward modeling) where 90+% of the cost is inference, jointly optimizing algorithms and systems. Make RL and post‑training workloads more efficient with inference‑aware training loops—for example, async RL rollouts, speculative decoding, and other techniques that make large‑scale rollout collection and evaluation cheaper. Use these pipelines to train, evaluate, and iterate on frontier models on top of our inference stack. Co‑design algorithms and infrastructure so that objectives, rollout collection, and evaluation are tightly coupled to efficient inference, and quickly identify bottlenecks across the training engine, inference engine, data pipeline, and user‑facing layers. Run ablations and scale‑up experiments to understand trade‑offs between model quality, latency, throughput, and cost, and feed these insights back into model, RL, and system design. Own critical systems at production scale Profile, debug, and optimize inference and post-training services under real production workloads, taking research ideas all the way to stable, measurable improvements in deployed systems. Drive roadmap items that require real engine modification—changing kernels, memory layouts, scheduling logic, and APIs as needed. Establish metrics, benchmarks, and experimentation frameworks to validate improvements rigorously. 
Provide technical leadership (Staff level) Set technical direction for cross‑team efforts at the intersection of inference, RL, and post‑training. Mentor other engineers and researchers on full‑stack ML systems work and performance engineering. Requirements We don’t expect anyone to check every box below. People on this team typically have deep expertise in one or more areas and enough breadth (or interest) to work effectively across the stack. The closer you are to full‑stack (inference + post‑training/RL + systems), the stronger the fit, but being spiky in one area and eager to grow is absolutely okay. You might be a good fit if you: Have a bias toward implementation and shipping: you are excited to modify real engines and services, not just prototype in research code. Have strong expertise in at least one of the following, and are excited to collaborate across (and grow into) the others: Systems‑first profile: Large‑scale inference systems (e.g., SGLang, vLLM, FasterTransformer, TensorRT, custom engines, or similar), GPU performance, distributed serving. RL‑first profile: RL / post‑training for LLMs or large models (e.g., GRPO, RLHF/RLAIF, DPO‑like methods, reward modeling), and using these to train or fine‑tune real models. Model architecture design for Transformers or other large neural nets. Distributed systems / high‑performance computing for ML. Are comfortable working from algorithms to engines: Strong coding ability in Python. Experience profiling and optimizing performance across GPU, networking, and memory layers. Able to take a new sampling method, scheduler, or RL update and turn it into a production‑grade implementation in the engine and/or training stack. Have a solid research foundation in your area(s) of depth: Track record of impactful work in ML systems, RL, or large‑scale model training (papers, open‑source projects, or production systems). 
Can read new RL / post‑training papers, understand their implications on the stack, and design minimal, correct changes in the right layer (training engine vs. inference engine vs. data / API). Operate well as a full‑stack problem solver: You naturally ask: “Where in the stack is this really bottlenecked?” You enjoy collaborating with infra, research, and product teams, and you care about both scientific quality and user‑visible wins. Minimum qualifications 3+ years of experience working on ML systems, large‑scale model training, inference, or adjacent areas (or equivalent experience via research / open source). Advanced degree in Computer Science, EE, or a related field, or equivalent practical experience. Demonstrated experience owning complex technical projects end‑to‑end. If you’re excited about the role and strong in some of these areas, we encourage you to apply even if you don’t meet every single requirement. About Together AI Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancement such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers in our journey in building the next generation AI infrastructure. Compensation We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $200,000 - $280,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge. 
Equal Opportunity Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

Machine Learning, Platform Engineer

Unknown company· San Francisco

About the Role Our team focuses on enabling custom models and dedicated inference on Together. We are responsible for building a container platform, optimizing autoscaling, minimizing cold starts, achieving the best end-to-end model performance, and providing a best-in-class developer experience with great tooling. We often focus on video or audio generation across the stack: CUDA kernels, PyTorch optimization, inference engines, container orchestration, queueing theory, etc. An ideal candidate will be great at profiling/optimization but know the word Kubernetes, or be intimately familiar with multi-cluster scheduling and have some sense of ML bottlenecks. Responsibilities New hires may work on multi-cluster orchestration, portfolio optimization, predictive autoscaling, control planes, model bring-up, model optimization, APIs for managing deployments, inference worker SDKs, and CLI tools. Analyze and improve the robustness and scalability of existing distributed systems, APIs, databases, and infrastructure Partner with product teams to understand functional requirements and deliver solutions that meet business needs Write clear, well-tested, and maintainable software and IaC for both new and existing systems Conduct design and code reviews, create developer documentation, and develop testing strategies for robustness and fault tolerance Requirements 5+ years of demonstrated experience in building large scale, fault tolerant, distributed systems. 
Experience running serverless inference platforms, doing model bring-up on short notice, being on call, or running a cloud provider is a very big plus Good taste and ability to thoughtfully discuss how what you’ve built has failed over time Experience designing, analyzing and improving efficiency, scalability, and stability of various system resources Excellent understanding of low level operating systems concepts including concurrency, networking and storage, performance and scale Expert-level programmer in one or more of Python, Golang, Rust, C++, or Haskell Proficiency in writing and maintaining Infrastructure as Code (IaC) using tools like Terraform Experience with Kubernetes internals or other container orchestration systems Sound judgement for when to use and when to not use LLMs for code Bachelor’s or Master’s degree in Computer Science, Computer Engineering, or a related technical field, or equivalent practical experience Writing-heavy roles or companies are a plus About Together AI Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancement such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure. Compensation We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $250,000 + equity + benefits. Our salary ranges are determined by location, level and role. 
Individual compensation will be determined by experience, skills, and job-related knowledge. Equal Opportunity Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

Machine Learning Engineer - Inference

Unknown company· San Francisco

About the Role Together AI is seeking a Machine Learning Engineer to join our Inference Engine team, focusing on optimizing and enhancing the performance of our AI inference systems. This role involves working with state-of-the-art large language models and ensuring they run efficiently and effectively at scale. If you are passionate about AI inference, PyTorch, and developing high-performance systems, we want to hear from you. This position offers the chance to collaborate closely with AI researchers and engineers to create cutting-edge AI solutions. Join us in shaping the future at Together AI! Responsibilities Design and build the production systems that power the Together AI inference engine, enabling reliability and performance at scale. Develop and optimize runtime inference services for large-scale AI applications. Collaborate with researchers, engineers, product managers, and designers to bring new features and research capabilities to the world. Conduct design and code reviews to ensure high standards of quality. Create services, tools, and developer documentation to support the inference engine. Implement robust and fault-tolerant systems for data ingestion and processing. Requirements 3+ years of experience writing high-performance, well-tested, production-quality code. Proficiency with Python and PyTorch. Demonstrated experience in building high performance libraries and tooling. Excellent understanding of low-level operating systems concepts including multi-threading, memory management, networking, storage, performance, and scale. Preferred: Knowledge of existing AI inference systems such as TGI, vLLM, TensorRT-LLM, Optimum. Preferred: Knowledge of AI inference techniques such as speculative decoding. Preferred: Knowledge of CUDA/Triton programming. Nice to have: Knowledge of Rust, Cython and compilers. About Together AI Together AI is a research-driven artificial intelligence company. 
We believe open and transparent AI systems will drive innovation and create the best outcomes for society. Together, we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI. Our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey to build the next-generation AI infrastructure. Compensation We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Equal Opportunity Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunities to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

Machine Learning Engineer

Unknown company· San Francisco

About the Role Together AI is looking for an ML Engineer who will develop systems and APIs that enable our customers to perform inference and fine tune LLMs. Relevant experience includes implementing runtime systems that perform inference at scale using AI/ML models from simple models up to the largest LLMs. Requirements 5+ years experience writing high-performance, well-tested, production quality code Bachelor’s degree in computer science or equivalent industry experience Familiar with LLM inference ecosystem, including frameworks and engines (e.g. vLLM, SGLang, TRT, ...) Demonstrated experience in building large scale, fault tolerant, distributed systems like storage, search, and computation Expert level programmer in one or more of Python, Go, Rust, or C/C++ Experience implementing runtime inference services at scale or similar Responsibilities Design and build the production systems that power the Together Cloud inference and fine-tuning APIs, enabling reliability and performance at scale Partner with researchers, engineers, product managers, and designers to bring new features and research capabilities to the world Analyze and improve efficiency, scalability, and stability of various system resources Conduct design and code reviews Create services, tools & developer documentation Create testing frameworks for robustness and fault-tolerance Participate in an on-call rotation to respond to critical incidents as needed About Together AI Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancement such as FlashAttention, Hyena, FlexGen, and RedPajama. 
We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure. Compensation We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $220,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Equal Opportunity Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

LLM Inference Frameworks and Optimization Engineer

Unknown company· San Francisco, Singapore, Amsterdam

About the Role At Together.ai, we are building state-of-the-art infrastructure to enable efficient and scalable inference for large language models (LLMs). Our mission is to optimize inference frameworks, algorithms, and infrastructure, pushing the boundaries of performance, scalability, and cost-efficiency. We are seeking an Inference Frameworks and Optimization Engineer to design, develop, and optimize distributed inference engines that support multimodal and language models at scale. This role will focus on low-latency, high-throughput inference, GPU/accelerator optimizations, and software-hardware co-design, ensuring efficient large-scale deployment of LLMs and vision models. This role offers a unique opportunity to shape the future of LLM inference infrastructure, ensuring scalable, high-performance AI deployment across a diverse range of applications. If you're passionate about pushing the boundaries of AI inference, we’d love to hear from you! Responsibilities Inference Framework Development and Optimization Design and develop a fault-tolerant, high-concurrency distributed inference engine for text, image, and multimodal generation models. Implement and optimize distributed inference strategies, including Mixture of Experts (MoE) parallelism, tensor parallelism, and pipeline parallelism for high-performance serving. Apply CUDA graph optimizations, TensorRT/TRT-LLM graph optimizations, PyTorch-based compilation (torch.compile), and speculative decoding to enhance efficiency and scalability. Software-Hardware Co-Design and AI Infrastructure Collaborate with hardware teams on performance bottleneck analysis and co-optimize inference performance for GPUs, TPUs, or custom accelerators. Work closely with AI researchers and infrastructure engineers to develop efficient model execution plans and optimize E2E model serving pipelines. 
Requirements Must-Have: Experience: 3+ years of experience in deep learning inference frameworks, distributed systems, or high-performance computing. Technical Skills: Familiar with at least one LLM inference framework (e.g., TensorRT-LLM, vLLM, SGLang, TGI (Text Generation Inference)). Background knowledge and experience in at least one of the following: GPU programming (CUDA/Triton/TensorRT), compilers, model quantization, and GPU cluster scheduling. Deep understanding of KV cache systems like Mooncake, PagedAttention, or custom in-house variants. Programming: Proficient in Python and C++/CUDA for high-performance deep learning inference. Optimization Techniques: Deep understanding of Transformer architectures and LLM/VLM/Diffusion model optimization. Knowledge of inference optimization techniques, such as workload scheduling, CUDA graphs, compilation, and efficient kernels. Soft Skills: Strong analytical problem-solving skills with a performance-driven mindset. Excellent collaboration and communication skills across teams. Nice-to-Have: Experience in developing software systems for large-scale data center networks with RDMA/RoCE. Familiar with distributed filesystems (e.g., 3FS, HDFS, Ceph). Familiar with open source distributed scheduling/orchestration frameworks, such as Kubernetes (K8S). Contributions to open-source deep learning inference projects. About Together AI Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancement such as FlashAttention, Hyena, FlexGen, and RedPajama. 
We invite you to join a passionate group of researchers in our journey in building the next generation AI infrastructure. Compensation We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Equal Opportunity Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

AI infrastructure Engineer (SRE) Amsterdam

Unknown company· Amsterdam

As an AI Infrastructure Engineer (SRE) at Together, you are responsible for keeping all user-facing services and production systems running smoothly. You are a blend of a pragmatic operator and a software engineer that applies sound engineering principles, operational discipline, and mature automation to our operating environments and codebase. You specialize in systems (operating systems, storage subsystems, networking), while implementing best practices for availability, reliability and scalability, with varied interests in algorithms and distributed systems. Requirements 7+ years of professional SRE or related experience Bachelor's degree in Computer Science or a related field or equivalent work experience Expert knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes Proficiency in programming/scripting languages Direct experience in monitoring and observability practices Advanced knowledge of cloud services Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts Responsibilities Be on an on-call (PagerDuty) rotation to respond to incidents that impact availability Build and run our infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users Build monitoring systems to ensure the highest quality service for our customers Design and implement operational processes (such as deployments and upgrades) Debug production issues across all services and levels of the stack Identify improvements for the product architecture from the reliability, performance and availability perspectives Plan the growth of Together AI’s infrastructure About Together AI Together AI is a research-driven artificial intelligence company. 
We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancement such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure. Equal Opportunity Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

AI Infrastructure Engineer

Unknown company· San Francisco

As an AI Infrastructure Engineer at Together, you are responsible for keeping all user-facing services and production systems running smoothly. You are a blend of a pragmatic operator and a software engineer that applies sound engineering principles, operational discipline, and mature automation to our operating environments and codebase. You specialize in systems (operating systems, storage subsystems, networking), while implementing best practices for availability, reliability and scalability, with varied interests in algorithms and distributed systems. Responsibilities Participate in on-call rotation (PagerDuty) to respond to production incidents Build and run our infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users Build monitoring systems to ensure the highest quality service for our customers Design and implement operational processes (such as deployments and upgrades) Debug production issues across all services and levels of the stack Identify improvements for the product architecture from the reliability, performance and availability perspectives Plan the growth of Together AI's infrastructure Requirements 5+ years of professional AI Infra or related experience Bachelor's degree in Computer Science or a related field or equivalent work experience Knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes Proficiency in programming/scripting languages Direct experience in monitoring and observability practices Knowledge of cloud services Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts About Together AI Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. 
We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancement such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure. Compensation We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $190,000 - $270,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Equal Opportunity Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

18h ago

Systems Research Engineer Intern - GPU Programming (Fall 2026)

Unknown company· San Francisco

About The Role As a Systems Research Engineer Intern specialized in GPU Programming, you will play a crucial role in developing and optimizing GPU-accelerated kernels and algorithms for ML/AI applications. Working closely with the modeling and algorithm team, you will co-design GPU kernels and model architecture to enhance the performance and efficiency of our AI systems. Collaborating with the hardware and software teams, you will contribute to the co-design of efficient GPU architectures and programming models, leveraging your expertise in GPU programming and parallel computing. Your research skills will be vital in staying up-to-date with the latest advancements in GPU programming techniques, ensuring that our AI infrastructure remains at the forefront of innovation. Responsibilities Optimize and fine-tune GPU code to achieve better performance and scalability Collaborate with cross-functional teams to integrate GPU-accelerated solutions into existing software systems Stay up-to-date with the latest advancements in GPU programming techniques and technologies Requirements Strong background in GPU programming and parallel computing, such as CUDA and/or Triton. Knowledge of ML/AI applications and models Knowledge of performance profiling and optimization tools for GPU programming Excellent problem-solving and analytical skills Internship Program Details Our fall internship program spans 12 to 16 weeks, during which you’ll have the opportunity to work with industry-leading engineers building a cloud from the ground up and possibly contribute to influential open source projects. Our internship dates are September 14th to December 18th. About Together AI Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, algorithms, and models. 
We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Mamba, FlexGen, Petals, Mixture of Agents, and RedPajama. Compensation We offer competitive compensation, housing stipends, and other competitive benefits. The estimated US hourly rate for this role is $58 to $63. Our hourly rates are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Equal Opportunity Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

Systems Research Engineer, GPU Programming

Unknown company· San Francisco

About the Role
As a Systems Research Engineer specialized in GPU Programming, you will play a crucial role in developing and optimizing GPU-accelerated kernels and algorithms for ML/AI applications. Working closely with the modeling and algorithm team, you will co-design GPU kernels and model architecture to enhance the performance and efficiency of our AI systems. Collaborating with the hardware and software teams, you will contribute to the co-design of efficient GPU architectures and programming models, leveraging your expertise in GPU programming and parallel computing. Your research skills will be vital in staying up-to-date with the latest advancements in GPU programming techniques, ensuring that our AI infrastructure remains at the forefront of innovation.

Requirements
Strong background in GPU programming and parallel computing, such as CUDA and/or Triton
Knowledge of ML/AI applications and models
Knowledge of performance profiling and optimization tools for GPU programming
Excellent problem-solving and analytical skills
Bachelor's, Master's, or Ph.D. degree in Computer Science, Electrical Engineering, or equivalent practical experience

Responsibilities
Optimize and fine-tune GPU code to achieve better performance and scalability
Collaborate with cross-functional teams to integrate GPU-accelerated solutions into existing software systems
Stay up-to-date with the latest advancements in GPU programming techniques and technologies

About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers in our journey in building the next generation AI infrastructure.

Compensation
We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy

18h ago

Staff Machine Learning Engineer, Voice AI

Unknown company· San Francisco

About the Role
Together AI is building the best inference infrastructure for voice applications. Our Voice AI platform powers production-grade, real-time voice agents and applications, serving speech-to-text and text-to-speech models with best-in-class latency and reliability.

We're looking for a Staff ML Engineer to drive the model serving layer for voice workloads. You'll work hands-on with inference engines like TRT-LLM and SGLang to optimize how we serve models like Whisper, Parakeet, Orpheus, and Kokoro, pushing latency and throughput to the frontier. You'll profile GPU utilization, design batching strategies for streaming audio, and ensure new model architectures can go from research to production quickly.

This is a foundational hire on a small, high-impact team. Voice inference has unique challenges (streaming audio, tokenization, real-time latency budgets) that require dedicated ML engineering focus. You'll shape how Together serves voice models as the industry moves from pipeline architectures (ASR → LLM → TTS) toward end-to-end speech-to-speech.

Own the model serving stack that powers Together's voice platform across STT, TTS, and speech-to-speech.
Work directly with state-of-the-art accelerators (H100s, H200s, B200s) to optimize voice model inference.
Collaborate with model partners (Cartesia, Deepgram, Rime, and others) to bring their models to production on Together's infrastructure.
Build quality evaluation frameworks that guide model selection for customers and inform the roadmap.
Join a small, early-stage team with outsized impact on a fast-growing product area.

Responsibilities
Own the voice inference roadmap end-to-end: define and execute the technical strategy for optimizing STT, TTS, and speech-to-speech models across Together's infrastructure, with a clear-eyed view of where the field is heading and how to position the platform ahead of it.
Drive best-in-class inference performance: architect and implement systems targeting leading TTFB, throughput, and GPU utilization for voice workloads; set the performance bar others in the industry measure against, not just catch up to.
Lead productionization of voice models at scale: design the serving architecture for serverless and dedicated endpoints, including batching strategies, streaming inference pipelines, and memory management tailored to real-time audio; own reliability and latency SLAs.
Build the voice evaluation platform: design a rigorous, extensible evaluation framework covering WER across accents, languages, and noise conditions for STT; naturalness, latency, and pronunciation fidelity for TTS; establish the internal benchmark methodology that informs model selection and roadmap decisions.
Shape the architecture for next-generation model support: anticipate and enable emerging model paradigms (audio-native LLMs, codec-based architectures such as SNAC and Encodec, and end-to-end speech-to-speech systems) before they're mainstream, not after.
Serve as the technical DRI for model partner integrations: lead deep collaboration with partners such as Cartesia, Deepgram, and Rime; own the full lifecycle from integration to optimization to ongoing performance accountability.
Diagnose and resolve the hardest performance problems in the stack: conduct systematic profiling and root-cause analysis from GPU kernel behavior to framework-level bottlenecks; drive shipped improvements with documented, measurable impact.
Influence platform architecture across the organization: partner with platform engineering leadership to ensure the serving layer is built for the latency and reliability demands of real-time voice APIs; your technical decisions should raise the ceiling for the whole team.
Define and scale voice fine-tuning capabilities: lead the technical direction for enabling customers to fine-tune STT and TTS models on Together's infrastructure, establishing the primitives for differentiated voice experiences.
Lay technical foundations for a category-defining product surface: architect systems with enough foresight that they support multiple new voice products with minimal rework; think in terms of platforms, not point solutions.

Requirements
8+ years of ML engineering experience, with a demonstrated focus on model serving, inference optimization, or ML infrastructure at production scale, including systems you've owned from design through live traffic.
Deep, practical expertise in LLM serving engines (vLLM, SGLang, TensorRT-LLM, or equivalent): you've modified engine internals, debugged edge cases under load, and contributed improvements back; you don't stop at the API surface.
Expert-level Python and PyTorch proficiency, with a strong command of GPU optimization (CUDA kernels, memory hierarchies, profiling toolchains) and a track record of turning that knowledge into shipped latency or throughput wins.
Proven system design judgment: you've made architectural decisions that held up at scale and influenced how a team or platform evolved; you can articulate the tradeoffs you made and why.
Strong technical leadership: you operate with high autonomy, define the right problems before solving them, and raise the bar for engineering quality around you without requiring process overhead.
Sharp product intuition for developer tooling: you understand what voice application developers actually need to ship great products, and you let that shape your technical priorities, not just the other way around.
Proven ability to move fast in ambiguous environments: you've thrived on early-stage or platform teams where scope is wide, ownership is deep, and the roadmap you build is the one you execute.
Strong foundation in speech and audio ML (ASR/TTS architectures, audio signal processing): directly relevant experience is strongly preferred; exceptional ML engineering fundamentals with genuine curiosity about the domain are also considered.
Familiarity with audio codec and tokenization schemes (SNAC, Encodec, DAC) is a meaningful plus at this level.
Experience training or fine-tuning speech models at scale is a significant advantage.
Bachelor's or Master's in Computer Science, Electrical Engineering, or related field, or equivalent depth demonstrated through your work.

About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure.

Compensation
We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $220,000 - $280,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy

18h ago

Staff Engineer, Distributed Storage and HPC & AI Infrastructure

Unknown company· San Francisco

About the Role
In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world’s largest AI training and inference workloads. You’ll architect high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization, routinely achieving 30-50% savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing. You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day-to-day, you’ll optimize end-to-end data paths for 10-50 GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel filesystems for AI workloads.

Responsibilities
Design multi-petabyte AI/ML storage systems; integrate WekaFS, Ceph, etc.; lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing).
Design/optimize RDMA, InfiniBand, 400GbE networks; tune for max throughput/min latency; implement NVMe-oF/iSCSI; troubleshoot bottlenecks; optimize TCP/IP for storage.
Build Kubernetes storage operators/controllers; enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas; create reusable Helm/Terraform patterns.
Deliver 10-50 GB/s per GPU node; optimize caching (weights/datasets/checkpoints), parallel filesystems, and data paths; troubleshoot with profiling tools; scale to thousands of nodes.
Build multi-tier caches (local NVMe, distributed, object); optimize data locality and model-weight distribution; implement smart prefetching/eviction.
Implement monitoring, alerting, SLOs; design DR/backups with runbooks; run chaos engineering; ensure 99.9%+ uptime via proactive/automated remediation.
Partner with ML/SRE teams; mentor on storage best practices; contribute to open-source; write docs, postmortems, and public learnings.

Requirements
8+ years in storage engineering with 3+ years managing distributed storage at multi-petabyte scale
Proven track record deploying and operating high-performance storage for GPU/HPC clusters
Deep Kubernetes and cloud-native storage experience in production environments
Strong coding skills in Go and Python with demonstrated ability to build production-grade tools
BS/MS in Computer Science, Engineering, or equivalent practical experience
History of technical leadership: designing systems that significantly improved performance (>3x), reliability (99.9%+ uptime), or cost efficiency
Distributed Storage Systems: Deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel filesystems at multi-petabyte scale
Object Storage: Production experience with S3, MinIO, Ceph, or R2 including performance optimization and cost management
Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers
Storage optimization for GPU workloads, RDMA/InfiniBand networking, parallel filesystem optimization (100+ GB/s aggregate cluster throughput)
Programming: Go and Python for automation, operators, and tooling
Infrastructure as Code: Terraform, Ansible, Helm, GitOps (ArgoCD)
Linux Storage Stack: Advanced knowledge of filesystems (ext4, xfs), LVM, NVMe optimization, RAID configurations
Observability: Prometheus, Grafana, Thanos architecture and operations

Nice to Have Skills
GPUDirect Storage (GDS), NVMe-oF, storage networking (100GbE/400GbE)
ML/AI storage patterns (model weights, checkpointing, dataset caching)
Kubernetes operator development (controller-runtime, kubebuilder)
Storage snapshots, cloning, and thin provisioning
Backup and disaster recovery (Velero, Restic, cross-region replication)
Storage encryption (at-rest and in-transit), security and compliance
Storage benchmarking and profiling tools (fio, iperf3, iostat, blktrace)

About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers in our journey in building the next generation AI infrastructure.

Compensation
We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is: $250,000 - $300,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy

18h ago

Senior Machine Learning Engineer, Voice AI

Unknown company· San Francisco

About the Role
Together AI is building the best inference infrastructure for voice applications. Our Voice AI platform powers production-grade, real-time voice agents and applications, serving speech-to-text and text-to-speech models with best-in-class latency and reliability.

We're looking for a Senior ML Engineer to drive the model serving layer for voice workloads. You'll work hands-on with inference engines like TRT-LLM and SGLang to optimize how we serve models like Whisper, Parakeet, Orpheus, and Kokoro, pushing latency and throughput to the frontier. You'll profile GPU utilization, design batching strategies for streaming audio, and ensure new model architectures can go from research to production quickly.

This is a foundational hire on a small, high-impact team. Voice inference has unique challenges (streaming audio, tokenization, real-time latency budgets) that require dedicated ML engineering focus. You'll shape how Together serves voice models as the industry moves from pipeline architectures (ASR → LLM → TTS) toward end-to-end speech-to-speech.

Own the model serving stack that powers Together's voice platform across STT, TTS, and speech-to-speech.
Work directly with state-of-the-art accelerators (H100s, H200s, B200s) to optimize voice model inference.
Collaborate with model partners (Cartesia, Deepgram, Rime, and others) to bring their models to production on Together's infrastructure.
Build quality evaluation frameworks that guide model selection for customers and inform the roadmap.
Join a small, early-stage team with outsized impact on a fast-growing product area.

Responsibilities
Optimize inference performance for voice models (STT, TTS, speech-to-speech), targeting best-in-class TTFB, throughput, and GPU utilization across our curated model set.
Productionize voice models on serverless and dedicated endpoints, including batching strategies, streaming inference, and memory management tailored to audio workloads.
Build and maintain a voice model evaluation framework, measuring WER across accents, languages, and noise conditions for STT; naturalness, latency, and pronunciation accuracy for TTS.
Enable new model architectures in our serving stack as the field evolves, including audio-native LLMs, codec-based models (SNAC), and speech-to-speech systems.
Collaborate with model partners to integrate and optimize their models (Cartesia, Deepgram, Rime, and others) running on Together's infrastructure.
Profile and debug performance across the full inference stack, from GPU kernels to framework-level bottlenecks, and ship measurable improvements.
Work with the platform engineering side of the team to ensure the serving layer meets the latency and reliability requirements of real-time voice APIs.
Contribute to voice model fine-tuning capabilities (STT and TTS) as we enable customers to build differentiated voice experiences on Together.
Lay the groundwork for multiple new products down the line.

Requirements
5+ years of experience in ML engineering, with a focus on model serving, inference optimization, or ML infrastructure.
Hands-on experience with LLM serving engines (vLLM, SGLang, TensorRT-LLM, or similar): comfortable reading and modifying engine internals, not just using APIs.
Strong proficiency in Python and PyTorch; experience with GPU profiling and optimization (CUDA, memory management, kernel-level debugging).
Track record of shipping ML systems to production with measurable performance improvements.
Strong product sense: you think about what developers building voice apps actually need, not just what's technically interesting.
Comfort working on a small, early-stage team where you'll wear multiple hats and move fast.
Experience with speech and audio ML (ASR, TTS architectures, audio signal processing) is a strong plus but not required; you can learn this quickly if you have strong ML engineering fundamentals.
Familiarity with audio codecs and tokenization schemes (SNAC, Encodec, DAC) is a plus.
Experience training or fine-tuning speech models is a plus.
Bachelor's or Master's degree in Computer Science, Electrical Engineering, or related field, or equivalent practical experience.

About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure.

Compensation
We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $200,000 - $260,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy

18h ago

Research Engineer, Frontier Speculative Decoding

Unknown company· San Francisco, New York City

About the Role
Together AI is building the Inference Platform that powers the world's most advanced generative AI models. Your role will be a critical bridge between cutting-edge research and real-world applications, focusing on translating our internal model training research into production-ready deployments for our customers. This involves a deep commitment to data-centric development, meticulous hyperparameter tuning, and rigorous checkpoint evaluation before models ever hit production.

This role will involve understanding customer-specific needs and fine-tuning models on our internal data recipe and their proprietary data. The goal is to transform general-purpose models into highly performant, specialized tools that solve real business problems. You will not be training foundation models from scratch but rather focusing on creating highly efficient, specialized models by working with dedicated GPU clusters.

Responsibilities
Design and iterate on novel speculator algorithms, combining architectural innovations with carefully curated data to push the frontier of accuracy–efficiency tradeoffs.
Be the critical link between raw data and a production-ready model, seeing your work directly impact our customers' success.
Work in a fast-paced, high-impact role at the cutting edge of generative AI.
Collaborate with a team of experts dedicated to solving real-world, high-performance challenges.
Collaborate directly with customers to understand their needs, and work closely with our core inference and Applied ML research teams to integrate your work into the production platform.
Work in a culture of deep technical ownership where you are empowered to take on and solve challenging problems.

Requirements
A genuine love for data curation and processing, with meticulous attention to detail. You believe that great models start with great data.
Demonstrated ability to perform effective hyperparameter searches and understand the trade-offs involved in tuning models for specific tasks.
Experience working with and building on top of existing training codebases. You are comfortable navigating complex code and contributing to its improvement.
Strong attention to detail in evaluating model checkpoints to ensure they meet strict quality, performance, and reliability standards.
Experience with Python and PyTorch.
Familiarity with SLURM and/or Kubernetes clusters and experience submitting and managing jobs in a high-performance computing environment.
Familiarity with modern LLMs and generative models.
Basic understanding of distributed training frameworks (e.g., FSDP, DeepSpeed).
Bachelor’s degree, Master’s degree, or Ph.D. in Computer Science, Computer Engineering, or a related field, or equivalent practical experience.

About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, ATLAS, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure.

Compensation
We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $190,000 - $270,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy

18h ago

Research Engineer, Core ML

Unknown company· San Francisco

About the Role
This is a research engineering role with direct production impact. You won’t be publishing ideas in isolation; you will translate new RL algorithms, scheduling methods, and inference optimizations into production-grade systems that power Together’s API. Success in this role means shipping measurable improvements in latency, throughput, cost, and model quality at scale. We are looking for researchers who enjoy owning systems end-to-end and turning frontier ideas into robust infrastructure.

The Core ML (Turbo) team at Together AI sits at the intersection of efficient inference (algorithms, architectures, engines) and post-training / RL systems. We build and operate the systems behind Together’s API, including high-performance inference and RL/post-training engines that can run at production scale. Our mandate is to push the frontier of efficient inference and RL-driven training: making models dramatically faster and cheaper to run, while improving their capabilities through RL-based post-training (e.g., GRPO-style objectives).

This work lives at the interface of algorithms and systems: asynchronous RL, rollout collection, scheduling, and batching all interact with engine design, creating many knobs to tune across the RL algorithm, training loop, and inference stack. Much of the job is modifying production inference systems (for example, SGLang- or vLLM-style serving stacks and speculative decoding systems such as ATLAS), grounded in a strong understanding of post-training and inference theory, rather than purely theoretical algorithm design. You’ll work across the stack, from RL algorithms and training engines to kernels and serving systems, to build and improve frontier models via RL pipelines.

People on this team are often spiky: some are more RL-first, some are more systems-first. Depth in one of these areas plus appetite to collaborate across (and grow toward more full-stack ownership over time) is ideal.

Responsibilities

Advance inference efficiency end-to-end
  Design and prototype algorithms, architectures, and scheduling strategies for low-latency, high-throughput inference.
  Implement and maintain changes in high-performance inference engines (e.g., SGLang- or vLLM-style systems and Together’s inference stack), including kernel backends, speculative decoding (e.g., ATLAS), quantization, etc.
  Profile and optimize performance across GPU, networking, and memory layers to improve latency, throughput, and cost.

Unify inference with RL / post-training
  Design and operate RL and post-training pipelines (e.g., RLHF, RLAIF, GRPO, DPO-style methods, reward modeling) where 90%+ of the cost is inference, jointly optimizing algorithms and systems.
  Make RL and post-training workloads more efficient with inference-aware training loops, for example async RL rollouts, speculative decoding, and other techniques that make large-scale rollout collection and evaluation cheaper.
  Use these pipelines to train, evaluate, and iterate on frontier models on top of our inference stack.
  Co-design algorithms and infrastructure so that objectives, rollout collection, and evaluation are tightly coupled to efficient inference, and quickly identify bottlenecks across the training engine, inference engine, data pipeline, and user-facing layers.
  Run ablations and scale-up experiments to understand trade-offs between model quality, latency, throughput, and cost, and feed these insights back into model, RL, and system design.

Own critical systems at production scale
  Profile, debug, and optimize inference and post-training services under real production workloads, taking research ideas all the way to stable, measurable improvements in deployed systems.
  Drive roadmap items that require real engine modification: changing kernels, memory layouts, scheduling logic, and APIs as needed.
  Establish metrics, benchmarks, and experimentation frameworks to validate improvements rigorously.

Provide technical leadership (Staff level)
  Set technical direction for cross-team efforts at the intersection of inference, RL, and post-training.
  Mentor other engineers and researchers on full-stack ML systems work and performance engineering.

Requirements
We don’t expect anyone to check every box below. People on this team typically have deep expertise in one or more areas and enough breadth (or interest) to work effectively across the stack. The closer you are to full-stack (inference + post-training/RL + systems), the stronger the fit, but being spiky in one area and eager to grow is absolutely okay.

You might be a good fit if you:

Have a bias toward implementation and shipping: you are excited to modify real engines and services, not just prototype in research code.

Have strong expertise in at least one of the following, and are excited to collaborate across (and grow into) the others:
  Systems-first profile: Large-scale inference systems (e.g., SGLang, vLLM, FasterTransformer, TensorRT, custom engines, or similar), GPU performance, distributed serving.
  RL-first profile: RL / post-training for LLMs or large models (e.g., GRPO, RLHF/RLAIF, DPO-like methods, reward modeling), and using these to train or fine-tune real models.
  Model architecture design for Transformers or other large neural nets.
  Distributed systems / high-performance computing for ML.

Are comfortable working from algorithms to engines:
  Strong coding ability in Python.
  Experience profiling and optimizing performance across GPU, networking, and memory layers.
  Able to take a new sampling method, scheduler, or RL update and turn it into a production-grade implementation in the engine and/or training stack.

Have a solid research foundation in your area(s) of depth:
  Track record of impactful work in ML systems, RL, or large-scale model training (papers, open-source projects, or production systems).
  Can read new RL / post-training papers, understand their implications on the stack, and design minimal, correct changes in the right layer (training engine vs. inference engine vs. data / API).

Operate well as a full-stack problem solver:
  You naturally ask: “Where in the stack is this really bottlenecked?”
  You enjoy collaborating with infra, research, and product teams, and you care about both scientific quality and user-visible wins.

Minimum qualifications
3+ years of experience working on ML systems, large-scale model training, inference, or adjacent areas (or equivalent experience via research / open source).
Advanced degree in Computer Science, EE, or a related field, or equivalent practical experience.
Demonstrated experience owning complex technical projects end-to-end.

If you’re excited about the role and strong in some of these areas, we encourage you to apply even if you don’t meet every single requirement.

About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers in our journey in building the next generation AI infrastructure.

Compensation
We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $200,000 - $280,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy

18h ago

Machine Learning, Platform Engineer

San Francisco

About the Role Our team focuses on enabling custom models and dedicated inference on Together. We are responsible for building a container platform, optimizing autoscaling, minimizing cold starts, achieving the best end-to-end model performance, and providing a best-in-class developer experience with great tooling. We often focus on video or audio generation across the stack: CUDA kernels, PyTorch optimization, inference engines, container orchestration, queueing theory, etc. An ideal candidate will be great at profiling/optimization but know the word Kubernetes, or be intimately familiar with multi-cluster scheduling and have some sense of ML bottlenecks. Responsibilities New hires may work on multi-cluster orchestration, portfolio optimization, predictive autoscaling, control planes, model bring-up, model optimization, APIs for managing deployments, inference worker SDKs, and CLI tools. Analyze and improve the robustness and scalability of existing distributed systems, APIs, databases, and infrastructure Partner with product teams to understand functional requirements and deliver solutions that meet business needs Write clear, well-tested, and maintainable software and IaC for both new and existing systems Conduct design and code reviews, create developer documentation, and develop testing strategies for robustness and fault tolerance Requirements 5+ years of demonstrated experience in building large scale, fault tolerant, distributed systems. 
Experience running serverless inference platforms, doing model bring-up on short notice, being on call, or running a cloud provider is a very big plus Good taste and ability to thoughtfully discuss how what you’ve built has failed over time Experience designing, analyzing and improving efficiency, scalability, and stability of various system resources Excellent understanding of low level operating systems concepts including concurrency, networking and storage, performance and scale Expert-level programmer in one or more of Python, Golang, Rust, C++, or Haskell Proficiency in writing and maintaining Infrastructure as Code (IaC) using tools like Terraform Experience with Kubernetes internals or other container orchestration systems Sound judgement for when to use and when to not use LLMs for code Bachelor’s or Master’s degree in Computer Science, Computer Engineering, or a related technical field, or equivalent practical experience Writing-heavy roles or companies are a plus About Together AI Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancement such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure. Compensation We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $250,000 + equity + benefits. Our salary ranges are determined by location, level and role. 
Individual compensation will be determined by experience, skills, and job-related knowledge. Equal Opportunity Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

Machine Learning Engineer - Inference

San Francisco

About the Role Together AI is seeking a Machine Learning Engineer to join our Inference Engine team, focusing on optimizing and enhancing the performance of our AI inference systems. This role involves working with state-of-the-art large language models and ensuring they run efficiently and effectively at scale. If you are passionate about AI inference, PyTorch, and developing high-performance systems, we want to hear from you. This position offers the chance to collaborate closely with AI researchers and engineers to create cutting-edge AI solutions. Join us in shaping the future at Together AI! Responsibilities Design and build the production systems that power the Together AI inference engine, enabling reliability and performance at scale. Develop and optimize runtime inference services for large-scale AI applications. Collaborate with researchers, engineers, product managers, and designers to bring new features and research capabilities to the world. Conduct design and code reviews to ensure high standards of quality. Create services, tools, and developer documentation to support the inference engine. Implement robust and fault-tolerant systems for data ingestion and processing. Requirements 3+ years of experience writing high-performance, well-tested, production-quality code. Proficiency with Python and PyTorch. Demonstrated experience in building high performance libraries and tooling. Excellent understanding of low-level operating systems concepts including multi-threading, memory management, networking, storage, performance, and scale. Preferred: Knowledge of existing AI inference systems such as TGI, vLLM, TensorRT-LLM, Optimum. Preferred: Knowledge of AI inference techniques such as speculative decoding. Preferred: Knowledge of CUDA/Triton programming. Nice to have: Knowledge of Rust, Cython and compilers. About Together AI Together AI is a research-driven artificial intelligence company. 
We believe open and transparent AI systems will drive innovation and create the best outcomes for society. Together, we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI. Our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey to build the next-generation AI infrastructure. Compensation We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Equal Opportunity Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunities to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

Machine Learning Engineer

San Francisco

About the Role Together AI is looking for an ML Engineer who will develop systems and APIs that enable our customers to perform inference and fine tune LLMs. Relevant experience includes implementing runtime systems that perform inference at scale using AI/ML models from simple models up to the largest LLMs. Requirements 5+ years experience writing high-performance, well-tested, production quality code Bachelor’s degree in computer science or equivalent industry experience Familiar with LLM inference ecosystem, including frameworks and engines (e.g. vLLM, SGLang, TRT, ...) Demonstrated experience in building large scale, fault tolerant, distributed systems like storage, search, and computation Expert level programmer in one or more of Python, Go, Rust, or C/C++ Experience implementing runtime inference services at scale or similar Responsibilities Design and build the production systems that power the Together Cloud inference and fine-tuning APIs, enabling reliability and performance at scale Partner with researchers, engineers, product managers, and designers to bring new features and research capabilities to the world Analyze and improve efficiency, scalability, and stability of various system resources Conduct design and code reviews Create services, tools & developer documentation Create testing frameworks for robustness and fault-tolerance Participate in an on-call rotation to respond to critical incidents as needed About Together AI Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancement such as FlashAttention, Hyena, FlexGen, and RedPajama. 
We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure. Compensation We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $220,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Equal Opportunity Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

LLM Inference Frameworks and Optimization Engineer

San Francisco, Singapore, Amsterdam

About the Role At Together.ai, we are building state-of-the-art infrastructure to enable efficient and scalable inference for large language models (LLMs). Our mission is to optimize inference frameworks, algorithms, and infrastructure, pushing the boundaries of performance, scalability, and cost-efficiency. We are seeking an Inference Frameworks and Optimization Engineer to design, develop, and optimize distributed inference engines that support multimodal and language models at scale. This role will focus on low-latency, high-throughput inference, GPU/accelerator optimizations, and software-hardware co-design, ensuring efficient large-scale deployment of LLMs and vision models. This role offers a unique opportunity to shape the future of LLM inference infrastructure, ensuring scalable, high-performance AI deployment across a diverse range of applications. If you're passionate about pushing the boundaries of AI inference, we’d love to hear from you! Responsibilities Inference Framework Development and Optimization Design and develop a fault-tolerant, high-concurrency distributed inference engine for text, image, and multimodal generation models. Implement and optimize distributed inference strategies, including Mixture of Experts (MoE) parallelism, tensor parallelism, and pipeline parallelism for high-performance serving. Apply CUDA graph optimizations, TensorRT/TRT-LLM graph optimizations, PyTorch-based compilation (torch.compile), and speculative decoding to enhance efficiency and scalability. Software-Hardware Co-Design and AI Infrastructure Collaborate with hardware teams on performance bottleneck analysis and co-optimize inference performance for GPUs, TPUs, or custom accelerators. Work closely with AI researchers and infrastructure engineers to develop efficient model execution plans and optimize E2E model serving pipelines. 
Requirements Must-Have: Experience: 3+ years of experience in deep learning inference frameworks, distributed systems, or high-performance computing. Technical Skills: Familiar with at least one LLM inference framework (e.g., TensorRT-LLM, vLLM, SGLang, TGI (Text Generation Inference)). Background knowledge and experience in at least one of the following: GPU programming (CUDA/Triton/TensorRT), compilers, model quantization, and GPU cluster scheduling. Deep understanding of KV cache systems like Mooncake, PagedAttention, or custom in-house variants. Programming: Proficient in Python and C++/CUDA for high-performance deep learning inference. Optimization Techniques: Deep understanding of Transformer architectures and LLM/VLM/Diffusion model optimization. Knowledge of inference optimization, such as workload scheduling, CUDA graph, compiled, efficient kernels Soft Skills: Strong analytical problem-solving skills with a performance-driven mindset. Excellent collaboration and communication skills across teams. Nice-to-Have: Experience in developing software systems for large-scale data center networks with RDMA/RoCE Familiar with distributed filesystems (e.g., 3FS, HDFS, Ceph) Familiar with open source distributed scheduling/orchestration frameworks, such as Kubernetes (K8S) Contributions to open-source deep learning inference projects. About Together AI Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancement such as FlashAttention, Hyena, FlexGen, and RedPajama. 
We invite you to join a passionate group of researchers in our journey in building the next generation AI infrastructure. Compensation We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Equal Opportunity Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

AI infrastructure Engineer (SRE) Amsterdam

Amsterdam

As an AI Infrastructure Engineer (SRE) at Together, you are responsible for keeping all user-facing services and production systems running smoothly. You are a blend of a pragmatic operator and a software engineer that applies sound engineering principles, operational discipline, and mature automation to our operating environments and codebase. You specialize in systems (operating systems, storage subsystems, networking), while implementing best practices for availability, reliability and scalability, with varied interests in algorithms and distributed systems. Requirements 7+ years of professional SRE or related experience Bachelor's degree in Computer Science or a related field or equivalent work experience Expert knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes Proficiency in programming/scripting languages Direct experience in monitoring and observability practices Advanced knowledge of cloud services Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts Responsibilities Be on an on-call (PagerDuty) rotation to respond to incidents that impact availability Build and run our infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users Build monitoring systems to ensure the highest quality service for our customers Design and implement operational processes (such as deployments and upgrades) Debug production issues across all services and levels of the stack Identify improvements for the product architecture from the reliability, performance and availability perspectives Plan the growth of Together AI’s infrastructure About Together AI Together AI is a research-driven artificial intelligence company. 
We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancement such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure. Equal Opportunity Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

18h ago

Open positions (26)

AI Infrastructure Engineer

Systems Research Engineer Intern - GPU Programming (Fall 2026)

Systems Research Engineer, GPU Programming

Staff Machine Learning Engineer, Voice AI

Staff Engineer, Distributed Storage and HPC & AI Infrastructure

Senior Machine Learning Engineer, Voice AI

Research Engineer, Frontier Speculative Decoding

Research Engineer, Core ML

Machine Learning, Platform Engineer

Machine Learning Engineer - Inference

Machine Learning Engineer

LLM Inference Frameworks and Optimization Engineer

AI infrastructure Engineer (SRE) Amsterdam
