ML Infra Engineer - Supercomputing

Physical Intelligence builds general-purpose AI for the physical world. Training our models requires orchestrating thousands of accelerators across a heterogeneous fleet of GPU and TPU clusters — spanning different hardware generations, cloud providers, and cluster topologies.

Today, researchers often need to know which cluster to target, what resources are available, and how to configure their jobs accordingly. That doesn't scale. We need a scheduling and compute layer that makes the right placement decision automatically — routing jobs to the best cluster based on availability, hardware fit, cost, and priority — so researchers can focus entirely on the science.

This role owns that problem end-to-end: the scheduling systems, the placement logic, the cluster management layer, and the operational tooling that keeps it all running.

This is not cloud DevOps. It's not about standing up clusters and walking away. It's a systems role for people who care about intelligent resource allocation, utilization, fault tolerance, and making large-scale distributed training seamless.

The Team

The ML Infrastructure team supports and accelerates PI’s core modeling efforts by building the systems that make large-scale training reliable, reproducible, and fast. You will work closely with ML Infra (training systems), data platform, and research teams to ensure compute scheduling is never the bottleneck.

In This Role You Will

- Own Intelligent Job Scheduling and Placement: Design and build multi-tenant scheduling systems that automatically place training jobs on the best available cluster based on hardware requirements, topology, availability, cost, and priority. Support fair resource sharing across teams and projects with quota management, priority tiers, and preemption policies. Abstract away cluster differences so researchers submit jobs without needing to know where they will land.

- Scale Multi-cluster Orchestration: Build the control plane that manages the job lifecycle across diverse clusters (mixed GPU/TPU, multi-generation hardware, on-prem/cloud) and enables seamless job migration, failover, and re-scheduling.

- Optimize Accelerator Utilization and Efficiency: Monitor and optimize GPU/TPU utilization across the entire fleet. Implement priority, preemption, queueing, and fairness policies that balance research velocity with cost efficiency.

- Ensure Scaling and Stability: Implement fault detection, automatic recovery, and resilience for long-running multi-node training jobs. Manage health checking, node management, and scaling to thousands of accelerators.

- Support Inference and Robot Deployment: Extend scheduling and orchestration to inference workloads, including deploying models to edge devices on physical robots.

- Enhance Observability and Developer Experience: Build the dashboards, alerting, SLOs, and debugging tools necessary for researchers to understand job status and for the team to ensure high scheduling quality and cluster reliability.
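The priority and queueing policies mentioned above reduce, at their core, to ordering admission by tier under a capacity constraint. A toy sketch (hypothetical shapes; a real scheduler would also preempt lower tiers rather than only queueing behind them):

```python
import heapq

def schedule(jobs, capacity):
    """Admit jobs by priority tier; jobs that do not fit wait in the queue.

    jobs: list of (priority, name, chips) tuples; higher priority wins.
    Returns (running, queued) lists of job names.
    """
    # Negate priority so the min-heap pops the highest tier first.
    heap = [(-prio, name, chips) for prio, name, chips in jobs]
    heapq.heapify(heap)
    running, queued, free = [], [], capacity
    while heap:
        _, name, chips = heapq.heappop(heap)
        if chips <= free:
            running.append(name)
            free -= chips
        else:
            queued.append(name)
    return running, queued
```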

What We Hope You’ll Bring

We’re intentionally flexible on exact background, but strong candidates usually have:

- Strong software engineering fundamentals

- Experience building or operating job scheduling / resource management systems at scale

- Experience with large-scale compute clusters (GPU and/or TPU)

- Familiarity with schedulers and orchestration systems (SLURM, Kubernetes, GKE, K3S, or internal equivalents)

- Comfort reasoning about resource allocation, bin-packing, priority scheduling, and multi-tenancy

- Understanding of how ML training workloads behave — long-running, multi-node, sensitive to stragglers, topology-dependent

- A bias toward owning systems end-to-end, from design to operation

- An appetite for working closely with researchers and unblocking fast-moving projects
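The bin-packing reasoning listed above is the classic problem of fitting requests onto fixed-size nodes. A first-fit-decreasing sketch (node sizes and requests are made-up numbers; real allocators must also respect topology, e.g. NVLink domains or TPU slice boundaries):

```python
def first_fit_decreasing(requests, node_size):
    """Pack chip requests onto fixed-size nodes, largest request first.

    Returns a list of nodes, each a list of the requests placed on it.
    """
    nodes = []   # placements per node
    free = []    # remaining free chips per node
    for req in sorted(requests, reverse=True):
        for i, f in enumerate(free):
            if req <= f:          # first node with room wins
                free[i] -= req
                nodes[i].append(req)
                break
        else:                     # no node had room: open a new one
            nodes.append([req])
            free.append(node_size - req)
    return nodes
```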

Bonus Points If You Have

- Experience building multi-cluster or federated scheduling systems

- Experience with TPU infrastructure (GCP TPU slices, Multislice, GKE)

- Background in cluster resource managers (Borg, YARN, Mesos, or custom schedulers)

- Linux systems engineering, networking, and infrastructure-as-code

- NCCL/collective communication and topology-aware placement

- Experience with capacity planning and cloud cost optimization at scale

- Familiarity with JAX, PyTorch, or similar ML frameworks at the runtime/systems level

In this role you will help scale and optimize our training systems and core model code. You’ll own critical infrastructure for large-scale training, from managing GPU/TPU compute and job orchestration to building reusable and efficient JAX training pipelines. You’ll work closely with researchers and model engineers to translate ideas into experiments—and those experiments into production training runs.

This is a hands-on, high-leverage role at the intersection of ML, software engineering, and scalable infrastructure.

In This Role You Will

- Own training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, including scheduling, job management, checkpointing, and metrics/logging.

- Scale distributed training: Work with researchers to scale JAX-based training across TPU and GPU clusters with minimal friction.

- Optimize performance: Profile and improve memory usage, device utilization, throughput, and distributed synchronization.

- Enable rapid iteration: Build abstractions for launching, monitoring, debugging, and reproducing experiments.

- Manage compute resources: Ensure efficient allocation and utilization of cloud-based GPU/TPU compute while controlling cost.

- Partner with researchers: Translate research needs into infra capabilities and guide best practices for training at scale.

- Contribute to core training code: Evolve JAX model and training code to support new architectures, modalities, and evaluation metrics.

What We Hope You’ll Bring

- Strong software engineering fundamentals and experience building ML training infrastructure or internal platforms.

- Hands-on large-scale training experience in JAX (preferred) or PyTorch.

- Familiarity with distributed training, multi-host setups, data loaders, and evaluation pipelines.

- Experience managing training workloads on cloud platforms (e.g., SLURM, Kubernetes, GCP TPU/GKE, AWS).

- Ability to debug and optimize performance bottlenecks across the training stack.

- Strong cross-functional communication and ownership mindset.

Bonus Points If You Have

- Deep ML systems background (e.g., training compilers, runtime optimization, custom kernels).

- Experience operating close to hardware (GPU/TPU performance tuning).

- Background in robotics, multimodal models, or large-scale foundation models.

- Experience designing abstractions that balance researcher flexibility with system reliability.

Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.
