Job details

High Performance Computing Software Engineer - Supercomputing

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Your strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

IFM is building the foundational compute infrastructure that will power tomorrow’s breakthroughs in AI and computational science. We’re looking for a High Performance Computing Software Engineer to help us design, develop, and operate the software systems that run our large-scale AI workloads.

In this role, you’ll work at the intersection of high-performance computing and machine learning. You’ll be part of a team responsible for crafting the software stack that enables training of cutting-edge ML models across 1000+ GPUs, and for ensuring our infrastructure is robust, performant, and developer-friendly.

Job Responsibilities

Design and implement high-performance, distributed software solutions for large-scale AI/ML training.
Optimize low-level system components, including the Linux kernel, GPU/accelerator kernels, and interconnects.
Develop and tune communication libraries such as NCCL, RCCL, MPI, and UCX, as well as RDMA-based systems.
Partner with ML researchers and engineers to support frameworks like PyTorch, Megatron-LM, and DeepSpeed in large-scale production environments.
Contribute to our scheduling, orchestration, and job management systems, including Slurm and Kubernetes.
Debug and resolve complex issues across the stack—from kernel to container to model.
Work closely with hardware vendors, upstream open-source communities, and internal teams to drive performance and reliability improvements.

Skills & Experience

Proven experience developing and optimizing software for large-scale ML workloads (1000+ GPUs preferred).
Deep understanding of Linux kernel internals and accelerator (GPU) kernel development.
Proficiency with distributed communication libraries (e.g., NCCL, RCCL, MPI, UCX, SHARP, Libfabric).
Experience with ML frameworks like PyTorch, TensorFlow, JAX, or Megatron-LM.
Strong knowledge of HPC job scheduling and orchestration tools (e.g., Slurm, Kubernetes, Pyxis).
Excellent debugging and systems performance tuning skills.
A collaborative mindset with a focus on shared success and technical excellence.

$150,000 - $300,000 a year

Benefits Include

* Comprehensive medical, dental, and vision benefits
* Bonus
* 401(k) plan
* Generous paid time off, sick leave, and holidays
* Paid parental leave
* Employee Assistance Program
* Life insurance and disability

Funding: Private
Department: Software Engineering
Seniority level requirement: Senior Level
