High Performance Computing Software Engineer - Supercomputing

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

 

The Role

IFM is building the foundational compute infrastructure that will power tomorrow’s breakthroughs in AI and computational science. We’re looking for a High Performance Computing Software Engineer to help us design, develop, and operate the software systems that run our large-scale AI workloads.

In this role, you’ll work at the intersection of high-performance computing and machine learning. You’ll be part of a team responsible for crafting the software stack that enables training of cutting-edge ML models across 1000+ GPUs and for ensuring our infrastructure is robust, performant, and developer-friendly.

Job Responsibilities

  • Design and implement high-performance, distributed software solutions for large-scale AI/ML training.
  • Optimize low-level system components including Linux kernel, GPU/accelerator kernels, and interconnects.
  • Develop and tune communication libraries such as NCCL, MPI, UCX, RCCL, and RDMA-based systems.
  • Partner with ML researchers and engineers to support frameworks like PyTorch, Megatron-LM, and DeepSpeed in large-scale production environments.
  • Contribute to our scheduling, orchestration, and job management systems, including Slurm and Kubernetes.
  • Debug and resolve complex issues across the stack—from kernel to container to model.
  • Work closely with hardware vendors, upstream open-source communities, and internal teams to drive performance and reliability improvements.

Skills & Experience

  • Proven experience developing and optimizing software for large-scale ML workloads (1000+ GPUs preferred).
  • Deep understanding of Linux kernel internals and accelerator (GPU) kernel development.
  • Proficiency with distributed communication libraries (e.g., NCCL, RCCL, MPI, UCX, SHARP, Libfabric).
  • Experience with ML frameworks like PyTorch, TensorFlow, JAX, or Megatron-LM.
  • Strong knowledge of HPC job scheduling and orchestration tools (e.g., Slurm, Kubernetes, Pyxis).
  • Excellent debugging and systems performance tuning skills.
  • A collaborative mindset with a focus on shared success and technical excellence.


$150,000 - $300,000 a year

Benefits Include

  • Comprehensive medical, dental, and vision benefits
  • Bonus
  • 401K Plan
  • Generous paid time off, sick leave and holidays
  • Paid Parental Leave
  • Employee Assistance Program
  • Life insurance and disability

Employment type: Full-time, onsite
Date posted: April 5, 2026