Job details

Senior Site Reliability Engineer

Who We Are

Hyperbolic Labs is on a mission to democratize AI by breaking down the barriers to computing power with our Open-Access AI Cloud. By aggregating computing resources across the globe, we offer an innovative GPU marketplace and AI inference service that promise affordability and accessibility for all. As pioneers at the intersection of AI and open-source technology, we believe in an open future where AI innovation is limited only by imagination, not by access to resources. We're looking for forward-thinking individuals who share our passion for making AI universally accessible, secure, and affordable. Join us in building a platform that empowers innovators everywhere to turn their visionary AI projects into reality.

As we prepare for growth after our Series A, our team — led by co-founders with PhDs in AI, Math, and Computer Science — is poised to redefine computing.

About the Role

We're seeking a Senior Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure operate with exceptional reliability, performance, and security. Because we aggregate compute resources from hundreds of global suppliers, meeting our SLOs, maintaining customer trust, and staying economically efficient are all critical to the product. You'll be responsible for defining and maintaining service level objectives for job success rates, building robust incident response systems, managing capacity across our distributed GPU network, and implementing secure rollout and rollback mechanisms that keep our platform running smoothly 24/7.

In this role, you'll establish the reliability standards that define customer trust in our platform, design monitoring and alerting systems that provide deep visibility into our infrastructure, build automation for capacity management and resource allocation, lead incident response and post-mortem processes, and work closely with engineering teams to improve system resilience. You'll also focus on security and infrastructure hardening, ensuring strong isolation between tenants and suppliers, implementing key management systems, and building compliance frameworks. This is a high-impact position where your work directly influences our ability to deliver on our promise of affordable, accessible AI compute at scale.

Who You Are

Expert in site reliability engineering with proven experience defining, monitoring, and maintaining SLOs and SLAs for production systems
Strong background in capacity planning and management, including forecasting, resource allocation, and cost optimization for distributed systems
Experienced in incident response, on-call rotations, and post-mortem processes with a track record of reducing MTTR and improving system resilience
Deep knowledge of deployment systems including progressive rollouts, canary deployments, feature flags, and automated rollback mechanisms
Proficient in observability tools and practices including metrics, logging, tracing, and alerting systems (Prometheus, Grafana, ELK stack, or similar)
Strong understanding of infrastructure security including tenant isolation, workload isolation, network segmentation, and security hardening
Experience with secrets management, key management systems (KMS), certificate management, and secure credential rotation
Knowledge of compliance frameworks and security best practices for cloud platforms (SOC 2, ISO 27001, or similar)
Excellent problem-solving skills with the ability to debug complex distributed systems issues under pressure
Strong automation mindset with experience using infrastructure-as-code, configuration management, and CI/CD pipelines

Preferred Qualifications

Experience operating GPU infrastructure, AI/ML platforms, or compute marketplaces at scale
Background in distributed systems, peer-to-peer networks, or decentralized infrastructure
Knowledge of multi-tenancy security patterns, container security, and runtime security tools
Experience with chaos engineering, fault injection, and resilience testing
Familiarity with cost optimization strategies for cloud infrastructure and GPU resources
Experience building and operating systems with demanding uptime requirements (99.9%+ SLAs)
Background at companies like AWS, Google Cloud, Azure, or fast-growing infrastructure startups
Contributions to open-source reliability, observability, or security tools

Hyperbolic is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Average salary estimate

$210,000 / year (est.)

Min: $160,000

Max: $260,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

Hyperbolic Labs

2 jobs

FUNDING

Seed

DEPARTMENTS

Software Engineering

SENIORITY LEVEL REQUIREMENT

Senior Level
