
Lead Observability Engineer, Global

About Vantage Data Centers

Vantage Data Centers powers, cools, protects and connects the technology of the world’s well-known hyperscalers, cloud providers and large enterprises. Developing and operating across North America, EMEA and Asia Pacific, Vantage has evolved data center design in innovative ways to deliver dramatic gains in reliability, efficiency and sustainability in flexible environments that can scale as quickly as the market demands.

IT Standards Team

Our team is responsible for helping other technology teams on their automation journey and for developing IT standards that support the IT organization. We embrace many approaches and technologies to speed up the delivery and operation of our data centers. From zero-touch provisioning of network equipment to the deployment of applications on container platforms, we apply our software and operations industry expertise everywhere we can. We question the status quo and are not afraid to suggest new ways of doing things. Individual contributors are encouraged to speak up, propose new insights, and take an active role in defining our roadmap.

Position Overview

This role will be based remotely in the US.

Our team builds and operates the observability platform for Vantage Data Centers, enabling engineering and operations teams to understand system health, performance, and availability across data center and hybrid environments. To support our growth, we are looking for an experienced Observability Engineer with deep hands-on expertise in the Elastic Stack (Elasticsearch, Logstash, and Kibana) and a strong background in creating and operationalizing metrics.

In this role, you will design, implement, and maintain end-to-end observability for logs and metrics: building resilient ingestion pipelines, defining schemas and parsing standards, creating Kibana dashboards and alerting, and partnering with platform, network, and application teams to set SLIs/SLOs and improve operational outcomes. You will continuously improve the performance, reliability, retention, and cost efficiency of our telemetry pipelines while applying automation and infrastructure-as-code practices to keep the platform consistent and auditable.

Essential Job Functions

Design and operate a scalable observability platform with a primary focus on the Elastic Stack (Elasticsearch, Logstash, Kibana)
Build and maintain log ingestion and enrichment pipelines (Logstash) including parsing, normalization, and routing standards
Create, curate, and govern Kibana assets (dashboards, visualizations, Lens, Discover views) that support operations and engineering use cases
Define and implement metrics and alerting standards (SLIs/SLOs, thresholds, burn-rate alerts) to improve detection and reduce MTTR
Develop observability metrics for Operational Technology (OT) environments (e.g., BMS/EPMS/SCADA and other OT telemetry) to track availability, performance, alarms, and operational KPIs
Partner across teams to instrument services and infrastructure, troubleshoot incidents using telemetry, and continuously improve reliability, performance, and cost
Engineer and operate Elasticsearch clusters (or Elastic Cloud) including sizing, scaling, sharding/ILM, retention, backup/restore, and performance tuning
Develop and maintain Logstash pipelines (inputs/filters/outputs) to ingest logs/metrics from servers, network devices, virtualization platforms, containers, and cloud services
Create telemetry standards: field naming conventions, ECS alignment where appropriate, parsing/grok patterns, enrichment lookups, and data quality checks
Build and maintain Kibana dashboards, visualizations, and alerting rules; publish curated views for NOC/operations and engineering teams
Create and operationalize metrics: define SLIs/SLOs, implement metric collection/export, and ensure actionable alerting with runbooks and escalation paths
Partner with facilities/critical infrastructure teams to define OT-focused SLIs/SLOs and metrics (e.g., alarm rates, sensor health, control loop status, device/point availability), normalize and tag OT telemetry, and build dashboards/alerts that support 24x7 operations
Automate configuration and deployment of observability components using infrastructure-as-code and configuration management (e.g., Terraform, Ansible) and CI/CD pipelines
Implement security best practices for telemetry platforms including role-based access control, data handling/PII controls, encryption, and auditability
Participate in incident response and post-incident reviews; use logs and metrics to identify root cause, document findings, and drive preventive improvements

Job Requirements

Bachelor’s degree in Computer Science, Information Technology, Engineering, or equivalent practical experience
7+ years of experience in observability, SRE, platform, or DevOps engineering roles, with demonstrated ownership of production monitoring/logging systems
Deep, hands-on expertise with Elastic Stack: Elasticsearch (cluster operations and tuning), Logstash (pipeline engineering), and Kibana (dashboards and alerting)
Strong experience creating and operationalizing metrics (collection, aggregation, cardinality management, alerting strategy) and defining SLIs/SLOs
Proficiency in at least one scripting/programming language (Python, Ruby, Go, or PowerShell) for automation, data parsing, and platform tooling
Strong knowledge of log and event data modeling, parsing (grok/regex), enrichment, and schema management (e.g., ECS), including troubleshooting ingestion issues end-to-end
Experience with query languages and analysis techniques (KQL/Lucene, Elasticsearch Query DSL) and the ability to build actionable visualizations and detections from telemetry
Hands-on experience with index lifecycle management (ILM), data streams, retention policies, and capacity planning for high-volume telemetry workloads
Experience with infrastructure-as-code and automation (e.g., Terraform, Ansible) and CI/CD practices to deploy and manage observability components
Solid understanding of Linux systems, networking fundamentals, and distributed systems concepts as they relate to telemetry, performance, and troubleshooting
Experience integrating telemetry from hybrid environments (data center infrastructure, virtualization, containers/Kubernetes, and cloud services)
Familiarity with Operational Technology (OT) / Industrial Control Systems (ICS) observability concepts and common BMS/EPMS applications such as Inductive Automation Ignition
Working knowledge of complementary observability tooling (e.g., Beats/Elastic Agent, Prometheus, Grafana, OpenTelemetry) and how to integrate telemetry across systems in line with event management practices
Experience operating services with on-call practices, incident management, and post-incident review processes; ability to write clear runbooks
Strong understanding of reliability engineering concepts including observability design, alert fatigue reduction, and measuring user/system impact
Experience working within ITIL or similar operational practices (change, incident, problem management)
Familiarity with regulatory/compliance expectations and secure handling of operational data (e.g., GDPR, PCI, SOX) as applicable
Excellent written and verbal communication skills; ability to translate telemetry needs into standards, dashboards, and alerts that teams adopt
Ability to work independently and collaboratively across infrastructure, network, security, and application teams
Travel is expected to be less than 5%
Data center industry experience is strongly preferred but not required

Physical Demands and Special Requirements

The physical demands described here are representative of those that must be met by an employee to successfully perform the essential functions of this job. Reasonable accommodation may be made to enable individuals with disabilities to perform the essential functions.

While performing the duties of this job, the employee is occasionally required to stand; walk; sit; use hands to handle or feel objects; reach with hands and arms; climb stairs; balance; stoop or kneel; talk and hear. The employee must occasionally lift and/or move up to 25 pounds.

Additional Details

Salary Range: $135,000 – $145,000 Base + Bonus (this range is based on Colorado market data and may vary in other locations)

This position is eligible for company benefits including but not limited to medical, dental, and vision coverage, life and AD&D, short- and long-term disability coverage, paid time off, employee assistance, participation in a 401k program that includes company match, and many other additional voluntary benefits.

Compensation for the role will depend on a number of factors, including your qualifications, skills, competencies, and experience and may fall outside of the range shown.

We operate with No Ego and No Arrogance. We work to build each other up and support one another, appreciating each other’s strengths and respecting each other’s weaknesses. We find joy in our work and each other, actively seeking opportunities to inject fun into what we do. Our hard and efficient work is rewarded with an above market total compensation package. We offer a comprehensive suite of health and welfare, retirement, and paid leave benefits exceeding local expectations.

Throughout the year, the advantage of being part of the Vantage team is evident with an array of benefits, recognition, training and development, and the knowledge that your contribution adds value to the company and our community.

Don't meet all the requirements? Please still apply if you think you are the right person for the position. We are always keen to speak to people who connect with our mission and values.

Vantage Data Centers is an Equal Opportunity Employer

Vantage Data Centers does not accept unsolicited resumes from search firm agencies. Fees will not be paid in the event a candidate submitted by a recruiter without an agreement in place is hired; such resumes will be deemed the sole property of Vantage Data Centers.

We’ll be accepting applications for at least one week from the date this role is posted. If you’re interested, we encourage you to apply soon; we’re excited to find the right person and will keep the role open until we do!



