GE Vernova

Energy technology company

SRE Observability & SLO Engineer

Job function: DevOps

Experience: Mid-level

Location: Hyderabad TS IN 26

Work arrangement: On-site

Employment type: Full-time

Posted: 1 month ago

Required skills

AWS

Kubernetes

Job Description Summary

GE Vernova's GridOS Platform Engineering team is building the next generation of SaaS reliability for critical energy infrastructure. The Observability & SLO Engineer is the eyes and ears of the GridOS SRE team. In this role you will build and own the full telemetry stack, from instrumentation standards to SLO dashboards to synthetic monitors, that gives GE Vernova and its utility customers real-time confidence in the reliability of mission-critical energy management systems. This is a cyclical, high-impact position: you will drive an intensive initial ramp to establish v1.0 observability coverage across all customer environments, then shift into an ongoing improvement cadence aligned to new product releases and customer onboarding.

Job Description

Roles and Responsibilities

Telemetry Standards & Architecture

Implement organization-wide telemetry standards covering metrics, logs, and distributed traces across all GridOS SaaS services.
Implement metrics collection for Kubernetes-hosted services (EKS/Rancher) including pod-level, namespace-level, and cluster-level metrics.
Working with the SRE Lead and SRE Platform Engineers, help define and implement data retention policies, cardinality budgets, and telemetry cost controls to keep observability economically sustainable.
Publish and maintain an Observability Runbook library covering onboarding, alert tuning, and dashboard standards for Platform SRE and Production DevOps teams.

SLO Definition, Tooling & Governance

Partner with product engineering, Platform SRE, and customer stakeholders to define meaningful Service Level Indicators (SLIs) and Service Level Objectives (SLOs) per product and customer tier.
Build and maintain SLO tooling — error budget burn-rate alerts, burn-rate dashboards, and automated SLO compliance reports.
Govern the SLO review cycle: facilitate monthly SLO reviews, identify reliability risks early, and drive prioritization of reliability work with the SRE Lead.
Translate SLOs into SLAs for customer-facing commitments in coordination with the SRE Team Lead.
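
As a rough illustration of the error-budget arithmetic behind these SLO responsibilities, the Python sketch below converts an availability target into allowed downtime and reports how much of the budget is left. The function names, the 30-day window, and the 99.9% target are illustrative assumptions, not GridOS specifics.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    if total_events == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad
```

Under these assumptions, a 99.9% target over 30 days allows roughly 43.2 minutes of downtime, and 500 failed requests out of 1,000,000 spend about half of the budget.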

Dashboards & Alerting

Design and build operational dashboards covering availability, latency, error rates, and saturation (the 'Golden Signals') for every GridOS SaaS product.
Implement alert policies with noise-reduction practices: symptom-based alerting, multi-window burn-rate rules, and alert deduplication.
Create executive-level dashboards for SRE leadership and customer-facing uptime/availability reports aligned to contractual SLAs.
Establish and maintain alert routing, escalation policies, and on-call schedules in coordination with the incident response workflow.
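
The multi-window burn-rate rules mentioned above can be sketched in Python as follows. A rule fires only when both its long and its short window exceed the threshold, which is what suppresses alerts for problems that have already recovered. The thresholds 14.4 and 6 are the values commonly cited in Google's SRE Workbook for a 30-day SLO; the rule names, window labels, and input format are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


@dataclass(frozen=True)
class BurnRateRule:
    name: str
    long_window: str   # key into the error-ratio mapping, e.g. "1h"
    short_window: str  # shorter confirmation window, e.g. "5m"
    threshold: float


# Illustrative rule set (assumed values, not a GridOS policy).
RULES = (
    BurnRateRule("fast_burn", "1h", "5m", 14.4),
    BurnRateRule("slow_burn", "6h", "30m", 6.0),
)


def firing_rules(error_ratios: dict[str, float], slo_target: float) -> list[str]:
    """Return the names of rules whose long AND short windows both exceed the threshold."""
    fired = []
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            fired.append(rule.name)
    return fired
```

In practice the error ratios would come from the metrics backend (for example, a PromQL or CloudWatch query per window); here they are passed in as a plain mapping to keep the sketch self-contained.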

Synthetic Monitoring

Design and implement a synthetic monitoring plan covering critical user journeys for each GridOS SaaS product and customer environment.
Build synthetic checks for API health, UI flows, and integration endpoints using AWS CloudWatch Synthetics or equivalent tooling.
Define alerting thresholds for synthetic monitors and integrate them into the broader incident detection pipeline.

Continuous Improvement Cadence

After v1.0 delivery, transition into a roadmap-aligned improvement cycle: expand coverage for new features, tune alert signal-to-noise, and retire stale monitors.
Conduct periodic observability health reviews to identify gaps in coverage, reduce MTTD (Mean Time to Detect), and improve MTTR (Mean Time to Resolve).
Collaborate with the Production DevOps engineer on FinOps validation by correlating infrastructure cost metrics with performance and reliability data.

Required Experience

3–5 years in SRE, observability engineering, or infrastructure reliability roles.
Deep expertise with at least one major observability platform — Datadog, Grafana + Prometheus, AWS CloudWatch, Dynatrace, or New Relic.
Hands-on experience implementing SLIs, SLOs, and error budget burn-rate alerting in a production SaaS environment.
Strong understanding of distributed systems telemetry: metrics (Prometheus/CloudWatch), structured logging (CloudWatch Logs Insights, ELK), and distributed tracing (OpenTelemetry, AWS X-Ray).
Experience with Kubernetes observability — kube-state-metrics, node exporters, Helm-deployed monitoring stacks, and namespace-level resource metrics.
Proficiency in at least one query/visualization language: PromQL, Splunk SPL, Datadog Query Language, or CloudWatch Logs Insights query syntax.
Experience designing alerting strategies that minimize alert fatigue through symptom-based and burn-rate approaches.
Scripting skills in Python and/or Bash for automation of monitoring configuration and report generation.

Key Skills and Technologies

Cloud Technologies
AWS Cloud Infrastructure: EKS, RDS, MSK, S3, EC2, EBS, SQS, etc.
Kubernetes: EKS, Rancher
Infrastructure as Code: Terraform
Deployment and Configuration Tools: Ansible, Chef, or Puppet
Telemetry standards and tools: OpenTelemetry, CloudWatch, CloudTrail
Observability tools and technology: Datadog, Splunk, New Relic, etc.
Alerting and notification: AWS and Azure alerting notification
Scripting: Go, Python, Groovy, Bash
Strong Linux administration skills
Strong analytical and problem-solving skills

Nice to Have

Familiarity with OpenTelemetry (OTel) for vendor-agnostic instrumentation.
Experience with synthetic monitoring tools — AWS CloudWatch Synthetics, Datadog Synthetics, or Catchpoint.
Knowledge of chaos engineering practices for reliability validation (Chaos Monkey, AWS Fault Injection Simulator).
Exposure to AIOps or ML-driven anomaly detection features within observability platforms.
Experience in regulated industries — energy, utilities, healthcare — where compliance-grade audit trails are required.
AWS certifications: CloudWatch / Observability specialty, Solutions Architect Associate or Professional.

Education Qualification

Bachelor's Degree in Computer Science, Information Management, or a STEM major (Science, Technology, Engineering, and Math) with basic experience.

Leadership:

Influences through others; builds direct and "behind the scenes" support for ideas.
Preemptively sees downstream consequences and effectively tailors influencing strategy to support a positive outcome.
Able to verbalize what is behind decisions and downstream implications.
Continuously reflects on successes and failures to improve performance and decision-making.
Understands and encourages change when needed.
Proactively identifies and removes project obstacles or barriers on behalf of the team.
Able to navigate accountability in a matrixed organization.
Self-starter; communicates and demonstrates a shared sense of purpose. Learns from failure.

Personal Attributes:

Critical thinker; able to quickly adapt to changing environments
A hacker or tinkerer at heart
Risk taker, not afraid to think outside the box or challenge the status quo
Emotional Intelligence, ability to influence up and out and the ability to work independently
Must be a team player with a strong desire to win
Passionate about continuously learning
Highly organized and efficient; able to balance competing priorities and execute accordingly
Strong oral and written communication skills.

Additional Information

Relocation Assistance Provided: Yes

About GE Vernova

Public

GE Vernova, Inc. is an energy equipment manufacturing and services company headquartered in Cambridge, Massachusetts.

Employees: 10,001+

Headquarters: Boston

Company valuation: $16B

Reviews

3.8 (10 reviews)

Work-life balance: 3.2

Compensation: 3.8

Culture: 3.9

Career: 3.4

Management: 3.7

Would recommend to a friend: 65%

Pros

Supportive and approachable management

Excellent benefits and retirement plans

Professional development opportunities

Cons

Heavy workload and frequent overtime

High expectations and stress

Limited growth opportunities

Salary information

118 data points

Senior/L5 · GLOBAL SECURITY DIRECTOR

1 report

$253,000

Total compensation

Base salary

$220,000

Stock

Bonus

$253,000

Interview reviews

4 reviews

Difficulty: 3.3 / 5

Duration: 14-28 weeks

Experience: Positive 0%, Neutral 75%, Negative 25%

Interview process

Application Review

HR Screen

Technical Phone Screen

Hiring Manager Interview

Final Technical Round

Offer

Frequently asked questions

Technical Knowledge

Behavioral/STAR

Past Experience

Coding/Algorithm
