Infrastructure Associate Advisor - HIH - Evernorth

Cigna

Hyderabad, India

On-site

Full-time

4d ago

Infrastructure Engineering Associate Advisor

Position Overview

The Pharmacy Benefit Services+ Technology organization is seeking a Site Reliability Engineer (SRE) – Automation, Self‑Healing & AI/AIOps to join our team. This Band 4 Contributor role is a senior, hands‑on position responsible for driving enterprise reliability outcomes, reducing operational toil, and enabling scalable SRE adoption across both legacy platforms and modern cloud‑native systems.

In this role, you will lead the design and implementation of intelligent, automated, and AI‑assisted reliability solutions that ensure systems are resilient, observable, self‑healing, and continuously improving. You will operate at the intersection of software engineering, operations, automation, and AI, influencing how teams design, deploy, and operate production systems.

A core focus of this role is building automation‑first and agentic SRE capabilities, including:

Self‑healing workflows that automatically detect, diagnose, and remediate failures
AI‑driven operational intelligence (AIOps) for anomaly detection, alert correlation, incident triage, and guided remediation
Standardized SRE enablement platforms (SLO automation, reliability scorecards, FMEA workflows) that can be adopted at scale with minimal friction

You will collaborate closely with application teams, platform engineering, DevOps, infrastructure, QE, and IT leadership to embed reliability into the SDLC and runtime operations.

Your contributions will directly support:

Improved system availability and resilience through proactive reliability engineering and automation
Reduced incidents and faster MTTR via self‑healing and AI‑assisted operations
Higher developer productivity by eliminating manual operational toil
Faster, safer releases by integrating SRE controls into CI/CD pipelines
Measurable reliability improvements, including reductions in MTTD/MTTR, decreased incident frequency, improved SLO compliance, and healthier error‑budget consumption through automation and AI‑enabled operations

Responsibilities

Core SRE & Reliability Engineering

Define, implement, and operationalize SRE best practices including SLIs, SLOs, error budgets, reliability reviews, and operational readiness standards across multiple teams and platforms.
Act as a senior reliability engineer for mission‑critical systems, influencing architectural decisions to improve availability, scalability, and fault tolerance.
Lead blameless incident response, root‑cause analysis, and post‑incident reviews, ensuring systemic fixes and automation are prioritized.

Self‑Healing Automation

Design and implement self‑healing systems that:
Automatically detect failures using telemetry and signals
Diagnose probable root causes using rules and AI/ML models
Execute automated remediation actions (restart, scale, reroute, rollback, configuration correction)
Build event‑driven automation workflows integrated with monitoring, CI/CD, and infrastructure platforms to reduce human intervention.
Develop and maintain automated runbooks and remediation pipelines that evolve based on historical incidents and outcomes.
Extend self‑healing automation to change and release workflows, including automated rollback, safeguarded change execution, and AI‑assisted change‑risk evaluation to reduce change‑related incidents.
Continuously identify and eliminate operational toil by replacing repetitive manual work with automation, self‑service tooling, and intelligent remediation.

AI / Agentic SRE (AIOps)

Apply AI and ML techniques to improve operational intelligence, including:
Anomaly detection across metrics, logs, and traces
Alert deduplication, correlation, and noise reduction
Intelligent incident summarization and impact analysis
Predictive failure detection and capacity‑risk forecasting
Implement agentic AI patterns where autonomous or semi‑autonomous agents:
Continuously monitor system health
Propose or execute remediation actions
Learn from past incidents and operator feedback
Establish continuous learning feedback loops to tune AI models and agent behavior based on false positives, incident outcomes, and operator review.
Partner with platform and security teams to ensure responsible, secure, and compliant use of AI in production operations.

Observability & Telemetry

Establish observability‑by‑design standards for applications and platforms (metrics, logs, traces, events).
Improve signal quality and alerting strategies to focus on user and business impact, not infrastructure noise.
Build and maintain reliability dashboards and scorecards that provide real‑time and historical insights into service health.

Resilience, Performance & Validation

Drive resilience and fault‑tolerance validation using chaos engineering and controlled failure injection.
Partner with performance and platform teams to ensure systems meet performance, scalability, and recovery objectives.
Promote safe testing practices for legacy‑integrated systems (e.g., service virtualization where direct backend calls pose risk).

CI/CD & Platform Enablement

Embed SRE controls into CI/CD pipelines, including:
SLO validation gates
Automated canary analysis
Release health checks and rollback triggers
Build reusable SRE platforms, templates, and onboarding kits that enable teams to adopt reliability practices with minimal manual effort.
Mentor engineers and act as a technical leader for SRE adoption across the organization.

Qualifications

Required Skills:

Site Reliability Engineering: Deep hands‑on experience with SLOs, error budgets, incident management, and production operations.
Automation & Software Engineering: Strong development skills in Python, Go, Java, or similar, with the ability to build production‑grade automation and services.
Self‑Healing Systems: Proven experience designing and implementing automated remediation and closed‑loop recovery workflows.
AI / AIOps: Experience applying AI/ML to operations, such as anomaly detection, alert correlation, predictive analysis, or intelligent remediation.
Observability: Expertise with platforms such as Dynatrace, Prometheus, Grafana, Splunk, AppDynamics, or equivalent.
Cloud & Distributed Systems: Strong understanding of AWS / Azure / GCP, microservices, and Kubernetes / OpenShift.
CI/CD & DevOps: Experience integrating reliability checks and automation into delivery pipelines.
Infrastructure as Code: Terraform, CloudFormation, or similar.
Legacy + Modern Engineering: Ability to support and modernize reliability practices across monoliths, batch jobs, messaging, and mainframe‑integrated systems.
Leadership & Influence: Ability to lead through influence, mentor others, and drive adoption across multiple teams.

Required Experience & Education:

Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience).
7+ years of experience in SRE, DevOps, platform engineering, or production software engineering roles.
Demonstrated success delivering enterprise‑scale automation, self‑healing, and reliability improvements.

Desired Experience:

Experience building or contributing to enterprise SRE enablement platforms (SLO automation, reliability scorecards, FMEA workflows).
Hands‑on experience with chaos engineering and resilience testing in production‑like environments.
Familiarity with ServiceNow / CMDB / service modeling to support operational readiness and dependency visibility.
Experience applying Generative AI for operational use cases such as runbook generation, incident summarization, and knowledge retrieval.
Demonstrated delivery of quantifiable reliability improvements (e.g., MTTR reduction, incident volume reduction, improved SLO adherence).
Experience mentoring engineers and shaping an automation‑first, reliability‑driven culture.

Location & Hours of Work

Full-time position, working 40 hours per week. Overlap with US working hours is expected as appropriate.
Primarily based in the Innovation Hub in Hyderabad, India, in a hybrid working model (3 days working from the office and 2 days working at home).

Equal Opportunity Statement

Evernorth is an Equal Opportunity Employer actively encouraging and supporting organization-wide involvement of staff in diversity, equity, and inclusion efforts to educate, inform and advance both internal practices and external work with diverse client populations.

About Evernorth Health Services

Evernorth Health Services, a division of The Cigna Group, creates pharmacy, care and benefit solutions to improve health and increase vitality. We relentlessly innovate to make the prediction, prevention and treatment of illness and disease more accessible to millions of people. Join us in driving growth and improving lives.

About Cigna

Cigna (Public)

The Cigna Group is an American multinational for-profit managed healthcare and insurance company based in Bloomfield, Connecticut.

Employees: 10,001+
Headquarters: Bloomfield
Valuation: $54B

Reviews

Overall rating: 3.7 (10 reviews)

Work-life balance: 4.2
Compensation: 2.8
Culture: 4.1
Career: 3.5
Management: 3.2

Recommend to a friend: 65%

Pros

Supportive and encouraging management

Excellent work-life balance and flexible hours

Great health benefits and vacation time

Cons

Below market compensation and low pay

Poor management and lack of transparency

High workload and stress levels

Salary Ranges

38 data points

L2 · Cybersecurity Analyst L2 (0 reports)

Total per year: $66,170
Base: $26,468
Stock: $33,085
Bonus: $6,617
Range: $46,319 to $86,021

Interview experience

4 interviews

Difficulty: 2.8 / 5
Duration: 14-28 weeks
Offer rate: 50%
Experience: Positive 50%, Neutral 0%, Negative 50%

Interview process

Application Review

Recruiter Screen

Technical Phone Screen

Team Member Interviews

Panel/Multiple Interviews

Offer

Common questions

Coding/Algorithm

Technical Knowledge

Behavioral/STAR

Past Experience

Culture Fit
