Jobs

Senior Incident Optimization & Reliability Specialist - End-User Technology – Vice President

Citigroup

Chennai, Tamil Nadu, India

On-site

Full-time

2w ago

Position Information

Job Title:

Senior Incident Optimization & Reliability Specialist - End-User Technology

Job Level:

C-13

Department:

Foundational Services - Production Operations

Location:

Chennai, India

Must have 10-16 years of proven experience in infrastructure operations and/or software engineering, primarily as an Incident Optimization Specialist or Site Reliability Engineer (SRE) for End-User Computing.

Position Summary

The Senior Incident Optimization & Reliability Specialist serves as a critical bridge between our Technology Incident Optimization Program and the core End-User Technology domains, including cloud desktop infrastructure, Microsoft productivity tools, content management, and conference/video platforms. This role demands deep technical expertise combined with a strategic, data-driven mindset to drive tactical incident reduction while architecting the future state of intelligent event management and automation for end-user services.

Applying core Site Reliability Engineering (SRE) principles, you will be responsible for maturing our observability posture, building automated incident remediation workflows, and achieving measurable reductions in operational toil. By focusing on intelligent event management, automation, and continuous improvement, you will enhance the reliability and performance of services that are critical to our end-users. This position offers the unique opportunity to shape the future of a highly reliable, automated enterprise environment.

Key Responsibilities

Incident & Alert Analysis:

Conduct comprehensive analysis of alert and incident patterns to identify top sources of operational noise, determine root causes, and develop data-driven strategies for reduction.

Intelligent Event Management:

Design, implement, and optimize rules for event correlation, de-duplication, and suppression on AIOps and event management platforms. Develop domain-specific correlation logic leveraging configuration management data and end-user service topology.

Automation & Toil Reduction:

Architect and develop automation playbooks for incident data enrichment and create self-healing capabilities to reduce manual intervention (toil) for common and recurring end-user technology incident scenarios.

Observability Maturity:

Assess the current observability footprint across all end-user technology domains. Identify gaps and drive enhancements in telemetry, logging, and tracing to provide deeper insights and enable proactive issue detection.

Apply SRE Principles:

Champion and apply core SRE practices to systematically improve service reliability. This includes contributing to the definition of Service Level Objectives (SLOs) and using a data-driven approach to continuous improvement.

Cross-Functional Collaboration:

Partner closely with end-user services, engineering, and platform teams to understand incident drivers, validate correlation logic, and provide expert guidance on event management and reliability best practices.

Quality Assurance:

Continuously validate the effectiveness of implemented rules and automation to ensure no business-impacting alerts are missed. Monitor and report on alert quality metrics and lead iterative improvements.

Required Qualifications

Education:

Bachelor's degree in Computer Science, Information Technology, Computer Engineering, or a related technical field.

Experience:

At least 8 years of hands-on experience in IT operations, end-user computing, or a related field, with proven experience in incident reduction and operational excellence.

Event Management & Incident Reduction:

Demonstrated success in leading event management and incident reduction initiatives with quantifiable results. Direct, hands-on experience with modern AIOps and enterprise event management platforms (e.g., BigPanda) is required.

Technical Expertise:

Deep understanding of end-user technology ecosystems, including VMware-hosted cloud desktop infrastructure, the Microsoft 365 suite (Teams, Outlook, Office), SharePoint, and collaboration platforms.
Expertise with a broad range of domain-specific monitoring and observability tools.

Automation & Orchestration:

Hands-on experience developing robust automation solutions using scripting languages (e.g., Python, PowerShell) and modern automation frameworks to reduce manual tasks.

Data Analysis:

Proficiency in log analysis, pattern recognition, and using query languages for data analysis on log aggregation platforms.

Problem-Solving & Analytical Skills:

Excellent analytical abilities with a systematic approach to troubleshooting complex issues and a holistic view of technology systems.

Communication & Leadership:

Exceptional communication skills with the ability to influence and collaborate effectively across diverse, cross-functional teams.

Preferred Qualifications

An advanced degree (Master's) in a relevant technical field.
Relevant industry certifications (e.g., Microsoft 365, VMware, ITIL).
Experience with Site Reliability Engineering (SRE) practices and applying them in an enterprise context.
Knowledge of ITSM platforms, CMDB management, and infrastructure-as-code (IaC) principles.
Familiarity with financial services regulatory requirements.

Job Family Group:

Technology

Job Family:

Infrastructure

Time Type:

Full time

Most Relevant Skills

Please see the requirements listed above.

Other Relevant Skills

For complementary skills, please see above and/or contact the recruiter.

Citi is an equal opportunity employer, and qualified candidates will receive consideration without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other characteristic protected by law.

If you are a person with a disability and need a reasonable accommodation to use our search tools and/or apply for a career opportunity, review Accessibility at Citi.

View Citi’s EEO Policy Statement and the Know Your Rights poster.

Similar jobs

DEVOPS LEAD L1(CONTRACT)

Wipro · Coimbatore, India

MAJOR INCIDENT MANAGER L2(CONTRACT)

Wipro · Mumbai, India

Track Lead - Terraform,Python,Google Cloud Build,Ansible

HCL Technologies · Others, India

Lead EDA Engineer - Devops

NXP Semiconductors · Bangalore

Lead, Platform Engineering

Mastercard · Pune, India

About Citigroup

Citigroup

Public

Citigroup Inc. or Citi is an American multinational investment bank and financial services company based in New York City. The company was formed in 1998 by the merger of Citicorp, the bank holding company for Citibank, and Travelers; Travelers was spun off from the company in 2002.

10,001+

Employees

New York City

Headquarters

$86B

Valuation

Reviews

3.7

10 reviews

Work-life balance

4.0

Compensation

2.8

Culture

4.2

Career growth

3.5

Management

3.3

68%

Would recommend to a friend

Pros

Good work-life balance

Supportive management and colleagues

Good benefits

Cons

Low/uncompetitive salary and pay

Poor management and lack of direction

Heavy workload and long hours

Salary range

38 data points

Mid/L4

Senior/L5

Staff/L6

Mid/L4 · Business Risk Intermediate Analyst

1 report

$77,165

Total annual compensation

Base salary

$67,100

Stock

Bonus

$77,165

Interview experience

3 interviews

Difficulty

3.3 / 5

Duration

14-28 weeks

Experience

Positive 0%

Neutral 33%

Negative 67%

Interview process

Application Review

HR Screen

Technical Assessment

Hiring Manager Interview

Final Round Interview

Offer Decision

Common questions

Technical Knowledge

Behavioral/STAR

Past Experience

Problem Solving

Culture Fit

News

Citigroup Tokenized Stock (Ondo): Latest News, Social Media Updates and Insights - CryptoRank

CryptoRank

News

3d ago

Citigroup Inc. $C Stock Position Raised by Merit Financial Group LLC - MarketBeat

MarketBeat

News

3d ago

Top Citigroup Insiders Quietly Cash Out Millions in Stock Sales - TipRanks

TipRanks

News

3d ago

Citigroup (C) Valuation Check After Strong Q1 Earnings Beat And Decade High Quarterly Revenue - Yahoo Finance

Yahoo Finance

News

4d ago