Citigroup

Global investment banking and financial services

GenAI Site Reliability Engineering Architect - Senior Vice President

Function: DevOps

Level: Executive

Location: Bangalore, Karnataka, India

Work mode: On-site

Type: Full-time

Posted: 2 weeks ago

About the Role

We're seeking an exceptional Site Reliability Engineering Architect to lead the technical vision and operational excellence of our enterprise GenAI platform serving 180,000+ Citi employees globally. This is a senior individual contributor role for someone who wants to architect intelligent, self-healing infrastructure at the intersection of AI and reliability engineering—without the overhead of people management.

You'll work with cutting-edge AI infrastructure including Claude, Gemini, and proprietary Citi models running on OpenShift/Kubernetes, building the next generation of AI-Ops capabilities that transform traditional operations into intelligent, autonomous systems.

About Our Team

Our team operates like a research-driven startup within Citi, rapidly innovating on AI operations while maintaining enterprise-grade reliability, security, and compliance. We build and operate Citi Stylus Workspaces and other mission-critical GenAI platforms that demand exceptional reliability, security, and performance at global scale.

What You'll Do

Platform Architecture & Reliability

Design and architect highly available, GPU-accelerated OpenShift clusters optimized for GenAI workloads
Build Model-as-a-Service platforms enabling seamless LLM hosting, inference, and lifecycle management
Architect multi-cluster, multi-region infrastructure supporting global AI platform availability (99.9%+ SLA)
Implement intelligent resource scheduling and optimization for GPU workloads and AI inference engines

AI-Ops & Intelligent Automation

Design and implement agentic AI workflows for automated incident detection, diagnosis, and remediation
Build Model Context Protocol (MCP) integrations enabling AI-driven operational decision-making
Create self-healing systems leveraging log analysis, anomaly detection, and automated remediation pipelines
Transform operational toil into intelligent automation that learns and adapts

Observability & Performance

Design and implement comprehensive observability stacks with Prometheus and Grafana providing deep visibility into AI workloads
Build custom metrics, exporters, and dashboards for LLM-specific monitoring (token throughput, inference latency, GPU utilization)
Establish SLO/SLI frameworks and error budget management for AI services
Drive performance optimization through data-driven analysis

Platform Engineering & GitOps

Architect and deploy OpenShift operators for AI/ML workloads (OpenShift AI, NVIDIA GPU Operator, Knative)
Design custom Kubernetes operators and controllers for platform-specific automation needs
Architect and maintain GitOps-driven deployment pipelines for multi-cluster AI infrastructure
Manage cluster lifecycle operations including upgrades, patching, and capacity expansion

Technical Leadership

Define technical vision and roadmap for GenAI platform reliability and operational excellence
Lead production incident response, root cause analysis, and blameless post-mortem processes
Provide technical mentorship to SRE and DevOps teams on advanced automation and AI-Ops practices
Partner with engineering, security, and business leaders to align infrastructure strategy with organizational objectives

What You Bring

Core Technical Expertise (Must-Have)

OpenShift & Kubernetes Mastery

5+ years expert-level OpenShift 4.x administration and architecture experience
5+ years deep Kubernetes expertise including custom operators, controllers, and CRDs
Hands-on experience with Red Hat Advanced Cluster Management (RHACM) and multi-cluster operations
Experience designing and implementing Kubernetes operators using Operator SDK or similar frameworks

AI/ML Infrastructure & Operations

Practical experience deploying and operating AI/ML platforms (OpenShift AI, Kubeflow, or similar)
Knowledge of GPU cluster provisioning, NVIDIA GPU Operator, and accelerated computing workloads
Understanding of LLM inference optimization and model serving frameworks (vLLM, TensorRT, ONNX)
Experience with Model-as-a-Service architectures and MLOps lifecycle management

Automation & Infrastructure as Code

5+ years expert-level experience with Terraform and Ansible for infrastructure provisioning and configuration management
Strong scripting skills in Python, Bash, and PowerShell for automation and tooling
Experience with GitOps workflows and declarative infrastructure management
Proficiency with Helm charts and Kubernetes manifest templating

Observability & Reliability Engineering

Deep expertise in Prometheus, Grafana, and metrics-driven reliability engineering
Experience designing custom metrics, exporters, and dashboards for specialized workloads
Knowledge of distributed tracing and log aggregation (Splunk or similar)
Understanding of SLO/SLI frameworks and error budget management

Cloud & Hybrid Infrastructure

Experience with AWS and Azure cloud platforms and hybrid cloud architectures
Knowledge of GPU instance types and cost optimization strategies
Understanding of cloud-native networking, storage, and security patterns
Familiarity with vSphere and on-premises virtualization platforms

Emerging AI-Ops Capabilities (Highly Valued)

Experience implementing agentic AI workflows and autonomous remediation systems
Knowledge of Model Context Protocol (MCP) or similar AI orchestration frameworks
Practical experience with AI-driven anomaly detection and predictive analytics
Familiarity with serverless frameworks (Knative) and event-driven architectures

Professional Experience

15+ years of overall infrastructure, DevOps, or SRE experience
5+ years in senior SRE, DevOps Architect, or Platform Engineering leadership roles
5+ years hands-on experience with OpenShift/Kubernetes in production environments
3+ years practical experience with AI/ML infrastructure and operations
Experience managing enterprise-scale platforms (100,000+ users, multi-region deployments)
Track record of successfully delivering complex infrastructure modernization projects
Experience operating in regulated industries (finance, healthcare, government)

Nice to Have

Experience with Go programming language for building operators, controllers, or automation tools
Familiarity with CI/CD tools (Jenkins, Bitbucket, Git)
Experience with service mesh implementations (Istio)
Understanding of enterprise security frameworks and compliance requirements (SOC2, PCI-DSS)
Experience with secrets management (Vault or similar)
Knowledge of policy-as-code frameworks (OPA, Kyverno)

Who You Are

Beyond technical skills, you are:

Innovative problem solver who transforms complex operational challenges into scalable solutions
Passionate about AI-Ops and leveraging AI to revolutionize traditional reliability engineering
Hands-on technical leader comfortable diving deep into technical details while maintaining strategic perspective
Relentlessly focused on eliminating toil through intelligent automation
Data-driven with strong analytical skills and ability to use metrics to drive improvements
Excellent communicator able to articulate complex technical concepts to diverse audiences
Collaborative with experience working across teams (engineering, security, business)
Curious about emerging technologies with commitment to staying current
Pragmatic with ability to balance ideal solutions with practical constraints and timelines
Calm under pressure with strong troubleshooting and crisis management skills

Job Family Group:

Technology

Job Family:

Architecture

Time Type:

Full time

Most Relevant Skills

Please see the requirements listed above.

Other Relevant Skills

For complementary skills, please see above and/or contact the recruiter.

Citi is an equal opportunity employer, and qualified candidates will receive consideration without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other characteristic protected by law.

If you are a person with a disability and need a reasonable accommodation to use our search tools and/or apply for a career opportunity, review Accessibility at Citi.

View Citi’s EEO Policy Statement and the Know Your Rights poster.

About Citigroup

Citigroup

Public

Citigroup Inc. or Citi is an American multinational investment bank and financial services company based in New York City. The company was formed in 1998 by the merger of Citicorp, the bank holding company for Citibank, and Travelers; Travelers was spun off from the company in 2002.

Employees: 10,001+

Headquarters: New York City

Valuation: $86B

Reviews

Overall rating: 3.7 (10 reviews)

Work-life balance: 3.8

Compensation: 2.5

Culture: 4.0

Career growth: 3.2

Management: 3.5

Would recommend: 65%

Pros

Good work-life balance

Supportive management and colleagues

Good benefits

Cons

Low or uncompetitive salary/pay

Long hours during peak times

Poor management and lack of direction

Salary ranges

48 data points

Mid/L4

Senior/L5

Staff/L6

Mid/L4 · Business Analytics Senior Analyst

3 reports

$117,000

Total annual compensation

Base salary

$120,800

Stock

Bonus

$117,000

Interview reviews

3 reviews

Difficulty: 3.3 / 5

Duration: 14-28 weeks

Experience

Positive 0%

Neutral 33%

Negative 67%

Interview process

Application Review

Recruiter Screen

Technical Interview

Panel/Group Interview

Final Round

Offer

Common questions

Technical Knowledge

Coding/Algorithm

Behavioral/STAR

Past Experience

Culture Fit