
Global investment banking and financial services
GenAI Site Reliability Engineering Architect - Senior Vice President
About the Role
We're seeking an exceptional Site Reliability Engineering Architect to lead the technical vision and operational excellence of our enterprise GenAI platform serving 180,000+ Citi employees globally. This is a senior individual contributor role for someone who wants to architect intelligent, self-healing infrastructure at the intersection of AI and reliability engineering—without the overhead of people management.
You'll work with cutting-edge AI infrastructure including Claude, Gemini, and proprietary Citi models running on OpenShift/Kubernetes, building the next generation of AI-Ops capabilities that transform traditional operations into intelligent, autonomous systems.
About Our Team
Our team operates like a research-driven startup within Citi, rapidly innovating on AI operations while maintaining enterprise-grade reliability, security, and compliance. We build and operate Citi Stylus Workspaces and other mission-critical GenAI platforms that demand exceptional reliability, security, and performance at global scale.
What You'll Do
Platform Architecture & Reliability
- Design and architect highly available, GPU-accelerated OpenShift clusters optimized for GenAI workloads
- Build Model-as-a-Service platforms enabling seamless LLM hosting, inference, and lifecycle management
- Architect multi-cluster, multi-region infrastructure supporting global AI platform availability (99.9%+ SLA)
- Implement intelligent resource scheduling and optimization for GPU workloads and AI inference engines
AI-Ops & Intelligent Automation
- Design and implement agentic AI workflows for automated incident detection, diagnosis, and remediation
- Build Model Context Protocol (MCP) integrations enabling AI-driven operational decision-making
- Create self-healing systems leveraging log analysis, anomaly detection, and automated remediation pipelines
- Transform operational toil into intelligent automation that learns and adapts
Observability & Performance
- Design and implement comprehensive observability stacks with Prometheus and Grafana providing deep visibility into AI workloads
- Build custom metrics, exporters, and dashboards for LLM-specific monitoring (token throughput, inference latency, GPU utilization)
- Establish SLO/SLI frameworks and error budget management for AI services
- Drive performance optimization through data-driven analysis
Platform Engineering & GitOps
- Architect and deploy OpenShift operators for AI/ML workloads (OpenShift AI, NVIDIA GPU Operator, Knative)
- Design custom Kubernetes operators and controllers for platform-specific automation needs
- Architect and maintain GitOps-driven deployment pipelines for multi-cluster AI infrastructure
- Manage cluster lifecycle operations including upgrades, patching, and capacity expansion
Technical Leadership
- Define technical vision and roadmap for GenAI platform reliability and operational excellence
- Lead production incident response, root cause analysis, and blameless post-mortem processes
- Provide technical mentorship to SRE and DevOps teams on advanced automation and AI-Ops practices
- Partner with engineering, security, and business leaders to align infrastructure strategy with organizational objectives
What You Bring
Core Technical Expertise (Must-Have)
OpenShift & Kubernetes Mastery
- 5+ years of expert-level OpenShift 4.x administration and architecture experience
- 5+ years of deep Kubernetes expertise including custom operators, controllers, and CRDs
- Hands-on experience with Red Hat Advanced Cluster Management (RHACM) and multi-cluster operations
- Experience designing and implementing Kubernetes operators using Operator SDK or similar frameworks
AI/ML Infrastructure & Operations
- Practical experience deploying and operating AI/ML platforms (OpenShift AI, Kubeflow, or similar)
- Knowledge of GPU cluster provisioning, NVIDIA GPU Operator, and accelerated computing workloads
- Understanding of LLM inference optimization and model serving frameworks (vLLM, TensorRT, ONNX)
- Experience with Model-as-a-Service architectures and MLOps lifecycle management
Automation & Infrastructure as Code
- 5+ years of expert-level experience with Terraform and Ansible for infrastructure provisioning and configuration management
- Strong scripting skills in Python, Bash, and PowerShell for automation and tooling
- Experience with GitOps workflows and declarative infrastructure management
- Proficiency with Helm charts and Kubernetes manifest templating
Observability & Reliability Engineering
- Deep expertise in Prometheus, Grafana, and metrics-driven reliability engineering
- Experience designing custom metrics, exporters, and dashboards for specialized workloads
- Knowledge of distributed tracing and log aggregation (Splunk or similar)
- Understanding of SLO/SLI frameworks and error budget management
Cloud & Hybrid Infrastructure
- Experience with AWS and Azure cloud platforms and hybrid cloud architectures
- Knowledge of GPU instance types and cost optimization strategies
- Understanding of cloud-native networking, storage, and security patterns
- Familiarity with vSphere and on-premises virtualization platforms
Emerging AI-Ops Capabilities (Highly Valued)
- Experience implementing agentic AI workflows and autonomous remediation systems
- Knowledge of Model Context Protocol (MCP) or similar AI orchestration frameworks
- Practical experience with AI-driven anomaly detection and predictive analytics
- Familiarity with serverless frameworks (Knative) and event-driven architectures
Professional Experience
- 15+ years of overall infrastructure, DevOps, or SRE experience
- 5+ years in senior SRE, DevOps Architect, or Platform Engineering leadership roles
- 5+ years of hands-on experience with OpenShift/Kubernetes in production environments
- 3+ years of practical experience with AI/ML infrastructure and operations
- Experience managing enterprise-scale platforms (100,000+ users, multi-region deployments)
- Track record of successfully delivering complex infrastructure modernization projects
- Experience operating in regulated industries (finance, healthcare, government)
Nice to Have
- Experience with Go programming language for building operators, controllers, or automation tools
- Familiarity with CI/CD tools (Jenkins, Bitbucket, Git)
- Experience with service mesh implementations (Istio)
- Understanding of enterprise security frameworks and compliance requirements (SOC2, PCI-DSS)
- Experience with secrets management (Vault or similar)
- Knowledge of policy-as-code frameworks (OPA, Kyverno)
Who You Are
Beyond technical skills, you are:
- An innovative problem solver who transforms complex operational challenges into scalable solutions
- Passionate about AI-Ops and leveraging AI to revolutionize traditional reliability engineering
- A hands-on technical leader comfortable diving deep into technical details while maintaining a strategic perspective
- Relentlessly focused on eliminating toil through intelligent automation
- Data-driven, with strong analytical skills and the ability to use metrics to drive improvements
- An excellent communicator able to articulate complex technical concepts to diverse audiences
- Collaborative, with experience working across teams (engineering, security, business)
- Curious about emerging technologies, with a commitment to staying current
- Pragmatic, with the ability to balance ideal solutions against practical constraints and timelines
- Calm under pressure, with strong troubleshooting and crisis management skills
Job Family Group:
Technology
Job Family:
Architecture
Time Type:
Full time
Most Relevant Skills
Please see the requirements listed above.
Other Relevant Skills
For complementary skills, please see above and/or contact the recruiter.
Citi is an equal opportunity employer, and qualified candidates will receive consideration without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other characteristic protected by law.
If you are a person with a disability and need a reasonable accommodation to use our search tools and/or apply for a career opportunity, review Accessibility at Citi.
View Citi’s EEO Policy Statement and the Know Your Rights poster.
About Citigroup
Citigroup Inc. (Citi) is a publicly traded American multinational investment bank and financial services company based in New York City. The company was formed in 1998 by the merger of Citicorp, the bank holding company for Citibank, and Travelers; Travelers was spun off from the company in 2002.
Employees: 10,001+
Headquarters: New York City
Company valuation: $86B