Jobs

Site Reliability Lead

Vanguard

Malvern, PA

On-site

Full-time

5d ago

Responsibilities

Ensure system reliability, stability and performance by maintaining service-level objectives (SLOs) and minimizing downtime and incidents.
Collaborate with internal teams to assess system health, stability and resilience, providing architectural and design recommendations for reliability.
Lead incident management and post-incident reviews, diagnosing issues, deploying fixes and implementing preventive measures.
Drive automation of operational tasks, including deployments, monitoring, scaling and system recovery, to improve efficiency and reduce manual intervention.
Define and track key performance indicators (KPIs) such as availability, latency and error rates to optimize system performance and inform decision-making.
Plan and execute chaos engineering experiments to test system resilience and coordinate performance testing for reliability improvements.
Ensure alignment between service-level indicators (SLIs) and service-level objectives (SLOs) across the product family.
Develop and maintain product-level runbooks for incident response, collaborating with SRE teams to ensure effective recovery processes.
Provide leadership in tool selection and best practices for site reliability engineering (SRE), making final decisions on tools, libraries and standards.
Work closely with development teams to improve software reliability, scalability and resilience by offering feedback on design and architecture.
Lead troubleshooting and triage efforts during user-impacting incidents, ensuring swift resolution and minimal disruption.
Participate in special projects and continuous improvement initiatives, supporting long-term reliability and scalability goals.

Qualifications

Minimum 8 years of related experience, with at least 5 years in software development.
Bachelor’s degree (B.E./B.Tech) in Computer Science or IT, or Bachelor’s in Computer Applications (BCA) from a recognized institution.
Expertise in Site Reliability Engineering (SRE), DevOps, and system reliability, ensuring high availability and performance.
Strong programming and scripting skills in Python, Go, Bash, or Java, with experience in automating operational tasks.
Proficiency in observability and resiliency tools such as Splunk, Honeycomb, Datadog, Prometheus, or Grafana.
Hands-on experience with cloud platforms (AWS, Azure, GCP) and containerization/orchestration tools like Kubernetes, Docker, ECS, or Fargate.
Solid understanding of automation, Infrastructure-as-Code (IaC), and configuration management using Terraform, Ansible, or CloudFormation.
Experience with CI/CD pipelines, deployment automation, and version control tools like GitHub, Bitbucket, Jenkins, or Bamboo.
Deep knowledge of incident management, root cause analysis, and post-incident reviews, with a focus on continuous improvement.
Experience in mobile platform reliability (Android, iOS), including performance monitoring and optimization, is desired.

Special Factors

Sponsorship

Vanguard is not offering visa sponsorship for this position.

About Vanguard

At Vanguard, we don't just have a mission—we're on a mission.

To work for the long-term financial wellbeing of our clients. To lead through products and services that transform our clients' lives. To learn and develop our skills as individuals and as a team. From Malvern to Melbourne, our mission drives us forward and inspires us to be our best.

How We Work

Vanguard has implemented a hybrid working model for the majority of our crew members, designed to capture the benefits of enhanced flexibility while enabling in-person learning, collaboration, and connection. We believe our mission-driven and highly collaborative culture is a critical enabler to support long-term client outcomes and enrich the employee experience.

About Vanguard

Vanguard

A client-owned investment company that offers low-cost mutual funds, ETFs, advice, and related services to institutional and individual investors, and financial professionals.

10,001+

Employees

Malvern, PA

Headquarters

Reviews

3.4 (3 reviews)

Work Life Balance: 2.5
Compensation: 3.2
Culture: 2.8
Career: 3.5
Management: 3.0

Recommend to a Friend: 45%

Pros

Competitive compensation package with bonuses

Good foundation for career development

Interesting programs aligned with education

Cons

Long commute requirements (2.5 hours)

Mandatory on-site presence multiple days

Pay below industry standards

Salary Ranges

1,532 data points

Junior/L3 · Client Relationship Associate (529 reports)

Total: $60,018 / year
Base: $55,076
Stock: none listed
Bonus: $4,942

$46,375 - $78,763

Interview Experience

3 interviews

Difficulty: 3.0 / 5

Duration: 14-28 weeks

Interview Process

Application Review

Recruiter/HR Phone Screen

Technical/Case Study Round

Final Round Interview

Offer

Common Questions

Behavioral/STAR

Technical Knowledge

Case Study

Past Experience

Culture Fit
