Hiring

Senior Product Manager - Observability and Resilience

NVIDIA

2 Locations

On-site

Full-time

1mo ago

Compensation

$208,000 - $327,750

Benefits & Perks

• Competitive base salary and bonus

• Generous vacation

• Team events

• Professional development

• Learning

Required Skills

Figma

Mixpanel

SQL

Product Manager

Resiliency and Observability for AI Computing Platforms

About the Role

NVIDIA has become the platform upon which every new AI-powered application is built. From healthcare research applications to autonomous vehicles and voice-recognition systems, there is a need to simplify and deliver predictability for AI applications and workflows, and NVIDIA is right in the center of this revolution. Resiliency and observability are key to delivering customer value and an exhilarating customer experience. This product manager will lead the development of foundational tools dedicated to ensuring the resiliency and observability of large-scale accelerated computing platforms. By creating essential tools for system diagnostics, performance monitoring, and automated recovery, they will empower customers to confidently operate both complex AI training and demanding inference workloads with maximum uptime and efficiency.

What You Will Be Doing

Be a subject‑matter expert on resiliency and observability. Deeply understand failure modes across the GPU hardware, network, and software stack, the telemetry signals that reveal them, and how they correlate with workload health and SLOs.
Master modern reliability architectures. Keep up to date with industry trends.
Build for everyone who wants to use these tools.
Drive joint project planning. Define concrete achievements, tasks, and work for resiliency and observability initiatives with external partners.
Fuel innovation in reliability tooling. Lead ideation sessions to propose novel approaches and shape new proof‑of‑concepts.
Bridge development, SRE, and partner teams. Facilitate clear communication, triage emergent issues rapidly, and ensure feedback loops between engineering and customer operations remain tight.
Coordinate execution across different functions. Work with engineering, design, operations, sales, and marketing to embed resiliency and observability requirements into every product launch, capacity expansion, and lifecycle transition.

Qualifications

BS or MS in Computer Science, Computer Engineering, or a related field (or equivalent experience)
12+ years of product‑management experience in enterprise technology
Experience with GPU observability (DCGM, NVML, etc.) and integration into large‑scale telemetry systems
Deep knowledge of AI/ML infrastructure, high‑performance computing (HPC), networking, and cloud technologies (IaaS, PaaS) including containerization, Kubernetes, and automation tools
Familiarity with modern observability stacks: metrics, logs, traces, OpenTelemetry, Prometheus/Grafana, ELK/OpenSearch
Experience building, and preferably a deep understanding of, secure, compliance‑focused telemetry pipelines (SOC 2, FedRAMP)
Ability to articulate trade‑offs among latency, throughput, cost, and reliability to both engineering and executive audiences
Data-driven approach: defines SLIs/SLOs, manages error budgets, and develops value models
Strong cross‑functional execution: writes clear specs and PRDs, produces GTM collateral, and leads agile processes

Ways to Stand Out

Master's/PhD or expertise in distributed systems, performance modeling, or fault‑tolerant computing
Experience with MLOps and LLMOps ecosystems and integrating with enterprise platforms; deployments at modern data‑center scale; delivered ML/AI observability solutions for LLMOps, predictive incident detection, or anomaly classification
Startup or 0-to-1 experience building cloud‑native observability or resilience tools; proven success bringing open‑source observability products to market and shaping GTM strategy
Familiarity with MLOps toolchains and integrations with monitoring platforms such as Splunk, Datadog, and Grafana Cloud
Expertise with containerization technologies like Docker and Kubernetes, plus virtualization
Proficiency in network architecture and high‑performance interconnects (InfiniBand, Ethernet, RoCE)

About NVIDIA

We have some of the most forward-thinking and hardworking people in the world working for us and, due to outstanding growth, our elite engineering teams are growing fast. NVIDIA is widely considered to be one of the industry's most desirable employers. NVIDIA is at the center of Deep Learning, Artificial Intelligence, and Autonomous Vehicles. If you're looking for a challenge, thrive in an ambiguous environment, and share our passion for technology, we want to hear from you. We are looking for great people to help us accelerate the next wave of artificial intelligence.

NVIDIA is the world leader in accelerated computing. NVIDIA pioneered accelerated computing to tackle challenges no one else can solve. Our work in AI and digital twins is transforming the world's largest industries and profoundly impacting society.

Benefits and Compensation

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 208,000 USD - 327,750 USD. You will also be eligible for equity and benefits.

Additional Information

Work Location: Hybrid
Application Deadline: At least until January 13, 2026
Position Type: This posting is for an existing vacancy
Note: NVIDIA uses AI tools in its recruiting processes.

Equal Opportunity

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.



About NVIDIA

NVIDIA

Public

A computing platform company operating at the intersection of graphics, HPC, and AI.

10,001+

Employees

Santa Clara

Headquarters

$4.57T

Valuation

Reviews

4.1 (10 reviews)

Work Life Balance: 3.5
Compensation: 4.2
Culture: 4.3
Career: 4.5
Management: 4.0
Recommend to a Friend: 75%

Pros

Great culture and supportive environment

Smart colleagues and excellent people

Cutting-edge technology and learning opportunities

Cons

Team-dependent experience and outcomes

Work-life balance issues with long hours

Politics and influence over competence

Salary Ranges

47 data points

L3 · Product Manager IC2

0 reports

$183,722

total / year

Base

Stock

Bonus

$156,163

$211,281

Interview Experience

7 interviews

Difficulty: 3.1 / 5

Experience: Positive 0%, Neutral 86%, Negative 14%

Interview Process

Application Review

Recruiter Screen

Online Assessment

Technical Interview

System Design Interview

Team Review

Common Questions

Coding/Algorithm

System Design

Technical Knowledge

Behavioral/STAR

News & Buzz

Negotiating NVIDIA's Offer

Base, stock, and sign-on negotiable. Recruiters invested in closing candidates. CEO reviews all 42K employee salaries monthly. Stock growth has made many employees millionaires.


NVIDIA Company Reviews

WLB rated 3.9/5 (lowest category). 64% satisfied with WLB but 53% feel burnt out. Compensation rated 4.4-4.5/5. Experience highly team-dependent.


NVIDIA Culture Discussions

Team-dependent experience; sink-or-swim culture that rewards high performers but can be overwhelming. No politics, flat structure, but demanding workload with some teams requiring evening/weekend work.


NVIDIA Interview Discussions

Technical bar is high with 4-6 rounds. Process takes 4-8 weeks. Expect C++ questions, LeetCode medium, and system design. Difficulty rated 3.16/5.
