Sr Director of Software Engineering - AI Infrastructure Platform
Palo Alto, CA, United States · On-site · Full-time · Posted 1w ago
Your opportunity to make a real impact and shape the future of financial services is waiting for you. Let’s push the boundaries of what's possible together.
As a Senior Director of Software Engineering at JPMorgan Chase within the firmwide AI Infrastructure Platform organization, you will lead multiple technical areas and manage the activities of multiple departments responsible for delivering a unified AI infrastructure layer across on‑premises environments, public cloud, and emerging accelerated‑compute vendors. You will collaborate across AI/ML engineering, infrastructure, security and controls, and vendor teams to ensure the firm remains at the forefront of AI platform capabilities, operational excellence, and industry best practices.
In this role, you will own training and experimentation on a Kubernetes‑standardized platform. While a dedicated architecture function exists, you will act as an active design partner—guiding architectural trade‑offs and ensuring designs translate into reliable, secure, and operable systems at enterprise scale.
Job responsibilities
- Lead multiple technology and platform implementations across departments to deliver firmwide AI infrastructure objectives, with a primary focus on training and experimentation platforms operating at enterprise scale.
- Own the design, delivery, and evolution of a Kubernetes‑first training and experimentation platform, including Kubernetes‑native support for batch and distributed training jobs, lifecycle management, retry semantics, and failure recovery patterns.
- Standardize AI developer workflows for experimentation, enabling self‑service job submission, reusable templates and golden paths, reproducibility mechanisms, and consistent runtime behavior across hybrid deployment environments.
- Build and evolve platform APIs and automation, including Kubernetes controllers and operators where appropriate, to ensure the platform is safe, scalable, and easy to adopt across teams.
- Drive measurable improvements in GPU availability and utilization through reliability engineering, fleet readiness patterns, and accelerated capacity onboarding.
- Define and implement governance‑based scheduling and placement strategies, including:
  - Multi‑tenant GPU quotas and guardrails
  - Priority, admission control, and reservation patterns
  - Preemption policies
  - Fragmentation reduction and topology‑aware placement (GPU type, MIG, and topology awareness)
- Embed enterprise‑grade security, risk, and control requirements into platform defaults, including IAM and RBAC controls, secrets management, audit logging, policy enforcement, network segmentation, and controlled change management.
- Drive operational excellence by establishing SLIs and SLOs, managing error budgets, leading incident management practices, forecasting capacity, and delivering end‑to‑end platform observability across job lifecycles and GPU telemetry.
- Act as the primary interface with senior leaders, stakeholders, and executives, driving alignment and consensus across competing priorities and complex initiatives.
- Lead multiple engineering teams and managers, building a high‑performing organization with strong engineering standards, scalable operating models, and a culture of accountability and continuous improvement.
- Champion the firm’s culture of diversity, opportunity, inclusion, and respect.
Required qualifications, capabilities, and skills
- 15+ years of engineering experience, including 8+ years of senior engineering leadership experience with responsibility for managing managers.
- Demonstrated experience delivering platform products (beyond foundational infrastructure) with strong adoption, reliability, and operational maturity.
- Experience developing and leading large, cross‑functional engineering teams within highly matrixed and complex enterprise environments.
- Proven track record of leading complex initiatives supporting distributed system design, testing, and operational stability at scale.
- Deep hands‑on expertise with Kubernetes‑based platforms, including:
  - Multi‑tenancy, RBAC, admission control, and network policy
  - Multi‑cluster operations, upgrades, and cluster lifecycle management
  - Controllers, operators (CRDs), and platform API design patterns
- Experience supporting AI training and experimentation platforms, including:
  - PyTorch and distributed training concepts such as scaling, orchestration, and failure modes
  - Ray or similar frameworks for distributed experimentation execution
  - Familiarity with Slurm or equivalent HPC or batch schedulers and core concepts such as queues, fair‑share, reservations, and preemption
- Understanding of modern AI inference stacks (for example, vLLM) and how serving constraints—latency, throughput, batching, KV cache behavior, and GPU memory limits—influence training and experimentation platform design.
- Strong understanding of GPU infrastructure fundamentals, including NVIDIA ecosystem capabilities, health and telemetry signals, and scheduling and placement constraints.
- Extensive practical experience with cloud‑native technologies and hybrid infrastructure environments spanning on‑premises and public cloud.
- Experience hiring, developing, coaching, and retaining high‑performing engineering talent.
Preferred qualifications, capabilities, and skills
- Experience operating large‑scale GPU fleets, including heterogeneous accelerator environments.
- Experience delivering hybrid AI platforms across on‑premises infrastructure, public cloud, and specialized accelerated‑compute vendors.
- Experience working at the code level within large‑scale distributed systems.
This position is subject to Section 19 of the Federal Deposit Insurance Act. As such, an employment offer for this position is contingent on JPMorgan Chase’s review of criminal conviction history, including pretrial diversions or program entries.
About JPMorgan Chase

JPMorgan Chase · Public
JPMorgan Chase & Co. is an American multinational banking institution headquartered in New York City and incorporated in Delaware. It is the largest bank in the United States, and the world's largest bank by market capitalization as of 2025.
Employees: 300,000+
Headquarters: New York City
Company valuation: $500B
Reviews
3.8 · 10 reviews
Work-life balance: 3.2
Compensation: 4.1
Company culture: 3.8
Career: 3.0
Management: 2.5
Would recommend to a friend: 65%
Pros
- Good benefits and compensation
- Supportive and collaborative environment
- Flexible work arrangements
Areas for improvement
- Long hours and heavy workload
- Management issues and lack of direction
- High stress during peak times
Salary range
41 data points
Levels: Junior/L3, Mid/L4, Senior/L5
Junior/L3 · Analytics Solutions Associate (1 report)
Total annual compensation: $139,000
Base salary: $107,000
Stock: -
Bonus: -
Interview experience
5 interviews
Difficulty: 3.0 / 5
Duration: 14-28 weeks
Offer rate: 40%
Experience: Positive 20%, Neutral 80%, Negative 0%
Interview process
1. Application Review
2. HireVue Video Interview
3. Recruiter Screen
4. Superday/Panel Interview
5. Final Interview
6. Offer
Common questions
- Behavioral/STAR
- Technical Knowledge
- Culture Fit
- Past Experience
- Case Study