Hiring
Required Skills
Python
Kubernetes
MLOps
ML Infrastructure
Docker
About Rivian
Rivian is on a mission to keep the world adventurous forever.
This goes for the emissions-free Electric Adventure Vehicles we build, and the curious, courageous souls we seek to attract.
As a company, we constantly challenge what’s possible, never simply accepting what has always been done.
We reframe old problems, seek new solutions and operate comfortably in areas that are unknown.
Our backgrounds are diverse, but our team shares a love of the outdoors and a desire to protect it for future generations.
Role Summary
The Autonomy org at Rivian is seeking a Senior Software Engineer, MLOps to join the Data team.
In this role, you will be the bridge between infrastructure and Deep Learning.
You will architect and build the platforms that enable our Research and Perception scientists to iterate faster.
The role requires a deep understanding of distributed training orchestration, model serving, GPU acceleration, and the full ML lifecycle.
You will be responsible for building complex, mission-critical ML infrastructure that powers Rivian's autonomous capabilities.
You will collaborate with Autonomy researchers, Product Management, and Infrastructure partners to apply best practices and scale Rivian's model training platform (on EKS/Ray), inference pipelines, and experiment tracking systems for the continuous improvement of Rivian's Autonomy stack.
Responsibilities
Architect Training Platforms: Lead the design and implementation of large-scale distributed training clusters using Kubernetes (EKS) and framework-native distributed strategies (e.g., PyTorch Distributed, Ray Train).
Orchestrate GPU Resources: Optimize GPU utilization and scheduling logic (using tools like Kueue, Volcano, and Karpenter) to maximize training throughput and minimize idle costs across thousands of GPUs.
ML CI/CD (CT/CD): Own the pipelines for Continuous Training and Continuous Deployment. Automate the path from code commit to training job, model evaluation, model registry, and deployment.
Model Serving Infrastructure: Build and optimize high-throughput, low-latency inference services using technologies like NVIDIA Triton, TorchServe, or vLLM.
Observability for ML: Implement monitoring specifically for ML workloads, including GPU-level metrics, training stability, model drift, and inference latency (using Prometheus, Grafana, Weights & Biases, or similar).
Developer Experience: Create abstractions and CLI tools that allow Data Scientists to launch experiments without needing deep Kubernetes expertise.
Cost Optimization: Drive cost-efficiency strategies for AWS GPU instances (Spot instances, mixed-instance policies) and storage tiers.
Fault Tolerance: Design checkpointing and recovery strategies for long-running training jobs to ensure resilience against node failures.
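To make the fault-tolerance responsibility above concrete, here is a minimal, framework-agnostic sketch of atomic checkpoint-and-resume logic. All function names and the JSON checkpoint format are invented for illustration; a production training job would persist model and optimizer state through PyTorch or Ray Train checkpoint APIs rather than plain JSON.

```python
import json
import os

def save_checkpoint(path, step, state):
    """Atomically persist training progress so a preempted job can resume."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename: readers never see a partial file

def load_checkpoint(path):
    """Return (step, state), or (0, {}) when starting fresh."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

def train(path, total_steps, fail_at=None):
    """Resumable loop: checkpoint every step, so a crash loses at most one step."""
    step, state = load_checkpoint(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
        step += 1
        save_checkpoint(path, step, state)
    return step
```

The atomic rename via `os.replace` is the load-bearing detail: a node can die mid-write, and a resuming pod must never load a truncated checkpoint.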
Qualifications
5+ years of engineering experience, with at least 3 years specifically in MLOps or ML Infrastructure.
Deep Kubernetes Expertise: Extensive experience managing EKS for batch workloads, including familiarity with CRDs, Operators, and ML-specific scheduling (e.g., KubeRay, MPI Operator).
ML Frameworks: Strong familiarity with the operational side of PyTorch, TensorFlow, or JAX. You understand how distributed data parallel (DDP) and FSDP work at an infrastructure level.
Distributed Compute: Hands-on experience with orchestration frameworks like Ray, Spark, or Kubeflow.
Infrastructure as Code: Proficiency with Terraform, AWS CDK, or Helm for defining ML infrastructure.
Cloud Native ML: Experience with AWS services specific to ML (SageMaker, FSx for Lustre, EFA/Elastic Fabric Adapter networking), or equivalent experience with GCP or Azure.
Programming: Strong proficiency in Python (required for ML tooling) and Go (preferred for K8s controllers/infra).
Model Lifecycle: Experience with Model Registries (MLflow or similar) and Feature Stores.
Containerization: Expertise in optimizing Docker containers for GPU workloads (multi-stage builds, CUDA drivers, reducing image size).
Debugging: Experience performing Root Cause Analysis (RCA) on complex distributed systems (e.g., diagnosing NCCL communication hangs or OOM errors).
Bonus Points:
Experience with NVIDIA Triton Inference Server or TensorRT optimization.
Knowledge of high-performance networking (InfiniBand, EFA, RDMA).
Contributions to open-source MLOps projects (Ray, Kubeflow, etc.).
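To illustrate the "DDP at an infrastructure level" qualification above: before `torch.distributed` workers can rendezvous, the orchestrator (KubeRay, the MPI Operator, etc.) must inject a consistent set of environment variables into every worker pod. A minimal sketch of that mapping follows; the helper name and addresses are hypothetical, and real operators compute these values for you.

```python
def ddp_env(node_rank, gpus_per_node, local_rank, num_nodes,
            master_addr, master_port=29500):
    """Environment a scheduler would inject so torch.distributed can rendezvous.

    Assumes one training process per GPU, the standard DDP layout.
    """
    return {
        "MASTER_ADDR": master_addr,                    # rank-0 pod's stable DNS name
        "MASTER_PORT": str(master_port),               # rendezvous TCP port
        "WORLD_SIZE": str(num_nodes * gpus_per_node),  # total process count
        "RANK": str(node_rank * gpus_per_node + local_rank),  # global rank
        "LOCAL_RANK": str(local_rank),                 # which GPU to bind on this node
    }
```

With one process per GPU, the global rank is `node_rank * gpus_per_node + local_rank`; an inconsistent `WORLD_SIZE` or a duplicated `RANK` across pods is a classic cause of rendezvous hangs.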
Equal Opportunity
Rivian is an equal opportunity employer and complies with all applicable federal, state, and local fair employment practices laws.
All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, ancestry, sex, sexual orientation, gender, gender expression, gender identity, genetic information or characteristics, physical or mental disability, marital/domestic partner status, age, military/veteran status, medical condition, or any other characteristic protected by law.
Rivian is committed to ensuring that our hiring process is accessible for persons with disabilities.
If you have a disability or limitation, such as those covered by the Americans with Disabilities Act, that requires accommodations to assist you in the search and application process, please email us at candidateaccommodations@rivian.com.
Candidate Data Privacy
Rivian may collect, use and disclose your personal information or personal data (within the meaning of the applicable data protection laws) when you apply for employment and/or participate in our recruitment processes (“Candidate Personal Data”).
This data includes contact, demographic, communications, educational, professional, employment, social media/website, network/device, recruiting system usage/interaction, security and preference information.
Rivian may use your Candidate Personal Data for the purposes of (i) tracking interactions with our recruiting system; (ii) carrying out, analyzing and improving our application and recruitment process, including assessing you and your application and conducting employment, background and reference checks; (iii) establishing an employment relationship or entering into an employment contract with you; (iv) complying with our legal, regulatory and corporate governance obligations; (v) recordkeeping; (vi) ensuring network and information security and preventing fraud; and (vii) as otherwise required or permitted by applicable law.
Rivian may share your Candidate Personal Data with (i) internal personnel who have a need to know such information in order to perform their duties, including individuals on our People Team, Finance, Legal, and the team(s) with the position(s) for which you are applying; (ii) Rivian affiliates; and (iii) Rivian’s service providers, including providers of background checks, staffing services, and cloud services.
Rivian may transfer or store internationally your Candidate Personal Data, including to or in the United States, Canada, the United Kingdom, and the European Union and in the cloud, and this data may be subject to the laws and accessible to the courts, law enforcement and national security authorities of such jurisdictions.
Please note that we are currently not accepting applications from third party application services.