Hiring
We are looking for an AI Infrastructure Engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. As an AI Infrastructure Engineer, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.
Responsibilities
- Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
- Manage and optimize Slurm-based HPC environments for distributed training of large language models
- Develop robust APIs and orchestration systems for both training pipelines and inference services
- Implement resource scheduling and job management systems across heterogeneous compute environments
- Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
- Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
- Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
- Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands
Qualifications
- Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
- Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
- Experience deploying and managing distributed training systems at scale
- Deep understanding of container orchestration and distributed systems architecture
- High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi/Grouped-Query Attention, distributed training strategies)
- Experience managing GPU clusters and optimizing compute resource utilization
Required Skills
- Expert-level Kubernetes administration and YAML configuration management
- Proficiency with Slurm job scheduling, resource management, and cluster configuration
- Python and C++ programming with a focus on systems and infrastructure automation
- Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
- Strong understanding of networking, storage, and compute resource management for ML workloads
- Experience developing APIs and managing distributed systems for both batch and real-time workloads
- Solid debugging and monitoring skills with expertise in observability tools for containerized environments
Preferred Skills
- Experience with Kubernetes operators and custom controllers for ML workloads
- Advanced Slurm administration, including multi-cluster federation and advanced scheduling policies
- Familiarity with GPU cluster management and CUDA optimization
- Experience with other ML frameworks like TensorFlow or distributed training libraries
- Background in HPC environments, parallel computing, and high-performance networking
- Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices
- Experience with container registries, image optimization, and multi-stage builds for ML workloads
Required Experience
- Demonstrated experience managing large-scale Kubernetes deployments in production environments
- Proven track record with Slurm cluster administration and HPC workload management
- Previous roles in SRE, DevOps, or Platform Engineering with a focus on ML infrastructure
- Experience supporting both long-running training jobs and high-availability inference services
- Ideally, 3-5 years of relevant experience in ML systems deployment with a specific focus on cluster orchestration and resource management
About Perplexity AI
Perplexity AI, Inc., or simply Perplexity, is an American privately held software company offering a web search engine that processes user queries and synthesizes responses.
- Funding stage: Series B
- Employees: 51-200
- Headquarters: San Francisco
- Valuation: $1B
Reviews
Rating: 3.8 (10 reviews)
- Work-life balance: 3.2
- Compensation: 2.5
- Company culture: 4.0
- Career: 2.5
- Management: 2.8
- Would recommend to a friend: 65%
Pros
- Supportive team and management
- Good work-life balance and flexibility
- Cutting-edge technology and interesting projects
Cons
- Low compensation compared to industry standards
- Poor management and lack of leadership direction
- Fast-paced and overwhelming workload
Salary Range
26 data points
Junior/L3 · LLM Teacher (1 report)
- Total annual compensation: $101,920
- Base salary: $78,400
- Stock: -
- Bonus: -
Interview Experience
1 interview
- Difficulty: 4.0 / 5
- Duration: 14-28 weeks
- Experience: Positive 0%, Neutral 0%, Negative 100%
Interview Process
1. Application Review
2. HR Screen
3. Take-home Marketing Challenge
4. Hiring Manager Interview
5. Panel Interview
6. Offer
Common Questions
- Digital Marketing Strategy
- Campaign Performance Analysis
- Behavioral/STAR
- Technical Marketing Knowledge
- Case Study