AI Infra Engineer

Perplexity AI

London · On-site · Full-time · 1w ago

We are looking for an AI Infrastructure Engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. As an AI Infrastructure Engineer, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.

Responsibilities

  • Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads

  • Manage and optimize Slurm-based HPC environments for distributed training of large language models

  • Develop robust APIs and orchestration systems for both training pipelines and inference services

  • Implement resource scheduling and job management systems across heterogeneous compute environments

  • Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure

  • Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm

  • Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services

  • Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

Qualifications

  • Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management

  • Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization

  • Experience deploying and managing distributed training systems at scale

  • Deep understanding of container orchestration and distributed systems architecture

  • High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query and Grouped-Query Attention, distributed training strategies)

  • Experience managing GPU clusters and optimizing compute resource utilization

Required Skills

  • Expert-level Kubernetes administration and YAML configuration management

  • Proficiency with Slurm job scheduling, resource management, and cluster configuration

  • Python and C++ programming with a focus on systems and infrastructure automation

  • Hands-on experience with ML frameworks such as PyTorch in distributed training contexts

  • Strong understanding of networking, storage, and compute resource management for ML workloads

  • Experience developing APIs and managing distributed systems for both batch and real-time workloads

  • Solid debugging and monitoring skills with expertise in observability tools for containerized environments

Preferred Skills

  • Experience with Kubernetes operators and custom controllers for ML workloads

  • Advanced Slurm administration including multi-cluster federation and advanced scheduling policies

  • Familiarity with GPU cluster management and CUDA optimization

  • Experience with other ML frameworks like TensorFlow or distributed training libraries

  • Background in HPC environments, parallel computing, and high-performance networking

  • Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices

  • Experience with container registries, image optimization, and multi-stage builds for ML workloads

Required Experience

  • Demonstrated experience managing large-scale Kubernetes deployments in production environments

  • Proven track record with Slurm cluster administration and HPC workload management

  • Previous roles in SRE, DevOps, or Platform Engineering with a focus on ML infrastructure

  • Experience supporting both long-running training jobs and high-availability inference services

  • Ideally, 3-5 years of relevant experience in ML systems deployment, with a specific focus on cluster orchestration and resource management

About Perplexity AI

Perplexity AI, Inc., or simply Perplexity, is an American privately held software company offering a web search engine that processes user queries and synthesizes responses.

Employees: 51-200

Headquarters: San Francisco

Valuation: $1B

Reviews

Overall rating: 3.8 (10 reviews)

Work-life balance: 3.2

Compensation: 2.5

Company culture: 4.0

Career growth: 2.5

Management: 2.8

Recommend to a friend: 65%

Pros

Supportive team and management

Good work-life balance and flexibility

Cutting-edge technology and interesting projects

Cons

Low compensation compared to industry standards

Poor management and lack of leadership direction

Fast-paced and overwhelming workload

Salary range

26 data points

Junior/L3 · LLM Teacher (1 report)

Total annual compensation: $101,920

Base salary: $78,400

Stock: -

Bonus: -

Interview experience

1 interview

Difficulty: 4.0 / 5

Duration: 14-28 weeks

Experience: 0% positive, 0% neutral, 100% negative

Interview process

  1. Application Review

  2. HR Screen

  3. Take-home Marketing Challenge

  4. Hiring Manager Interview

  5. Panel Interview

  6. Offer

Common questions

Digital Marketing Strategy

Campaign Performance Analysis

Behavioral/STAR

Technical Marketing Knowledge

Case Study