AI Infra Engineer

Perplexity AI

London · On-site · Full-time · 1w ago

We are looking for an AI Infrastructure Engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. As an AI Infrastructure Engineer, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.

Responsibilities

  • Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads

  • Manage and optimize Slurm-based HPC environments for distributed training of large language models

  • Develop robust APIs and orchestration systems for both training pipelines and inference services

  • Implement resource scheduling and job management systems across heterogeneous compute environments

  • Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure

  • Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm

  • Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services

  • Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

Qualifications

  • Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management

  • Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization

  • Experience with deploying and managing distributed training systems at scale

  • Deep understanding of container orchestration and distributed systems architecture

  • High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query/Grouped-Query Attention, distributed training strategies)

  • Experience managing GPU clusters and optimizing compute resource utilization

Required Skills

  • Expert-level Kubernetes administration and YAML configuration management

  • Proficiency with Slurm job scheduling, resource management, and cluster configuration

  • Python and C++ programming with focus on systems and infrastructure automation

  • Hands-on experience with ML frameworks such as PyTorch in distributed training contexts

  • Strong understanding of networking, storage, and compute resource management for ML workloads

  • Experience developing APIs and managing distributed systems for both batch and real-time workloads

  • Solid debugging and monitoring skills with expertise in observability tools for containerized environments

Preferred Skills

  • Experience with Kubernetes operators and custom controllers for ML workloads

  • Advanced Slurm administration including multi-cluster federation and advanced scheduling policies

  • Familiarity with GPU cluster management and CUDA optimization

  • Experience with other ML frameworks like TensorFlow or distributed training libraries

  • Background in HPC environments, parallel computing, and high-performance networking

  • Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices

  • Experience with container registries, image optimization, and multi-stage builds for ML workloads

Required Experience

  • Demonstrated experience managing large-scale Kubernetes deployments in production environments

  • Proven track record with Slurm cluster administration and HPC workload management

  • Previous roles in SRE, DevOps, or Platform Engineering with focus on ML infrastructure

  • Experience supporting both long-running training jobs and high-availability inference services

  • Ideally, 3-5 years of relevant experience in ML systems deployment with specific focus on cluster orchestration and resource management

About Perplexity AI

Perplexity AI, Inc., or simply Perplexity, is an American privately held software company offering a web search engine that processes user queries and synthesizes responses.

Employees: 51-200

Headquarters: San Francisco

Valuation: $1B

Reviews

Overall rating: 3.8 (10 reviews)

Work-life balance: 3.2

Compensation: 2.5

Culture: 4.0

Career: 2.5

Management: 2.8

Recommend to a friend: 65%

Pros

  • Supportive team and management

  • Good work-life balance and flexibility

  • Cutting-edge technology and interesting projects

Cons

  • Low compensation compared to industry standards

  • Poor management and lack of leadership direction

  • Fast-paced and overwhelming workload

Salary information

26 data points

Junior/L3 · LLM Teacher (1 report)

Total compensation: $101,920

Base salary: $78,400

Stock: -

Bonus: -

Interview experience

1 interview

Difficulty: 4.0 / 5

Duration: 14-28 weeks

Experience: 0% positive, 0% neutral, 100% negative

Interview process

  1. Application Review

  2. HR Screen

  3. Take-home Marketing Challenge

  4. Hiring Manager Interview

  5. Panel Interview

  6. Offer

Frequently asked questions

  • Digital Marketing Strategy

  • Campaign Performance Analysis

  • Behavioral/STAR

  • Technical Marketing Knowledge

  • Case Study