Poolside

AI code generation foundation models

Member of Engineering (Scalability)

Function: Engineering
Level: Mid-level
Location: Remote (EMEA/East Coast)
Work mode: Remote
Type: Full-time
Posted: 2 months ago

Required skills

Linux

ABOUT POOLSIDE

In this decade, the world will create Artificial General Intelligence. There will only be a small number of companies who will achieve this. Their ability to stack advantages and pull ahead will define the winners. These companies will move faster than anyone else. They will attract the world's most capable talent. They will be on the forefront of applied research, engineering, infrastructure and deployment at scale. They will continue to scale their training to larger & more capable models. They will be given the right to raise large amounts of capital along their journey to enable this. They will create powerful economic engines. They will obsess over the success of their users and customers.

Poolside exists to be this company: to build a world where AI will be the engine behind economically valuable work and scientific progress. We believe the fastest way to reach AGI lies in accelerating software development itself, by reshaping the developer experience with agentic systems, coding assistants, and the frontier models that power them. We deploy these systems directly into the development environments of security-conscious enterprises.

ABOUT OUR TEAM:

We were founded in the US and have our home there, but our team is distributed across Europe and North America. We get our fix of in-person collaboration (and croissants) in Paris each month for 3 days, always Monday-Wednesday, with an open invitation to stay the whole week. We also do longer off-sites once a year.

Our team is a multidisciplinary blend of research, engineering, and business experts. What unites us is our deep care for what we build together. We’re in a race that requires hard work, intellectual curiosity, and obsession; to balance this intensity, we’ve assembled a team of low ego and kind-hearted individuals who have built the special culture Poolside has. By building collaboratively and with intention, we create a compounding effect that moves the entire company forward towards our mission: reaching AGI through intelligence systems built for software development.

ABOUT THE ROLE:

You would be working in our pre-training team focused on building out our distributed training and inference of Large Language Models (LLMs). This is a hands-on role that focuses on software reliability and fault tolerance. You will work on cross-platform checkpointing, NCCL recovery, and hardware fault detection. You will make high-level tools. You will not be afraid of debugging Linux kernel modules. You will have access to thousands of GPUs to test changes.

Strong engineering skills are a prerequisite. We assume good knowledge of Torch, NVIDIA GPU architecture, reliability concepts, distributed systems, and best coding practices. A basic understanding of LLM training and inference principles is required. We look for fast learners who are prepared for a steep learning curve and are not afraid to step out of their comfort zone.

YOUR MISSION

To help train the best foundation models for source code generation in the world.

RESPONSIBILITIES:

  • Identify, study, and troubleshoot hardware problems during training at scale

  • Minimize the GPU idle time during faults, both operationally and strategically

  • Design and develop tools and add-ons to accelerate the training recovery

  • Improve the performance and reliability of checkpointing

  • Write high-quality Python (PyTorch), Cython, C/C++, CUDA API code

SKILLS & EXPERIENCE:

  • Understanding of Large Language Models (LLM)

  • Basic knowledge of Transformers

  • Knowledge of deep learning fundamentals

  • Strong engineering background

  • Programming experience

  • Linux API, Linux kernel

  • Strong algorithmic skills

  • Python with NumPy, PyTorch, or JAX

  • C/C++

  • NCCL

  • Use of modern tools and a habit of always looking to improve

  • Strong critical thinking and ability to question code quality policies when applicable

  • Distributed systems

  • Reliability

  • Observability

  • Fault-tolerance

  • K8s stack

PROCESS

  • Intro call with one of our Founding Engineers

  • Technical Interview(s) with one of our Founding Engineers

  • Team fit call with the People team

  • Final interview with one of our Founding Engineers

BENEFITS:

  • Fully remote work & flexible hours

  • 37 days/year of vacation & holidays

  • Health insurance allowance for you & dependents

  • Company-provided equipment

  • Well-being, always-be-learning & home office allowances

  • Frequent team get-togethers

  • Diverse & inclusive people-first culture

About Poolside

Poolside AI, also styled poolside, is an American startup developing artificial intelligence to write computer software and coding applications.

Employees: 1-50

Headquarters: San Francisco

Valuation: $12B

Reviews

Overall: 3.5 (10 reviews)

Work-life balance: 4.2

Compensation: 2.5

Company culture: 3.8

Career development: 2.3

Management: 2.7

Recommendation rate: 65%

Pros

Flexible work arrangements and scheduling

Supportive and friendly team environment

Good work-life balance

Cons

Limited growth and advancement opportunities

Poor management and lack of direction

Low pay and lack of benefits

Salary range

4 data points

Senior/L5

Director

Senior/L5 · Performance Director (1 report)

Total annual compensation: $264,500

Base salary: $230,000

Stock: -

Bonus: -