Hiring
Benefits & Perks
• Healthcare
• 401(k)
• Equity
• Flexible Hours
• Parental Leave
• Mental Health
• Learning Budget
Required Skills
Python
Distributed data processing
Machine learning infrastructure
Statistics
Probability theory
Reddit is a community of communities. It’s built on shared interests, passion, and trust, and is home to the most open and authentic conversations on the internet. Every day, Reddit users submit, vote, and comment on the topics they care most about. With 100,000+ active communities and approximately 116 million daily active unique visitors, Reddit is one of the internet’s largest sources of information. For more information, visit www.redditinc.com.
Reddit is continuing to grow our teams with the best talent. This role is fully remote-friendly within the United States. If you live close to one of our physical offices (San Francisco, Los Angeles, New York City, and Chicago), our doors are open for you to come into the office as often as you'd like.
The AI Engineering team at Reddit is embarking on a strategic initiative to build our own Reddit-native foundational Large Language Models (LLMs). This team sits at the intersection of applied research and massive-scale infrastructure, tasked with training models that truly understand the unique culture, language, and structure of Reddit communities. You will be joining a team of distinguished engineers and safety experts to build the "engine room" of Reddit's AI future—creating the foundational models that will power Safety & Moderation, Search, Ads, and the next generation of user products.
As a Staff Research Engineer for Pre-training Data, you will define the technical strategy and architecture for the data curriculum pipelines that power our next-generation foundation models. Sitting at the intersection of distributed infrastructure, multimodal processing, and mathematics, you will design systems that transform Reddit’s unique corpus of human conversation—petabytes of text, images, and video—into high-quality training signals. You will move beyond flat text processing to engineer solutions that respect the complex, tree-structured nature of Reddit threads, ensuring our models learn the nuance of community interaction.
Responsibilities:
- Architect and implement high-throughput, deterministic data sampling systems capable of feeding distributed training clusters at frontier-model scale.
- Design and execute dynamic curriculum learning strategies, creating systems that automatically adjust data distributions (text vs. multimodal) during training to improve model stability and reasoning capabilities.
- Engineer the logic for serializing Reddit’s complex conversational trees (threads, subreddits, cross-posts) into optimal training contexts, developing topological data processing strategies that preserve semantic relationships for model understanding.
- Formulate and validate statistical hypotheses regarding data mixtures, leveraging advanced sampling theory to minimize bias and maximize token quality.
- Design the "Safety-First" ingestion layer: Build automated pipelines for PII redaction, toxicity signals, and quality deduplication upstream of training, working closely with our Safety and Moderation Engineering counterparts.
- Bridge the gap between research and engineering by translating theoretical sampling insights into robust, low-latency production infrastructure.
- Mentor senior engineers and researchers on system design, numerical correctness, and performance optimization within distributed Python/Rust environments.
Required Qualifications:
- 8+ years of software engineering experience with a focus on machine learning infrastructure, data science at scale, or LLM pre-training.
- Expert proficiency in Python and distributed data processing frameworks (e.g., Ray Data, Spark, or custom high-performance dataloaders).
- Experience handling unstructured and semi-structured data at scale (not just tabular data), specifically text, code, images, and audio/video.
- Strong mathematical foundation in probability, statistics, and importance sampling theory.
- Deep understanding of pre-training dynamics and the impact of data quality/ordering on model performance.
- Experience working with graph data structures or serializing conversation trees is highly valued.
Nice to Have:
- Experience with JAX or PyTorch internals related to distributed data loading.
- Experience with multimodal datasets (image/video + text) and vision-language preprocessing.
- Proficiency in Rust or C++ for performance-critical data path optimization.
- Published research or significant practical experience in active learning or automated data selection.
Benefits:
- Comprehensive Healthcare Benefits and Income Replacement Programs
- 401k with Employer Match
- Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
- Family Planning Support
- Gender-Affirming Care
- Mental Health & Coaching Benefits
- Flexible Vacation & Paid Volunteer Time Off
- Generous Paid Parental Leave
Pay Transparency:
This job posting may span more than one career level.
In addition to base salary, this job is eligible to receive equity in the form of restricted stock units, and depending on the position offered, it may also be eligible to receive a commission. Additionally, Reddit offers a wide range of benefits to U.S.-based employees, including medical, dental, and vision insurance, 401(k) program with employer match, generous time off for vacation, and parental leave. To learn more, please visit https://www.redditinc.com/careers/.
To provide greater transparency to candidates, we share base salary ranges for all US-based job postings regardless of state. We set standard base pay ranges for all roles based on function, level, and country location, benchmarked against similar-stage growth companies. Final offer amounts are determined by multiple factors, including skills, depth of work experience, and relevant licenses/credentials, and may vary from the amounts listed below.
The base salary range for this position is: $230,000 to $322,000 USD
In select roles and locations, the interviews will be recorded, transcribed and summarized by artificial intelligence (AI). You will have the opportunity to opt out of recording, transcription and summarization prior to any scheduled interviews.
During the interview, we will collect the following categories of personal information: Identifiers, Professional and Employment-Related Information, Sensory Information (audio/video recording), and any other categories of personal information you choose to share with us. We will use this information to evaluate your application for employment or an independent contractor role, as applicable. We will not sell your personal information or disclose it to any third party for their marketing purposes. We will delete any recording of your interview promptly after making a hiring decision. For more information about how we will handle your personal information, including our retention of it, please refer to our Candidate Privacy Policy for Potential Employees and Contractors.
Reddit is proud to be an equal opportunity employer, and is committed to building a workforce representative of the diverse communities we serve. Reddit is committed to providing reasonable accommodations for qualified individuals with disabilities and disabled veterans in our job application procedures. If, due to a disability, you need an accommodation during the interview process, please let your recruiter know.
About Reddit

Reddit is an American proprietary social news aggregation and forum social media platform. Registered users submit content to the site such as links, text posts, images, and videos, which are then voted up or down by other members.
Employees: 1,001-5,000
Headquarters: San Francisco
Valuation: $10B
Reviews
Overall: 3.8 (20 reviews)
Work Life Balance: 3.9
Compensation: 4.0
Culture: 3.6
Career: 4.1
Management: 3.4
Recommend to a Friend: 75%
Pros
Interesting projects and challenges
Good work-life balance and flexible environment
Opportunity for career growth
Cons
Room for improvement in processes
Work-life balance varies by team
Some organizational bureaucracy
Salary Ranges
11 data points
Junior/L3 · Sales Analytics Automation, Central Partner (1 report)
Total: $222,739 / year
Base: $193,773
Stock: -
Bonus: -
Interview Experience
8 interviews
Difficulty: 2.3 / 5
Duration: 14-28 weeks
Offer Rate: 50%
Experience: Positive 13%, Neutral 0%, Negative 87%
Interview Process
1. Application Review
2. Recruiter Screen
3. Technical Phone Screen
4. Onsite/Virtual Interviews
5. Team Matching
6. Offer
Common Questions
Technical Knowledge
Behavioral/STAR
Past Experience
Culture Fit