Jobs

Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada)

Confluent

Remote, Ontario, Canada

Remote

Full-time

1mo ago

Benefits and Perks

• Remote Work

Required Skills

Site Reliability Engineering

Incident Management

Distributed Systems

Observability

Kubernetes

AWS

GCP

Azure

CI/CD

Technical Writing

We’re not just building better tech. We’re rewriting how data moves and what the world can do with it. With Confluent, data doesn’t sit still. Our platform puts information in motion, streaming in near real-time so companies can react faster, build smarter, and deliver experiences as dynamic as the world around them.

It takes a certain kind of person to join this team. Those who ask hard questions, give honest feedback, and show up for each other. No egos, no solo acts. Just smart, curious humans pushing toward something bigger, together.

One Confluent. One Team. One Data Streaming Platform.

ABOUT THE ROLE:

Confluent Cloud processes millions of events per second across AWS, GCP, and Azure. When incidents happen in a multi-cloud streaming platform, they happen at scale—data in motion, exactly-once semantics, and cascading failure modes that require deep systems thinking. We need an expert-level engineer who can drive proactive reliability improvements that prevent these incidents before they occur.

This role combines hands-on technical work with strategic program ownership. You'll spend roughly 75% of your time on engineering: building automation, improving tooling, analyzing systemic failure patterns, and designing reliability improvements. The remaining 25% is teaching and coordination: coaching teams through post-mortems, training incident commanders, and evolving our incident response practices.

You'll be part of a global team with follow-the-sun coverage and clean handoffs that keep everyone working sustainable hours. This role sits within Cloud Architecture and Reliability Supportability, a horizontal team that owns reliability standards and tooling across engineering. You're the person who makes us need incident management less.

WHAT YOU WILL DO:

Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
Own standards, practices, and continuous improvement of incident response across engineering
Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
Develop and deliver training programs; coach teams through post-mortems
Partner with engineering leaders to elevate reliability practices org-wide

WHAT YOU WILL BRING:

10+ years of relevant experience in SRE, incident management, or reliability engineering
Cloud experience with at least one of AWS, GCP, or Azure (we run all three)
Experience navigating reliability/incident programs at 500+ engineer organizations
Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
Strong understanding of distributed systems and failure modes at scale
Deep experience with observability: metrics, logging, tracing
Kubernetes and container orchestration experience
Understanding of CI/CD pipelines and release processes
Strong written communication (design docs, runbooks, post-mortems)
Experience driving org-wide process and cultural changes
Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems

READY TO BUILD WHAT'S NEXT? LET’S GET IN MOTION.

COME AS YOU ARE:

Belonging isn’t a perk here. It’s the baseline. We work across time zones and backgrounds, knowing the best ideas come from different perspectives. And we make space for everyone to lead, grow, and challenge what’s possible.

We’re proud to be an equal opportunity workplace. Employment decisions are based on job-related criteria, without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other classification protected by law.

Similar Jobs

Principal DevOps Developer

Autodesk · Montreal; Ontario

Senior Site Reliability Engineer (SRE)

Samsung · 565 Great Northern Way, Vancouver, Canada

Senior DevOps Developer

Fortinet · Burnaby, BC, Canada, CA

Senior AI Platform Engineer

MaintainX · Montreal, Toronto, Vancouver, SF (Remote)

Senior Site Reliability Engineer, Platform & Cloud FinOps (100% Remote - Toronto)

Hopper · Toronto - Remote

About Confluent

Confluent

Public

Confluent, Inc. is an American technology company headquartered in Mountain View, California. Confluent was founded by Jay Kreps, Jun Rao and Neha Narkhede on September 23, 2014, in order to commercialize an open-source streaming platform Apache Kafka, created by the same founders while working at...

1,001-5,000

Employees

Mountain View

Headquarters

$4.6B

Valuation

Reviews

3.7

10 reviews

Work-life balance

3.2

Compensation

3.8

Culture

4.1

Career

3.4

Management

2.8

68%

Recommend to a friend

Pros

Flexible working hours and remote work options

Supportive and friendly team dynamics

Good learning opportunities and new technologies

Cons

Heavy and unpredictable workload

Poor management and lack of leadership direction

High pressure and fast-paced environment

Salary Information

43 data points

Mid/L4

Mid/L4 · SECURITY ENGINEER

1 report

$208,000

Total compensation

Base salary

$160,000

Stock

Bonus

$208,000

Interview Experience

2 interviews

Difficulty

3.0

/ 5

Duration

14-28 weeks

Offer rate

50%

Experience

Positive 50%

Neutral 50%

Negative 0%

Interview Process

Application Review

Recruiter Screen

Online Assessment

Technical Interview

Final Round Interview

Offer

Common Questions

Coding/Algorithm

System Design

Behavioral/STAR

Technical Knowledge

News & Buzz

+3.14% for IBM stock as Confluent acquisition boosts confidence - Traders Union

Traders Union

News

1w ago

IBM set for in-line Q1 as Confluent deal boosts outlook - Proactive financial news

Proactive financial news

News

1w ago

BofA cuts IBM stock price target on Confluent integration timing - Investing.com

Investing.com

News

1w ago

Confluent Inc. (NASDAQ:CFLT) Shows Strong Technical and Growth Momentum Alignment - ChartMill

ChartMill

News

3w ago