Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada)
Remote, Ontario, Canada · Remote · Full-time · 1mo ago
Benefits
• Remote Work
Required Skills
Site Reliability Engineering
Incident Management
Distributed Systems
Observability
Kubernetes
AWS
GCP
Azure
CI/CD
Technical Writing
We’re not just building better tech. We’re rewriting how data moves and what the world can do with it. With Confluent, data doesn’t sit still. Our platform puts information in motion, streaming in near real-time so companies can react faster, build smarter, and deliver experiences as dynamic as the world around them.
It takes a certain kind of person to join this team. Those who ask hard questions, give honest feedback, and show up for each other. No egos, no solo acts. Just smart, curious humans pushing toward something bigger, together.
One Confluent. One Team. One Data Streaming Platform.
ABOUT THE ROLE:
Confluent Cloud processes millions of events per second across AWS, GCP, and Azure. When incidents happen in a multi-cloud streaming platform, they happen at scale—data in motion, exactly-once semantics, and cascading failure modes that require deep systems thinking. We need an expert-level engineer who can drive proactive reliability improvements that prevent these incidents before they occur.
This role combines hands-on technical work with strategic program ownership. You'll spend roughly 75% of your time on engineering: building automation, improving tooling, analyzing systemic failure patterns, and designing reliability improvements. The remaining 25% is teaching and coordination: coaching teams through post-mortems, training incident commanders, and evolving our incident response practices.
You'll be part of a global team with follow-the-sun coverage and clean handoffs that keep everyone working sustainable hours. This role sits within Cloud Architecture and Reliability - Supportability, a horizontal team that owns reliability standards and tooling across engineering. You're the person who makes us need incident management less.
WHAT YOU WILL DO:
- Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
- Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
- Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
- Own standards, practices, and continuous improvement of incident response across engineering
- Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
- Develop and deliver training programs; coach teams through post-mortems
- Partner with engineering leaders to elevate reliability practices org-wide
WHAT YOU WILL BRING:
- 10+ years of relevant experience in SRE, incident management, or reliability engineering
- Cloud experience with at least one of AWS, GCP, or Azure (we run all three)
- Experience navigating reliability/incident programs at organizations with 500+ engineers
- Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
- Strong understanding of distributed systems and failure modes at scale
- Deep experience with observability: metrics, logging, tracing
- Kubernetes and container orchestration experience
- Understanding of CI/CD pipelines and release processes
- Strong written communication (design docs, runbooks, post-mortems)
- Experience driving org-wide process and cultural changes
- Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems
READY TO BUILD WHAT'S NEXT? LET’S GET IN MOTION.
COME AS YOU ARE:
Belonging isn’t a perk here. It’s the baseline. We work across time zones and backgrounds, knowing the best ideas come from different perspectives. And we make space for everyone to lead, grow, and challenge what’s possible.
We’re proud to be an equal opportunity workplace. Employment decisions are based on job-related criteria, without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other classification protected by law.
About Confluent

Confluent (Public)
Confluent, Inc. is an American technology company headquartered in Mountain View, California. Confluent was founded by Jay Kreps, Jun Rao and Neha Narkhede on September 23, 2014, in order to commercialize an open-source streaming platform Apache Kafka, created by the same founders while working at...
Employees: 1,001-5,000
Headquarters: Mountain View
Valuation: $4.6B
Reviews
Overall rating: 3.7 (10 reviews)
Work-life balance: 3.2
Compensation: 3.8
Company culture: 4.1
Career: 3.4
Management: 2.8
Would recommend to a friend: 68%
Pros
- Flexible working hours and remote work options
- Supportive and friendly team dynamics
- Good learning opportunities and new technologies
Areas for improvement
- Heavy and unpredictable workload
- Poor management and lack of leadership direction
- High pressure and fast-paced environment
Salary range (43 data points)
Mid/L4 · SECURITY ENGINEER (1 report)
Total annual compensation: $208,000
Base salary: $160,000
Stock: -
Bonus: -
Interview experience (2 interviews)
Difficulty: 3.0 / 5
Duration: 14-28 weeks
Offer rate: 50%
Experience: Positive 50%, Neutral 50%, Negative 0%
Interview process
1. Application Review
2. Recruiter Screen
3. Online Assessment
4. Technical Interview
5. Final Round Interview
6. Offer
Common question types
- Coding/Algorithm
- System Design
- Behavioral/STAR
- Technical Knowledge