Senior+ Software Engineer - Cloud Availability Platform Engineering (Observability)
San Francisco, CA - US · On-site · Full-time · 1mo ago
Required Skills
Python
Kubernetes
Go
Terraform
Crusoe's mission is to accelerate the abundance of energy and intelligence. We’re crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability.
Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.
About the Role:
We are looking for a highly skilled engineer with deep expertise in building and operating observability platforms at scale. You will design, develop, and run Crusoe’s next-generation observability stack, enabling engineers to understand the internal state of distributed systems through metrics, logs, and traces. Your work will ensure reliability, performance, and actionable insights across Crusoe’s global infrastructure and cloud platform.
What You’ll Be Working On:
- Designing and operating scalable observability systems (metrics, logging, tracing) across multi-datacenter Kubernetes environments
- Architecting end-to-end telemetry pipelines, including ingestion, storage, querying, and visualization
- Extending monitoring and alerting with Prometheus, Alertmanager, Thanos/Cortex, Grafana, and OpenTelemetry
- Building scalable log collection and processing pipelines with Fluent Bit, Vector, Loki, or ELK/OpenSearch stacks
- Implementing distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrating them with service meshes, load balancers, and APIs
- Defining and driving adoption of SLOs, SLIs, and error budgets across services and teams
- Automating provisioning and scaling of observability infrastructure with Kubernetes, Terraform, and custom tooling (Go, Python)
- Ensuring reliability and cost efficiency of telemetry pipelines while supporting high-volume workloads (AI/ML, HPC clusters, GPU infrastructure)
- Embedding security best practices into observability platforms, including RBAC, TLS, secret management, and multi-tenant access controls
- Partnering with engineering teams to embed observability into applications, services, and infrastructure
- Mentoring engineers and shaping Crusoe’s observability strategy and technical roadmap
What You’ll Bring to the Team:
- 7+ years of experience in infrastructure or platform engineering, with a focus on observability and monitoring systems
- Deep expertise with metrics systems (Prometheus, Thanos, Mimir, Cortex), logging pipelines (Fluent Bit, Vector, Loki, ELK/OpenSearch), and tracing platforms (Jaeger, Tempo, OpenTelemetry)
- Strong programming skills in Go or Python for automation, operators, and custom integrations
- Experience running observability platforms on Kubernetes and operating them at scale across multi-datacenter environments
- Proven ability to design, optimize, and scale telemetry pipelines handling high-cardinality, high-throughput data
- Solid understanding of distributed systems, performance engineering, and debugging complex workloads
- Strong collaboration skills and the ability to influence engineering teams to adopt observability best practices
Bonus Points:
- Contributions to open source observability projects (Prometheus, OpenTelemetry, Grafana, Loki, etc.)
- Experience supporting AI/ML or GPU-heavy environments with high observability demands
- Knowledge of event-driven or streaming systems (Kafka, NATS, Pulsar) used in telemetry pipelines
- Experience implementing cost optimization strategies for large-scale observability platforms
- Background in incident response, chaos engineering, and reliability practices
Benefits:
- Industry-competitive pay
- Restricted Stock Units in a fast-growing, well-funded technology company
- Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
- Employer contributions to HSA accounts
- Paid Parental Leave
- Paid life insurance, short-term and long-term disability
- Teladoc
- 401(k) with a 100% match up to 4% of salary
- Generous paid time off and holiday schedule
- Cell phone reimbursement
- Tuition reimbursement
- Subscription to the Calm app
- MetLife Legal
- Company-paid commuter benefit: $300 per month
Compensation:
Compensation will be paid in the range of $166,000 - $201,000 + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.
Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
About Crusoe

Crusoe
Series C
Crusoe Energy develops cloud computing infrastructure powered by stranded energy sources to support high-performance computing workloads including AI training and cryptocurrency mining.
- Employees: 201-500
- Headquarters: Denver
- Valuation: $3B
Reviews
Overall rating: 3.7 (10 reviews)
- Work-life balance: 3.2
- Compensation: 2.8
- Culture: 4.1
- Career: 3.4
- Management: 3.6
65% would recommend to a friend
Pros
- Supportive colleagues and great teamwork
- Supportive management
- Good benefits and flexible hours
Cons
- Heavy workload and overtime requirements
- High stress and fast-paced environment
- Below industry standard compensation
Salary Information
28 data points
Senior/L5 · Senior Customer Success Engineer (1 report)
- Total compensation: $214,500
- Base salary: $165,000
- Stock: -
- Bonus: -