Benefits & Perks
•Healthcare
•Life Insurance
•Parental Leave
•Learning Budget
•Unlimited PTO
•Wellness
•Gym
Required Skills
SRE
DevOps
AWS
Linux
Docker
Kubernetes
Monitoring
Incident response
Job Overview
We are seeking an experienced Engineer, Site Reliability (SRE) to drive technical excellence within our global Site Reliability Engineering organization.
This role is essential to maintaining and improving the reliability, scalability, and performance of our multi-cloud SaaS platform serving thousands of customers worldwide.
The successful candidate will provide hands-on technical expertise in incident response, system optimization, and reliability engineering practices across our complex technology stack.
This role provides off-hours support as needed.
About Us
When you join iCIMS, you join the team helping global companies transform business and the world through the power of talent.
Our customers do amazing things: design rocket ships, create vaccines, deliver consumer goods globally, overnight, with a smile.
As the Talent Cloud company, we empower these organizations to attract, engage, hire, and advance the right talent.
We’re passionate about helping companies build a diverse, winning workforce and about building our home team.
We're dedicated to fostering an inclusive, purpose-driven, and innovative work environment where everyone belongs.
Responsibilities:
Technical Leadership:
•Contribute as part of a team of 15+ SRE engineers across one or more geographic regions (US, Ireland, or India)
•Provide technical mentorship and knowledge sharing for team members
•Contribute to technical decision-making for complex reliability and performance challenges
•Participate in architecture reviews and provide input on system design for reliability
•Participate in post-incident reviews and support implementation of preventive measures
Incident Management & Response:
•Participate in enterprise-wide incident management, ensuring rapid prevention, detection, response, and resolution
•Develop and maintain runbooks and emergency response procedures
•Contribute to root cause analysis and support comprehensive documentation
•Participate in 24/7 on-call rotation and escalation procedures across global teams
•Interface with Engineering teams and Incident Manager during critical incident resolution
Platform Reliability & Performance:
•Monitor and optimize multi-cloud infrastructure (AWS primary, Azure, GCP)
•Ensure reliability of core services: AWS resources, Auth0/Okta authentication, databases (SQL Server, PostgreSQL, MongoDB), and legacy Java applications
•Implement and maintain SLIs, SLOs, and error budgets for assigned services
•Drive capacity planning and performance optimization initiatives
Automation & Tooling:
•Design automation solutions to reduce manual operational overhead
•Develop monitoring strategies using New Relic, Grafana, and Sumo Logic
•Create infrastructure-as-code for reliable deployments
•Build self-healing systems and automated remediation workflows
Qualifications
Technical Experience:
•5+ years in SRE, DevOps, Software Development, or Infrastructure Engineering roles
•Deep hands-on experience with multi-cloud environments (AWS required, Azure preferred)
•Strong Linux system administration and troubleshooting
•Experience with containerization (Docker) and orchestration (Kubernetes, ECS)
•Proficiency with monitoring tools (New Relic, Grafana, Prometheus)
Leadership & Communication:
•Demonstrated ability to collaborate with and mentor peers on technical teams
•Experience as an incident response participant during critical incidents
•Strong communication skills with engineering teams and stakeholders
•Cross-functional collaboration in agile environments
SRE & Operations:
•Demonstrated success implementing SRE principles in large-scale production environments
•Experience with ITIL frameworks and tools
•Background in establishing and maintaining SLAs for enterprise SaaS products
Education/Certifications/Licenses:
•Bachelor’s degree in Computer Science, Engineering, Information Systems, or related technical field
•Equivalent combination of education and experience will be considered
Preferred:
•Cloud certifications (AWS, Azure, or Google Cloud)
•Kubernetes certifications
•New Relic/Grafana monitoring certifications
•Linux certifications (RHCE, LPIC-2)
EEO Statement
iCIMS is a place where everyone belongs.
We celebrate diversity and are committed to creating an inclusive environment for all employees.
Our approach helps us to build a winning team that represents a variety of backgrounds, perspectives, and abilities.
So, regardless of how your diversity expresses itself, you can find a home here at iCIMS.
We prohibit discrimination and harassment of any kind based on race, color, religion, national origin, sex (including pregnancy), sexual orientation, gender identity, gender expression, age, veteran status, genetic information, disability, or other applicable legally protected characteristics.
If you’d like to request an accommodation due to a disability, please contact us at careers@icims.com.
Compensation and Benefits
Competitive health and wellness benefits include medical insurance (employee and dependent family members), personal accident and group term life insurance, bonding and parental leave, lifestyle spending account reimbursements, wellness services offerings, sick and casual/emergency days, paid holidays, tuition reimbursement, retirals (PF - employer contribution) and gratuity.
Benefits and eligibility may vary by location, role, and tenure.
Learn more here:
https://careers.icims.com/benefits
About iCIMS
Reviews: 4.1 (31 reviews)
Work Life Balance: 3.7
Compensation: 4.5
Culture: 4.4
Career: 4.1
Management: 3.8
Recommend to a Friend: 84%
Pros
Strong engineering culture with focus on code quality
Competitive compensation packages with equity
Flexible remote work options and good work-life balance
Cons
Work-life balance can be challenging during product launches
Fast-paced environment with tight deadlines
Organizational changes and restructuring can be disruptive
Interview Experience
7 interviews
Difficulty: 3.1 / 5
Duration: 14-28 weeks
Offer Rate: 43%
Experience: Positive 43%, Neutral 28%, Negative 29%
Interview Process
1. Application
2. Screening Call
3. Interview
4. Assessment
Common Questions
Case Study
Technical Assessment
ATS Screening
News & Buzz
iCIMS Named a Talent Acquisition Leader by Nucleus Research for Sixth Consecutive Year (Source: PR Newswire, 7w ago)
iCIMS Named #1 in ATS Market Share by APPS RUN THE WORLD (Source: PR Newswire, 12w ago)
The AI-empowered talent portfolio manager (Source: Fast Company, 15w ago)
Why HR should care about AI data centers. Plus, news from Eightfold, iCIMS and more (Source: HR Executive, 19w ago)
