Hiring
Benefits
•Healthcare
•401(k)
•Parental Leave
•Learning
Required Skills
SRE
DevOps
Linux
AWS
Docker
Kubernetes
Monitoring
Leadership
Incident Management
Job Overview
We are seeking an experienced Sr. Engineer, Site Reliability (SRE) to drive technical excellence within our global Site Reliability Engineering organization.
This role is essential to maintaining and improving the reliability, scalability, and performance of our multi-cloud SaaS platform serving thousands of customers worldwide.
The successful candidate will provide hands-on technical expertise in incident response, system optimization, and reliability engineering practices across our complex technology stack.
Off-hours support is required as needed.
This is a hybrid position based at Windmill Lane, Dublin 2, our strategic hub for AI development in Ireland.
Responsibilities:
Technical Leadership:
•Provide technical guidance within a team of 5+ SRE engineers across one or more geographic regions (US, Ireland, or India)
•Provide technical mentorship and skill development for team members
•Contribute to technical decision-making for complex reliability and performance challenges
•Conduct architecture reviews and provide guidance on system design for reliability
•Facilitate post-incident reviews and support implementation of preventive measures
Incident Management & Response:
•Participate in enterprise-wide incident management, ensuring rapid prevention, detection, response, and resolution
•Develop and maintain runbooks and emergency response procedures
•Conduct root cause analysis and ensure comprehensive documentation
•Participate in 24/7 on-call rotation and escalation procedures across global teams
•Interface with Engineering teams and Incident Manager during critical incident resolution
Platform Reliability & Performance:
•Monitor and optimize multi-cloud infrastructure (AWS primary, Azure, GCP)
•Ensure reliability of core services: AWS resources, Auth0/Okta authentication, databases (SQL Server, PostgreSQL, MongoDB), and legacy Java applications
•Implement and maintain SLIs, SLOs, and error budgets for assigned services
•Drive capacity planning and performance optimization initiatives
Automation & Tooling:
•Design automation solutions to reduce manual operational overhead
•Develop monitoring strategies using New Relic, Grafana, and Sumo Logic
•Create infrastructure-as-code for reliable deployments
•Build self-healing systems and automated remediation workflows
Qualifications
Technical Experience:
•6+ years in SRE, DevOps, or Infrastructure Engineering roles, with 2+ years in senior positions
•Deep hands-on experience with multi-cloud environments (AWS required, Azure preferred)
•Strong Linux system administration and troubleshooting
•Experience with containerization (Docker) and orchestration (Kubernetes, ECS)
•Proficiency with monitoring tools (New Relic, Grafana, Prometheus)
Leadership & Communication:
•Proven track record mentoring and guiding technical teams
•Experience serving as technical expert during critical incidents
•Strong communication skills with engineering teams and stakeholders
•Cross-functional collaboration in agile environments
SRE & Operations:
•Demonstrated success implementing SRE principles in large-scale production environments
•Experience with ITIL frameworks and tools
•Background in establishing and maintaining SLAs for enterprise SaaS products
Education/Certifications/Licenses:
•Bachelor’s degree in Computer Science, Engineering, Information Systems, or related technical field
•Equivalent combination of education and experience will be considered
Preferred
•Authentication and identity management systems knowledge
•Infrastructure-as-code tools (Terraform, CloudFormation)
Education/Certifications/Licenses:
•Cloud certifications (AWS, Azure, or Google Cloud)
•Kubernetes certifications
•New Relic/Grafana monitoring certifications
•Linux certifications (RHCE, LPIC-2)
About iCIMS

iCIMS
Series F+
iCIMS, Inc. is a New Jersey-based company that makes cloud-based human resources and recruiting software. The company name is an acronym for Internet Collaborative Information Management Systems.
Employees: 501-1,000
Headquarters: Holmdel
Reviews
Overall rating: 4.1 (31 reviews)
Work-life balance: 3.7
Compensation: 4.5
Culture: 4.4
Career growth: 4.1
Management: 3.8
Recommend to a friend: 84%
Pros
Strong engineering culture with focus on code quality
Competitive compensation packages with equity
Flexible remote work options and good work-life balance
Cons
Work-life balance can be challenging during product launches
Fast-paced environment with tight deadlines
Organizational changes and restructuring can be disruptive
Salary Range
0 data points
Levels: Junior/L3, Intern
Junior/L3 · Technical Account Manager (0 reports)
Total annual compensation: $68,340
Base salary: -
Stock: -
Bonus: -
Range: $58,089 - $78,591
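Interview Experience
7 interviews
Difficulty: 3.1 / 5
Duration: 14-28 weeks
Offer rate: 43%
Experience: Positive 43%, Neutral 28%, Negative 29%
Interview Process
1. Application
2. Screening Call
3. Interview
4. Assessment
Common Questions
Case Study
Technical Assessment
ATS Screening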
News
Healthcare Provider Triples Talent Pipeline and Boosts Applicants Per Opening 17% with ICIMS Candidate Experience Management - PR Newswire (3w ago)
ICIMS Launches Purpose-Built Hiring Solution for Recruiting Frontline Workers - Supply & Demand Chain Executive (4w ago)
Breaking News: What ICIMS March Workforce Report reveals that BLS doesn’t | ep117 - collegerecruiter.com (5w ago)
ICIMS’ spring release focuses on frontline hiring ‘experience layer’ - AIM Group (5w ago)



