HCL Technologies

Senior Site Reliability Engineer Lead

Role: Infrastructure

Level: Lead

Location: Dublin, Ireland

Work: On-site

Type: Full-time

Posted: 1 month ago

About the role

Job Summary

As a Site Reliability Engineer supporting MQ and NATS/Event Broker, you will be responsible for the stability and resilience of Mastercard’s messaging backbone. You will partner closely with application, platform engineering, infrastructure, and security teams to reduce operational risk, improve system reliability, and ensure issues are detected and resolved before they affect customers.

This is a production‑focused engineering role, not an application development role.

Key Responsibilities

Ensure high availability, performance, and resilience of the MQ and NATS/Event Broker platforms across environments.
Participate in on‑call rotations and provide hands‑on support during production incidents.
Lead or contribute to incident triage, mitigation, and service restoration.
Perform root cause analysis (RCA) and drive corrective and preventive actions to closure.
Design, implement, and maintain monitoring, alerting, and dashboards to enable proactive detection.
Support and govern production changes, including upgrades, patching, certificate renewals, and configuration changes.
Assess operational readiness for changes and ensure rollback and validation plans are in place.
Automate operational tasks and workflows to reduce manual effort and improve recovery times.
Partner with application teams to support onboarding, scaling, and operational best practices.
Create and maintain runbooks, SOPs, and operational documentation.
Contribute to continuous improvement of reliability, observability, and operational processes.

Skill Requirements

Experience supporting mission‑critical production systems with on‑call responsibility.
Strong understanding of distributed systems and messaging platforms.
Hands‑on experience with MQ, NATS/Event Broker, or similar middleware technologies.
Experience with monitoring, logging, and alerting tools.
Proficiency in at least one scripting or programming language (e.g., Python, Bash, Java).
Solid knowledge of Linux, networking fundamentals, and system troubleshooting.
Ability to troubleshoot complex, multi‑component issues under pressure.
Experience operating enterprise‑scale messaging or event‑driven platforms.
Familiarity with clustering, replication, persistence, and high‑availability patterns.
Experience working in regulated environments with strong change management practices.
Exposure to automation, reliability engineering, or SRE best practices.

Required skills

Site Reliability Engineering

About HCL Technologies

Headquarters: Dublin