
HCL Technologies
Track Lead - Azure DevOps, Terraform
RoleDevops
LevelLead
LocationCalgary, Canada
WorkOn-site
TypeFull-time
Posted1 week ago
About the role
Job Summary
The SRE will be responsible for the reliability, availability, and performance of Azure/AWS PaaS and IaaS workloads. They bridge the gap between development and operations, focusing on building automated systems that prevent failures, managing incident responses, and optimizing cloud costs.
Key Responsibilities
- System Reliability & Monitoring: Design, implement, and maintain comprehensive monitoring and alerting systems such as Azure Monitor, AWS CloudWatch, Application Insights, and Log Analytics.
- Automation & Toil Reduction: Automate repetitive manual operations (toil) such as environment provisioning, system patching, and scaling. Use IaC tools like Terraform and Ansible to manage infrastructure.
- Incident Response & Management: Actively manage incident responses, root cause analysis (RCA), and post-mortem investigations to improve system reliability and minimize mean time to resolution (MTTR).
- Cloud SRE Agent Integration: Deploy and configure Cloud SRE Agent to automate incident investigation, execute remediation steps (restart, scale, rollback), and manage routine tasks.
- Capacity Planning & Scalability: Analyze usage patterns to optimize cloud resources, ensuring high availability and performance while managing costs via Azure Cost Management.
- CI/CD & DevOps Collaboration: Integrate automation workflows into CI/CD pipelines (e.g., GitHub Actions or Azure Pipelines) to ensure reliable deployments.
Skill Requirements
- Cloud Platforms: Expert knowledge of Microsoft Azure infrastructure services (Compute, Storage, Networking, AKS).
- Scripting & Programming: Proficiency in Python, Bash, or PowerShell for building automation tools.
- Infrastructure as Code (IaC): Extensive experience with Terraform and ARM templates/Bicep.
- Observability Tools: Experience with Azure Monitor, Grafana, Prometheus, or Datadog.
- Containers & Orchestration: Solid understanding of Kubernetes/AKS (Azure Kubernetes Service).
- Operating Systems: Proficient in Windows/Linux environments.
- Azure Certification is a + • Exposure to multi Cloud environment is must.
Other Requirements
Reviewing Service Level Objectives (SLOs) and error budgets. 2. Refining auto-scaling rules for Kubernetes clusters based on traffic trends. 3. Working with developers to review service architecture and ensure fault tolerance. 4. Configuring AI-driven alert suppression to reduce alert fatigue. 5. Creating Azure Dashboards to visualize key performance indicators (KPIs).
Required skills
Azure DevOps
Terraform
CI/CD
Automation
About HCL Technologies
Calgary
Headquarters