
Senior Administrator - Kubernetes, Terraform
About the role
Job Summary
Cloud Site Reliability Engineer (SRE) – AWS (Linux & Windows)
We are seeking a highly skilled Cloud Site Reliability Engineer (SRE) to ensure the availability, reliability, and performance of cloud infrastructure and applications across AWS environments. The role focuses on managing production systems, driving operational excellence, and enhancing platform stability through automation and continuous improvement.
The ideal candidate will have strong expertise in AWS, Linux, and Windows environments, with hands-on experience in supporting hybrid workloads and mission-critical systems. This includes monitoring system health, responding to incidents, performing root cause analysis, and implementing preventive measures to maintain service reliability in line with defined SLAs.
Key Responsibilities
• Provide end-to-end operational support for AWS cloud infrastructure and hosted applications
• Monitor system health, performance, and availability across Linux and Windows servers
• Respond to incidents, troubleshoot issues, and ensure timely resolution as per SLAs
• Perform root cause analysis (RCA) and implement preventive measures
• Manage and support workloads running on EC2 (Linux & Windows), RDS, and other AWS services
• Work closely with application and DevOps teams to improve system reliability
• Automate repetitive operational tasks using scripts and tools
• Maintain and enhance monitoring, alerting, and logging frameworks
• Develop and maintain runbooks and operational documentation
• Participate in on-call support rotation
Technical Responsibilities:
• Manage and support:
  o Linux servers (RHEL, Ubuntu, Amazon Linux)
  o Windows servers (Windows Server, IIS, .NET apps)
• Implement monitoring using:
  o AWS CloudWatch, Zabbix, Prometheus, Grafana, or similar tools
• Automate tasks using:
  o Shell scripting (Linux)
  o PowerShell (Windows)
• Work with CI/CD pipelines for deployment and automation
• Manage infrastructure using Terraform / CloudFormation (basic to intermediate)
• Ensure system security, patching coordination, and compliance
• Handle log analysis using tools like CloudWatch Logs, ELK, or OpenSearch
• Define and track SLIs, SLOs, and SLAs
Skill Requirements
• Strong hands-on experience with AWS (EC2, VPC, S3, IAM, RDS, CloudWatch)
• Solid experience in Linux administration (mandatory)
• Working knowledge of Windows Server administration (mandatory)
• Experience in incident management and production support
• Familiarity with OS patching on Linux and Windows Server
• Knowledge of monitoring & alerting tools
• Scripting skills:
  o Bash (Linux)
  o PowerShell (Windows)
• Understanding of SRE principles (availability, reliability, automation)
• Experience in troubleshooting performance and system issues
Good to Have Skills:
• Experience with Docker / Kubernetes (EKS)
• Knowledge of Infrastructure as Code (Terraform / CloudFormation)
• Exposure to DevOps practices and CI/CD tools
• Experience in multi-account AWS environments
• Basic understanding of networking and security best practices
Other Requirements
• AWS Certified Solutions Architect – Associate
• AWS Certified SysOps Administrator / DevOps Engineer
• ITIL Foundation
• Windows/Linux certification (optional but beneficial)
Required skills
AWS
Kubernetes
Terraform
Linux
Windows
SRE
Monitoring
RCA
About HCL Technologies
Chennai
Headquarters