
Senior Site Reliability Engineer Lead
About the role
Job Summary
Position: Senior Observability Engineer (GCP, K8s, Dynatrace)
Position Overview:
We are seeking an experienced Senior Observability Engineer to lead our monitoring strategy on Google Cloud Platform (GCP). This role is critical for ensuring the performance, reliability, and health of our entire cloud infrastructure. The ideal candidate will be an expert in implementing and managing Dynatrace in a GCP environment and will build an end-to-end observability practice on top of it.
Key Responsibilities:
- Architect and Implement Dynatrace on GCP: Lead the deployment, configuration, and lifecycle management of the Dynatrace platform within our GCP ecosystem, including performing installations and managing upgrades.
- Establish End-to-End Observability: Drive the strategy to achieve comprehensive, end-to-end visibility across our applications and infrastructure using Dynatrace. This includes creating mirrored dashboards and key metrics within GCP's native Cloud Monitoring service for consolidated reporting.
- Develop Executive Dashboards: Design and maintain high-level executive dashboards that provide a single-pane-of-glass view of overall infrastructure health, service-level objectives (SLOs), and key performance indicators (KPIs).
The detailed expectations for building GCP observability are listed below:
GCP Account & Quotas:
- Quota Limit Utilization: Usage vs. limits for critical resources (e.g., CPUs, IP addresses).
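As an illustration of the quota metric above, here is a minimal Python sketch that computes usage as a fraction of the limit and flags quotas nearing exhaustion. The `Quota` class, the sample values, and the 80% threshold are assumptions made for this example; in practice the usage and limit figures would come from GCP or Dynatrace.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Quota:
    """Hypothetical container for one quota reading (not a GCP API type)."""
    name: str
    usage: float
    limit: float


def utilization(q: Quota) -> float:
    """Return usage as a fraction of the limit; 0.0 if the limit is not positive."""
    return q.usage / q.limit if q.limit > 0 else 0.0


def over_threshold(quotas: List[Quota], threshold: float = 0.8) -> List[str]:
    """Return the names of quotas whose utilization meets or exceeds the threshold."""
    return [q.name for q in quotas if utilization(q) >= threshold]


# Sample values are made up for illustration.
quotas = [Quota("CPUS", 180, 200), Quota("IN_USE_ADDRESSES", 12, 64)]
print(over_threshold(quotas))
```

With these sample values only the CPU quota (90% used) crosses the assumed 80% threshold.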
Regional / Zonal Monitoring
- Resource Health by Zone/Region: Monitor for zone-specific outages.
- Inter-Zone Latency: Communication delay between different zones.
Network (VPC, Interconnects)
- Packet Loss Rate: Dropped packets during transmission.
- Latency / Round-Trip Time (RTT): Network travel time.
- Network Throughput: Data transfer rate (Bytes In/Out).
- Firewall Rule Deny Count: Blocked connection attempts.
Compute & GKE (Infrastructure Layer)
- Autoscaling Events:
  - Managed Instance Groups (MIGs): Number of VMs added/removed.
  - GKE Cluster Autoscaler: Node pool size changes.
- Nodes / VMs:
  - CPU Utilization & Load.
  - Memory Utilization.
  - Disk Space Utilization & Disk I/O.
GKE / Application Layer
- Pods / Containers:
  - Pod Kill Reasons: OOMKilled (Out of Memory), Evicted, etc.
  - Container Restarts.
  - CPU & Memory Usage vs. Requests/Limits.
  - Garbage Collection (GC): Pause duration and frequency (for JVM, Go, etc.).
- Horizontal Pod Autoscaler (HPA):
  - Current vs. Desired Pod Replicas.
Middleware (Caching, Messaging)
- Caching (e.g., Memorystore):
  - Cache Hit Ratio (Hits vs. Misses).
  - Latency & Active Connections.
- Messaging (e.g., Kafka, Pub/Sub):
  - Consumer Lag (critical).
  - Producer/Consumer Throughput.
  - Under-replicated Partitions (Kafka).
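Two of the middleware metrics above reduce to simple arithmetic, sketched here in Python. The function names and input shapes are assumptions made for illustration. Consumer lag is computed the way it is commonly defined for Kafka: the latest offset minus the committed offset, summed across partitions.

```python
from typing import Dict


def cache_hit_ratio(hits: int, misses: int) -> float:
    """Hits divided by total lookups; 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0


def consumer_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> int:
    """Sum over partitions of (latest offset - committed offset), floored at 0.

    A partition with no committed offset is treated as committed at 0,
    which is an assumption of this sketch.
    """
    return sum(
        max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    )


# Sample values are made up for illustration.
print(cache_hit_ratio(90, 10))
print(consumer_lag({0: 100, 1: 50}, {0: 90, 1: 50}))
```

Lag that keeps growing over time, rather than a single high reading, is usually the signal worth alerting on.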
Managed Services (Load Balancer, Cloud SQL)
- Cloud Load Balancing:
  - Request Count & Latency.
  - HTTP Error Codes (5xx, 4xx).
- Cloud SQL (Databases):
  - CPU & Memory Utilization.
  - Active Connections & Replication Lag.
Required skills
GCP
Dynatrace
Observability
Monitoring
Kubernetes
About HCL Technologies
Hyderabad
Headquarters