Senior Site Reliability Engineer Lead

Role: Infrastructure

Level: Lead

Location: Hyderabad, India

Work: On-site

Type: Full-time

Posted: 1 day ago

About the role

Job Summary

Position: Senior Observability Engineer (GCP, K8s, Dynatrace)

Position Overview:

We are seeking an experienced Senior Observability Engineer to lead our monitoring strategy on Google Cloud Platform (GCP). This role is critical for ensuring the performance, reliability, and health of our entire cloud infrastructure. The ideal candidate will be an expert in implementing and managing Dynatrace in a GCP environment and will build an end-to-end observability practice.

Key Responsibilities:

Architect and Implement Dynatrace on GCP: Lead the deployment, configuration, and lifecycle management of the Dynatrace platform within our GCP ecosystem, including performing installations and managing upgrades.
Establish End-to-End Observability: Drive the strategy to achieve comprehensive, end-to-end visibility across our applications and infrastructure using Dynatrace. This includes creating mirrored dashboards and key metrics within GCP's native Cloud Monitoring service for consolidated reporting.
Develop Executive Dashboards: Design and maintain high-level executive dashboards that provide a single-pane-of-glass view of overall infrastructure health, service-level objectives (SLOs), and key performance indicators (KPIs).

Detailed expectations for building GCP observability:

GCP Account & Quotas

Quota Limit Utilization: Usage vs. limits for critical resources (e.g., CPUs, IP addresses).

Regional / Zonal Monitoring

Resource Health by Zone/Region: Monitor for zone-specific outages.
Inter-Zone Latency: Communication delay between different zones.

Network (VPC, Interconnects)

Packet Loss Rate: Dropped packets during transmission.
Latency / Round-Trip Time (RTT): Network travel time.
Network Throughput: Data transfer rate (Bytes In/Out).
Firewall Rule Deny Count: Blocked connection attempts.

Compute & GKE (Infrastructure Layer)

Autoscaling Events:
Managed Instance Groups (MIGs): Number of VMs added/removed.
GKE Cluster Autoscaler: Node pool size changes.
Nodes / VMs:
CPU Utilization & Load.
Memory Utilization.
Disk Space Utilization & Disk I/O.

GKE / Application Layer

Pods / Containers:
Pod Kill Reasons: OOMKilled (Out of Memory), Evicted, etc.
Container Restarts.
CPU & Memory Usage vs. Requests/Limits.
Garbage Collection (GC): Pause duration and frequency (for JVM, Go, etc.).
Horizontal Pod Autoscaler (HPA):
Current vs. Desired Pod Replicas.

Middleware (Caching, Messaging)

Caching (e.g., Memorystore):
Cache Hit Ratio (Hits vs. Misses).
Latency & Active Connections.
Messaging (e.g., Kafka, Pub/Sub):
Consumer Lag (critical).
Producer/Consumer Throughput.
Under-replicated Partitions (Kafka).

Managed Services (Load Balancer, Cloud SQL)

Cloud Load Balancing:
Request Count & Latency.
HTTP Error Codes (5xx, 4xx).
Cloud SQL (Databases):
CPU & Memory Utilization.
Active Connections & Replication Lag.

Required skills

GCP

Dynatrace

Observability

Monitoring

Kubernetes