HCL Technologies

Senior Apache Spark Technical Lead - Scala, Python

Role: Data Engineering

Level: Lead

Location: Ambattur, India

Work: On-site

Type: Full-time

Posted: 1 week ago


About the role

Job Summary

Job Description: Senior Data Engineer (PySpark / Dataproc / GCP)

We are looking for a Senior Data Engineer with strong hands-on expertise in Python (PySpark) and Google Cloud Dataproc to design, develop, and operate scalable data pipelines on Google Cloud Platform. This role focuses on building reliable, production-grade data solutions across batch and streaming use cases.

Key Responsibilities

• Design, build, and optimize data pipelines using PySpark on Dataproc
• Develop performant, maintainable Spark jobs using Python, with a strong focus on reliability and cost efficiency
• Manage Dataproc clusters, including provisioning, tuning, autoscaling, and ephemeral cluster usage
• Design end-to-end data architectures from ingestion to analytics and downstream consumption
• Collaborate with data consumers, platform teams, and stakeholders to deliver scalable solutions
• Ensure data quality, observability, and operational excellence in production environments

Skill Requirements

Required Skills & Experience

Core Skills: PySpark & Dataproc
• Strong expertise in Python, with extensive hands-on experience using PySpark
• Deep experience developing, tuning, and optimizing Spark batch and streaming workloads
• Practical experience with Google Cloud Dataproc, including:
  o Cluster lifecycle management
  o Initialization actions and custom configurations
  o Autoscaling policies and cost optimization
  o Use of ephemeral clusters for job-based execution
• Solid understanding of Spark internals (execution plans, caching, partitions, joins, shuffles, checkpointing)

Google Cloud Platform (GCP)
• Strong working experience with core GCP services, including:
  o BigQuery for analytics and data warehousing
  o Google Cloud Storage (GCS) as a data lake
  o Cloud Run for containerized data services and microservices
  o Cloud SQL for relational and transactional workloads
  o Pub/Sub for event-driven and streaming ingestion
• Familiarity with IAM, service accounts, and secure service-to-service communication

Programming Languages
• Advanced proficiency in Python for production data pipelines
• Experience with Scala and/or Java for Spark development is a plus
• Ability to write clean, testable, and well-documented code

Data Storage & Processing
• Proven experience designing data lakes on GCS, including:
  o Partitioning strategies and lifecycle management
  o Optimized file formats such as Parquet and Avro
• Strong experience integrating Spark pipelines with BigQuery
• Knowledge of data modeling concepts for analytics and reporting

Workflow Orchestration
• Experience orchestrating pipelines using:
  o Apache Airflow (Cloud Composer), or
  o Native Dataproc job submissions and workflow templates
• Familiarity with monitoring, alerting, retries, and dependency management

Data Pipeline Design
• Strong experience designing and developing end-to-end data pipelines
• Ability to build scalable, fault-tolerant, and maintainable systems
• Hands-on experience implementing data validation, error handling, logging, and monitoring
• Experience working with both batch and streaming processing patterns

Streaming & Event-Driven Processing
• Hands-on experience with streaming data pipelines
• Practical understanding of event-based ingestion and near real-time processing

Other Requirements

1. Relevant certifications in Apache Spark, Scala, or Python are a plus

Required skills

Apache Spark

Scala

Python

About HCL Technologies

HCL Technologies

Headquarters: Ambattur