
Senior Apache Spark Technical Lead - Scala, Python
About the role
Job Summary
Job Description: Senior Data Engineer (PySpark / Dataproc / GCP)
We are looking for a Senior Data Engineer with strong hands-on expertise in Python (PySpark) and Google Cloud Dataproc to design, develop, and operate scalable data pipelines on Google Cloud Platform. This role focuses on building reliable, production-grade data solutions across batch and streaming use cases.
Key Responsibilities
• Design, build, and optimize data pipelines using PySpark on Dataproc
• Develop performant, maintainable Spark jobs using Python, with a strong focus on reliability and cost efficiency
• Manage Dataproc clusters, including provisioning, tuning, autoscaling, and ephemeral cluster usage
• Design end-to-end data architectures from ingestion to analytics and downstream consumption
• Collaborate with data consumers, platform teams, and stakeholders to deliver scalable solutions
• Ensure data quality, observability, and operational excellence in production environments
Skill Requirements
Required Skills & Experience
Core Skills: PySpark & Dataproc
• Strong expertise in Python, with extensive hands-on experience using PySpark
• Deep experience developing, tuning, and optimizing Spark batch and streaming workloads
• Practical experience with Google Cloud Dataproc, including:
  o Cluster lifecycle management
  o Initialization actions and custom configurations
  o Autoscaling policies and cost optimization
  o Use of ephemeral clusters for job-based execution
• Solid understanding of Spark internals (execution plans, caching, partitions, joins, shuffles, checkpointing)
Google Cloud Platform (GCP)
• Strong working experience with core GCP services, including:
  o BigQuery for analytics and data warehousing
  o Google Cloud Storage (GCS) as a data lake
  o Cloud Run for containerized data services and microservices
  o Cloud SQL for relational and transactional workloads
  o Pub/Sub for event-driven and streaming ingestion
• Familiarity with IAM, service accounts, and secure service-to-service communication
Programming Languages
• Advanced proficiency in Python for production data pipelines
• Experience with Scala and/or Java for Spark development is a plus
• Ability to write clean, testable, and well-documented code
Data Storage & Processing
• Proven experience designing data lakes on GCS, including:
  o Partitioning strategies and lifecycle management
  o Optimized file formats such as Parquet and Avro
• Strong experience integrating Spark pipelines with BigQuery
• Knowledge of data modeling concepts for analytics and reporting
Workflow Orchestration
• Experience orchestrating pipelines using:
  o Apache Airflow (Cloud Composer), or
  o Native Dataproc job submissions and workflow templates
• Familiarity with monitoring, alerting, retries, and dependency management
Data Pipeline Design
• Strong experience designing and developing end-to-end data pipelines
• Ability to build scalable, fault-tolerant, and maintainable systems
• Hands-on experience implementing data validation, error handling, logging, and monitoring
• Experience working with both batch and streaming processing patterns
Streaming & Event-Driven Processing
• Hands-on experience with streaming data pipelines
• Practical understanding of event-based ingestion and near real-time processing
Other Requirements
1. Relevant certifications in Apache Spark, Scala, or Python are a plus
Required skills
Apache Spark
Scala
Python
About HCL Technologies
Ambattur
Headquarters