Key Responsibilities and Skills:
GCP Data Services:
- Hands-on experience with GCP data services such as BigQuery, Dataflow, Dataproc, Cloud Storage, Pub/Sub, and Data Fusion.
PySpark Development:
- Designing, developing, and optimizing PySpark applications for large-scale data processing, including ETL (Extract, Transform, Load) operations, data transformations, and aggregations on distributed datasets.
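A minimal sketch of the kind of PySpark ETL work this covers, assuming a Dataproc cluster (or any cluster with the Cloud Storage connector); bucket paths and column names are hypothetical placeholders:

    # Minimal PySpark ETL sketch: read raw CSVs from Cloud Storage, transform,
    # aggregate, and write curated Parquet output back to Cloud Storage.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("orders-daily-etl").getOrCreate()

    # Extract: load raw order records (placeholder GCS path).
    orders = spark.read.option("header", True).csv("gs://example-raw-bucket/orders/*.csv")

    # Transform: cast types, drop invalid rows, derive an order date.
    clean = (
        orders
        .withColumn("amount", F.col("amount").cast("double"))
        .filter(F.col("amount").isNotNull())
        .withColumn("order_date", F.to_date("order_ts"))
    )

    # Aggregate: daily revenue per country.
    daily = clean.groupBy("order_date", "country").agg(F.sum("amount").alias("revenue"))

    # Load: write the curated output, partitioned by date (placeholder path).
    daily.write.mode("overwrite").partitionBy("order_date").parquet("gs://example-curated-bucket/daily_revenue/")
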
Data Pipeline Design:
- Building robust and scalable data pipelines for data ingestion, processing, and delivery, often incorporating real-time and batch processing requirements.
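A minimal streaming-ingestion sketch using the Apache Beam Python SDK (the basis of Dataflow pipelines), reading JSON events from Pub/Sub into BigQuery; the project, subscription, table, and schema are hypothetical:

    # Streaming ingestion sketch: Pub/Sub -> parse JSON -> BigQuery.
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)  # run with --runner=DataflowRunner in practice

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                subscription="projects/example-project/subscriptions/events-sub")
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "example-project:analytics.events",
                schema="event_id:STRING,event_ts:TIMESTAMP,payload:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
        )
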
Performance Optimization:
- Tuning PySpark jobs and GCP data services for optimal performance, cost-efficiency, and resource utilization.
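A sketch of common PySpark tuning levers (shuffle parallelism, adaptive query execution, broadcast joins, caching, output-file control); the specific values are illustrative, not recommendations for any particular workload:

    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder
        .appName("tuned-job")
        .config("spark.sql.shuffle.partitions", "200")   # match shuffle parallelism to data volume
        .config("spark.sql.adaptive.enabled", "true")    # let AQE coalesce small partitions
        .getOrCreate()
    )

    facts = spark.read.parquet("gs://example-curated-bucket/daily_revenue/")
    dims = spark.read.parquet("gs://example-curated-bucket/country_dim/")

    # Broadcast the small dimension table to avoid a shuffle join.
    joined = facts.join(F.broadcast(dims), "country")

    # Cache only when the result is reused by several downstream actions.
    joined.cache()

    # Avoid producing thousands of tiny output files.
    joined.coalesce(16).write.mode("overwrite").parquet("gs://example-curated-bucket/enriched/")
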
Data Modeling and Architecture:
- Understanding data modeling principles, designing efficient data schemas, and contributing to overall data warehouse and lake architectures on GCP.
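One concrete data-modeling task on GCP, sketched with the google-cloud-bigquery client: defining a partitioned, clustered BigQuery table. Dataset, table, and field names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    schema = [
        bigquery.SchemaField("event_id", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("event_ts", "TIMESTAMP", mode="REQUIRED"),
        bigquery.SchemaField("country", "STRING"),
        bigquery.SchemaField("revenue", "NUMERIC"),
    ]

    table = bigquery.Table("example-project.analytics.daily_revenue", schema=schema)
    # Partition by day on the event timestamp and cluster by country to cut scan costs.
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY, field="event_ts")
    table.clustering_fields = ["country"]

    client.create_table(table, exists_ok=True)
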
Python Proficiency:
- Strong programming skills in Python for scripting, automation, data manipulation (using libraries like Pandas), and integrating with GCP services.
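A small Python sketch combining pandas with the GCP client libraries, assuming the google-cloud-storage and google-cloud-bigquery packages (plus pyarrow for the DataFrame load) are installed; bucket, project, and table names are hypothetical:

    import io
    import pandas as pd
    from google.cloud import storage, bigquery

    # Pull a raw file from Cloud Storage into a DataFrame.
    blob = storage.Client().bucket("example-raw-bucket").blob("exports/customers.csv")
    df = pd.read_csv(io.BytesIO(blob.download_as_bytes()))

    # Light data manipulation with pandas.
    df["signup_date"] = pd.to_datetime(df["signup_date"]).dt.date
    summary = df.groupby("country", as_index=False)["customer_id"].count()
    summary = summary.rename(columns={"customer_id": "customers"})

    # Load the result into BigQuery and wait for the job to finish.
    bq = bigquery.Client()
    job = bq.load_table_from_dataframe(summary, "example-project.analytics.customers_by_country")
    job.result()
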
SQL Expertise:
- Proficient in SQL for querying data in BigQuery and other data sources, as well as for data validation and analysis.
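A sketch of a data-validation query run against BigQuery from Python; the table and the checks are illustrative only:

    from google.cloud import bigquery

    client = bigquery.Client()

    validation_sql = """
    SELECT
      COUNT(*) AS row_count,
      COUNTIF(revenue IS NULL) AS null_revenue,
      MAX(event_ts) AS latest_event
    FROM `example-project.analytics.daily_revenue`
    WHERE DATE(event_ts) = CURRENT_DATE()
    """

    # Fail loudly if today's partition is empty or contains null revenue values.
    row = next(iter(client.query(validation_sql).result()))
    if row.row_count == 0 or row.null_revenue > 0:
        raise ValueError(f"Validation failed: {row.row_count} rows, {row.null_revenue} null revenue values")
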
Troubleshooting and Monitoring:
- Identifying and resolving issues in data pipelines, monitoring data quality and pipeline health, and implementing error handling mechanisms.
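A generic sketch of one error-handling pattern: structured logging plus retries with exponential backoff. The step being retried is a placeholder; in a real deployment the logs would typically feed Cloud Logging/Monitoring alerts:

    import logging
    import time

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("pipeline")

    def run_with_retries(step, max_attempts=3, base_delay=5):
        """Run a pipeline step, retrying transient failures with backoff."""
        for attempt in range(1, max_attempts + 1):
            try:
                return step()
            except Exception:
                log.exception("Step failed (attempt %d/%d)", attempt, max_attempts)
                if attempt == max_attempts:
                    raise  # surface the failure so the orchestrator can alert
                time.sleep(base_delay * 2 ** (attempt - 1))

    # Example usage with a hypothetical load step:
    # run_with_retries(lambda: load_daily_revenue("2024-01-01"))
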
Collaboration and Communication:
- Working effectively with data scientists, analysts, and other stakeholders to understand data requirements and deliver solutions.
Version Control and DevOps:
- Familiarity with version control systems (e.g., Git) and with DevOps practices for continuous integration and deployment (CI/CD) of data solutions.