Posted on: 28/01/2026
About the Role:
- Design and build scalable ETL pipelines on GCP using Python (Cloud Run, Dataflow) for multiple data sources (APIs, CSV, XLS, JSON, SDMX); a minimal pipeline sketch follows this list.
- Perform schema mapping & data modeling using LLM-based auto-schematization; define Statistical Variables and generate MCF/TMCF (Meta Content Format / Template MCF).
- Implement entity resolution and standardized ID generation.
- Integrate curated data into the Knowledge Graph with proper versioning and governance.
- Develop and maintain REST & SPARQL APIs using Apigee.
- Ensure data quality, validation, and anomaly detection; troubleshoot ingestion issues.
- Drive automation and optimization by partnering with Automation and Managed Service PODs.
- Collaborate with cross-functional teams and stakeholders.
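To give candidates a flavor of the pipeline work described above, here is a minimal Apache Beam (Python) sketch of a CSV-to-BigQuery ingestion flow. This is an illustration only, not the team's actual pipeline; the bucket, project, dataset, table, and column names are hypothetical placeholders.

    # Minimal sketch: read a CSV from Cloud Storage, parse rows, and
    # append them to an existing BigQuery table. All resource names
    # below are hypothetical.
    import csv
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_row(line):
        # Assumed columns: name, value, date.
        name, value, date = next(csv.reader([line]))
        return {"name": name, "value": float(value), "date": date}

    def run():
        # Pass --runner=DataflowRunner --project=... to run on Dataflow.
        options = PipelineOptions()
        with beam.Pipeline(options=options) as p:
            (p
             | "Read" >> beam.io.ReadFromText(
                   "gs://example-bucket/input.csv", skip_header_lines=1)
             | "Parse" >> beam.Map(parse_row)
             | "Write" >> beam.io.WriteToBigQuery(
                   "example-project:example_dataset.observations",
                   create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                   write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

    if __name__ == "__main__":
        run()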
Mandatory Skills:
- Strong hands-on experience with Google Cloud Platform (GCP): Cloud Storage, Cloud SQL, Cloud Run, Dataflow/Apache Beam, Pub/Sub, BigQuery, Apigee.
- Python & SQL for data engineering and pipeline development (a sample validation sketch follows this list).
- Proven expertise in data wrangling, cleaning, and transformation across varied data formats.
- Experience with Git-based version control.
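As a purely illustrative example of the Python-plus-SQL work listed above, the sketch below runs a simple completeness check with the google-cloud-bigquery client; the project, dataset, table, and column names are assumptions.

    # Illustrative only: count rows failing a basic completeness check.
    from google.cloud import bigquery

    def count_bad_rows(table="example-project.example_dataset.observations"):
        client = bigquery.Client()
        query = f"""
            SELECT COUNT(*) AS bad_rows
            FROM `{table}`
            WHERE value IS NULL OR date IS NULL
        """
        result = client.query(query).result()  # waits for the job
        return next(iter(result)).bad_rows

    if __name__ == "__main__":
        print("rows failing validation:", count_bad_rows())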
Additional Skills:
- Knowledge of data modeling, schema design, and knowledge graphs (Schema.org, RDF, SPARQL, JSON-LD); a sample query sketch follows this list.
- Familiarity with CI/CD (Cloud Build) and Agile delivery.
- Strong problem-solving skills and attention to detail.
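For the knowledge-graph skills above, here is a small sketch of querying a SPARQL endpoint from Python over the standard SPARQL HTTP protocol. The endpoint URL and the Schema.org-typed query are assumptions for demonstration, not a reference to any specific graph.

    # Illustrative only: run a SELECT query against a SPARQL endpoint.
    import requests

    ENDPOINT = "https://example.org/sparql"  # placeholder endpoint

    QUERY = """
    PREFIX schema: <http://schema.org/>
    SELECT ?dataset ?name WHERE {
      ?dataset a schema:Dataset ;
               schema:name ?name .
    } LIMIT 10
    """

    def run_query():
        resp = requests.get(
            ENDPOINT,
            params={"query": QUERY},
            headers={"Accept": "application/sparql-results+json"},
        )
        resp.raise_for_status()
        for b in resp.json()["results"]["bindings"]:
            print(b["dataset"]["value"], b["name"]["value"])

    if __name__ == "__main__":
        run_query()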
Preferred Qualifications:
- Experience with LLM-based data automation / auto-schematization.
- Exposure to large-scale public or open dataset integrations.
- Experience handling multilingual datasets.
Posted by
Lishta Jain
Senior Talent Acquisition Partner at Cloudsufi India Private Limited
Posted in
Data Engineering
Functional Area
Big Data / Data Warehousing / ETL
Job Code
1606837