Role Overview:

We are looking for an experienced Python Data Engineer with strong expertise in PySpark, distributed data engineering, and LLM integration.

The role involves building scalable data pipelines, AI-powered workflows, and enabling data-driven automation using platforms such as Databricks, AWS EMR, and LangChain.

As an SE3 engineer, you will primarily be responsible for hands-on development and delivery, while also contributing to solution design and collaborating with cross-functional teams.

Key Responsibilities:

- Develop and optimize ETL pipelines using PySpark, SparkSQL, and distributed frameworks.

- Use LangChain components (Agents, Toolkits, Vector Stores) to integrate LLM-based solutions into data workflows.

- Implement data transformations, lineage, and governance controls in data platforms.

- Support ML teams in deploying embeddings, retrieval-augmented generation (RAG), and NLP pipelines.

- Build workflows on Databricks and AWS EMR, optimizing for cost and performance.

- Apply best practices in coding, testing, CI/CD, and documentation.

Required Skills:

- Strong proficiency in Python for data engineering.

- Hands-on experience with PySpark, SparkSQL, and ETL design.

- Working knowledge of LangChain, OpenAI APIs, or Hugging Face.

- Experience with Databricks, AWS EMR, or similar cloud data platforms.

- Good understanding of SQL, data modeling, and distributed data systems.

Good-to-Have Skills:

- Familiarity with Google ADK and prompt engineering.

- Experience with vector databases like FAISS, Pinecone, or Chroma.

- Exposure to MLflow, Unity Catalog, or SageMaker.

- Interest in LLM-powered applications and generative AI workflows.