Job : PySpark/Databricks Engineer
Open for Multiple Locations with WFO and WFH
Job Description :
We are looking for a PySpark solutions developer and data engineer who can design and build solutions for one of our Fortune 500 client programs, which aims to build a Hadoop cluster for data standardization and curation.
This high-visibility, fast-paced key initiative will integrate data across internal and external sources, provide analytical insights, and integrate with the customer's critical systems.
Key Responsibilities :
- Ability to design, build, and unit test applications on the Spark framework in Python.
- Build PySpark-based applications for both batch and streaming requirements, which requires in-depth knowledge of the broader Hadoop ecosystem and NoSQL databases as well (see the sketch after this list).
- Develop and execute data pipeline testing processes and validate business rules and policies.
- Optimize performance of the Spark applications built on Hadoop using configurations around SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Optimize performance for data access requirements by choosing the appropriate native Hadoop file formats (Avro, Parquet, ORC, etc.) and compression codecs.
- Ability to design and build real-time applications using Apache Kafka and Spark Streaming.
- Build integrated solutions leveraging Unix shell scripting, RDBMS, Hive, the HDFS file system, HDFS file types, and HDFS compression codecs.
- Build data tokenization libraries and integrate them with Hive and Spark for column-level obfuscation.
- Process large amounts of structured and unstructured data, including integrating data from multiple sources.
- Create and maintain an integration and regression testing framework on Jenkins, integrated with Bitbucket and/or Git repositories.
- Participate in the agile development process; document and communicate issues and bugs related to data standards in scrum meetings.
- Work collaboratively with onsite and offshore teams.
- Develop and review technical documentation for the artifacts delivered.
- Ability to solve complex data-driven scenarios and triage defects and production issues.
- Ability to learn, unlearn, and relearn concepts with an open and analytical mindset.
- Participate in code release and production deployment.
- Challenge and inspire team members to achieve business results in a fast-paced and quickly changing environment.
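As an illustration of the batch and streaming work above, here is a minimal PySpark Structured Streaming sketch: it reads JSON events from Kafka, applies a simple validation rule, hashes one column as a stand-in for a real tokenization library, and writes snappy-compressed Parquet to HDFS. The broker, topic, schema, and paths are illustrative assumptions, not the client's actual pipeline, and the Kafka source assumes the spark-sql-kafka package is on the classpath.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = (
    SparkSession.builder
    .appName("events-curation")                      # hypothetical app name
    .config("spark.sql.shuffle.partitions", "200")   # tune for cluster size
    .getOrCreate()
)

# Illustrative event schema; the real contract would come from the source system.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

# Streaming read from Kafka (broker and topic names are placeholders).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .load()
)

events = (
    # Kafka delivers bytes; decode the value and parse the JSON payload.
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .filter(F.col("event_id").isNotNull())            # simple business-rule validation
    # Column-level obfuscation: SHA-256 hash as a stand-in for a tokenization library.
    .withColumn("event_id", F.sha2(F.col("event_id"), 256))
)

# Write the curated stream as Parquet on HDFS (paths are placeholders).
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///curated/events")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```

Swapping `readStream`/`writeStream` for `read`/`write` turns the same transformation logic into a batch job, which is why teams often share one codebase for both modes.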
Qualifications :
- BE/B.Tech/B.Sc. in Computer Science, Statistics, or Econometrics from an accredited college or university.
- Minimum 3 years of extensive experience in the design, build, and deployment of PySpark-based applications.
- Expertise in handling complex, large-scale Big Data environments (preferably 20 TB+).
- Minimum 3 years of experience with Hive, YARN, and HDFS, preferably on the Hortonworks Data Platform.
- Good implementation experience with OOP concepts.
- Hands-on experience writing complex SQL queries and exporting and importing large amounts of data using utilities (see the sketch after this list).
- Ability to build abstracted, modularized, reusable code components.
- Hands-on experience generating and parsing XML and JSON documents and REST API requests/responses.
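As a small illustration of the SQL and REST items above, here is a hedged sketch that runs an aggregation against a Hive table and posts the result to a REST endpoint as JSON. The schema, table, and column names and the endpoint URL are hypothetical, and it assumes Hive support is configured and the third-party requests library is installed.

```python
import json

import requests  # third-party HTTP client; assumed available
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-metrics-export")  # hypothetical app name
    .enableHiveSupport()             # required to query Hive-managed tables
    .getOrCreate()
)

# Aggregate the last week of transactions (schema/table/columns are placeholders).
daily = spark.sql("""
    SELECT txn_date,
           region,
           SUM(amount) AS total_amount,
           COUNT(*)    AS txn_count
    FROM   curated.transactions
    WHERE  txn_date >= date_sub(current_date(), 7)
    GROUP  BY txn_date, region
""")

# Serialize the (small) result set and POST it to a hypothetical REST endpoint.
payload = [row.asDict() for row in daily.collect()]  # collect() is safe only for small results
resp = requests.post(
    "https://example.internal/api/metrics",          # placeholder URL
    data=json.dumps(payload, default=str),           # default=str serializes dates
    headers={"Content-Type": "application/json"},
    timeout=30,
)
resp.raise_for_status()
```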
Posted in : Data Analytics & BI
Functional Area : Backend Development
Job Code : 1510417