Posted on: 18/03/2026
About NStarX:
NStarX is an AI-first, Cloud-first engineering services provider built and led by practitioners. We specialize in transforming businesses through cutting-edge technology solutions. With years of expertise, we deliver scalable, data-driven systems that empower our clients to make smarter, faster decisions.
Role Summary:
We are seeking a Software Engineer (Data Engineering) who can seamlessly integrate the roles of a Data Engineer and Data Scientist.
The ideal candidate will design robust data pipelines, build AI/ML models, and deliver data-driven insights that address complex business challenges.
This is a client-facing role requiring close collaboration with US-based stakeholders, and the candidate must be flexible to work in alignment with US time zones when needed.
Key Responsibilities:
Data Engineering:
- Design, build, and maintain scalable ETL/ELT pipelines for large-scale data processing.
- Develop and optimize data architectures supporting analytics and ML workflows.
- Ensure data integrity, security, and compliance with organizational and industry standards.
- Collaborate with DevOps teams to deploy and monitor data pipelines in production environments.
Data Science & AI/ML:
- Build predictive and prescriptive models leveraging AI/ML techniques.
- Develop and deploy machine learning and deep learning models using TensorFlow, PyTorch, or scikit-learn.
- Perform feature engineering, statistical analysis, and data pre-processing.
- Continuously monitor and optimize models for accuracy and scalability.
- Integrate AI-driven insights into business processes and strategies.
Client Interaction:
- Serve as the technical liaison between NStarX and client teams, ensuring clear communication and alignment on deliverables.
- Participate in client discussions, requirement gathering, and design reviews.
- Provide status updates, insights, and recommendations directly to client stakeholders.
- Work flexibly in alignment with US time zones to support real-time collaboration and delivery.
Required Qualifications:
- Experience: 4+ years in Data Engineering and AI/ML roles.
- Education: Bachelor's or Master's degree in Computer Science, Data Science, or a related field.
Technical Skills (Required):
- Languages/Libraries: Python, SQL, Bash, PySpark, Spark SQL, boto3, pandas
- Compute: Apache Spark on EMR (driver/executor model, sizing, dynamic allocation)
- Storage: Amazon S3 (Parquet), lifecycle to Glacier
- Catalog: AWS Glue (Catalog & Crawlers)
- Orchestration/Serverless: AWS Step Functions, AWS Lambda, Amazon EventBridge
- Ingestion: CloudWatch Logs/Metrics, Kinesis Data Firehose (or Kafka/MSK)
- Warehouse: Amazon Redshift + Redshift Spectrum
- Security/Access: IAM (least privilege), Secrets Manager / SSM
- Ops/Collab: Git + CI (Jenkins/GitHub/GitLab), CloudWatch logging/metrics
Nice to Have:
- Scala, Docker, Kubernetes (Spark-on-K8s), k9s
- Fast stores (DynamoDB/MongoDB/Redis) for side lookups/indices
- Databricks, Jupyter
- FinOps exposure (cost baselines, dashboards)
Core Skills (Hands-on Responsibilities):
Data Lake to Data Mart Design:
- Design layered data-lake-to-data-mart models (raw → processed → merged → aggregated).
- Implement Hive-style partitioning (year/month/day) with retention and archival strategies (a layout sketch follows this list).
- Define schema contracts, decision logic, and state machine handoffs.
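To make the partitioning convention concrete, here is a minimal Python sketch of a Hive-style prefix builder for a layered lake. The bucket name, layer names, and dataset are illustrative placeholders, not actual NStarX conventions.

# Hypothetical layout for a layered lake with Hive-style date partitions.
from datetime import date

LAYERS = ["raw", "processed", "merged", "aggregated"]

def partition_prefix(layer: str, dataset: str, d: date) -> str:
    """Build a Hive-style S3 prefix: layer/dataset/year=YYYY/month=MM/day=DD/."""
    assert layer in LAYERS, f"unknown layer: {layer}"
    return (f"s3://example-datalake/{layer}/{dataset}/"
            f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/")

print(partition_prefix("processed", "clickstream", date(2026, 3, 18)))
# -> s3://example-datalake/processed/clickstream/year=2026/month=03/day=18/

Retention then becomes a matter of expiring or archiving whole day prefixes, which keeps deletes and Glacier transitions partition-aligned.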
Spark ETL Development:
- Author robust PySpark/Scala jobs for parsing, flattening, merging, and aggregation.
- Tune performance via broadcast joins, partition pruning, and shuffle control.
- Implement atomic, overwrite-by-partition writes and idempotent operations (see the PySpark sketch below).
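A minimal PySpark sketch of those three patterns together, assuming illustrative dataset paths and column names: a broadcast join against a small dimension table, partition pruning on read, and dynamic (per-partition) overwrite so reruns stay idempotent.

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder.appName("events-merge")
         # Overwrite only the partitions present in the output, not the whole table.
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

events = spark.read.parquet("s3://example-datalake/processed/events/")
devices = spark.read.parquet("s3://example-datalake/processed/dim_device/")

merged = (events
          .filter(F.col("year") == 2026)                  # partition pruning on read
          .join(F.broadcast(devices), "device_id")        # small side shipped to executors
          .groupBy("year", "month", "day", "device_type")
          .agg(F.count("*").alias("event_count")))

(merged.write
 .mode("overwrite")                # with dynamic mode: overwrite-by-partition
 .partitionBy("year", "month", "day")
 .parquet("s3://example-datalake/aggregated/events_by_device/"))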
Warehouse Synchronization:
- Perform idempotent DELETE+INSERT/MERGE into Redshift using enumerated partition filters, as in the sketch below.
- Maintain audit-friendly SQL (deterministic predicates; counts of deleted/inserted/affected rows).
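A hedged sketch of the DELETE+INSERT half of that pattern over an enumerated partition list, using psycopg2 (Redshift speaks the Postgres wire protocol); host, credentials, and table names are placeholders.

import psycopg2

partitions = [(2026, 3, 17), (2026, 3, 18)]  # partitions recomputed upstream

conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                        port=5439, dbname="analytics",
                        user="etl_user", password="***")
with conn, conn.cursor() as cur:  # one transaction: delete+insert commit together
    for y, m, d in partitions:
        # Deterministic predicate, identical on the DELETE and the INSERT.
        cur.execute("DELETE FROM agg.events_by_device "
                    "WHERE year = %s AND month = %s AND day = %s", (y, m, d))
        deleted = cur.rowcount
        cur.execute("INSERT INTO agg.events_by_device "
                    "SELECT * FROM stage.events_by_device "
                    "WHERE year = %s AND month = %s AND day = %s", (y, m, d))
        print(f"{y}-{m:02d}-{d:02d}: deleted={deleted}, inserted={cur.rowcount}")
conn.close()

Logging the deleted/inserted counts per partition is what makes the sync audit-friendly, and rerunning the job converges to the same state.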
Data Quality, Reliability & Observability:
- Build repeatable, scalable, automated ETL pipelines with idempotency and cost efficiency.
- Implement schema drift checks, duplicate prevention, and partition reconciliation (a drift-check sketch follows this list).
- Monitor EMR/K8s lifecycle, cluster right-sizing, and cost tracking (FinOps awareness).
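As one way to implement a schema drift check, the sketch below compares an incoming frame's schema against a pinned contract and fails fast before any write; the contract and field names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Pinned schema contract for the example dataset.
EXPECTED = StructType([
    StructField("device_id", StringType(), True),
    StructField("event_ts", LongType(), True),
    StructField("event_type", StringType(), True),
])

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://example-datalake/raw/events/year=2026/month=03/day=18/")

incoming = {(f.name, f.dataType.simpleString()) for f in df.schema.fields}
expected = {(f.name, f.dataType.simpleString()) for f in EXPECTED.fields}
if incoming != expected:
    raise ValueError(f"Schema drift: added={incoming - expected}, missing={expected - incoming}")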
Ingestion & Storage:
- Build log/event pipelines (CloudWatch/Kinesis/Firehose) into S3 using gzip + date partitions.
- Manage bucket layout, lifecycle rules (hot → Glacier), and data catalog consistency (see the lifecycle sketch below).
- Understand compression types (gzip, Snappy) and Hive-style directory structures.
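A short boto3 sketch of the hot → Glacier lifecycle rule described above; the bucket, prefix, and day thresholds are placeholders to adjust per retention policy.

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-datalake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "raw-events-to-glacier",
            "Filter": {"Prefix": "raw/events/"},
            "Status": "Enabled",
            # Keep ~90 days hot, archive to Glacier, expire after two years.
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 730},
        }]
    },
)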
Orchestration & Automation:
- Implement AWS Step Functions with Choice/Map/Parallel states, retries, and backoff mechanisms (sketched below).
- Automate scheduling via EventBridge and deploy guardrail Lambdas.
- Parameterize pipelines for environments (dev/stage/prod) and selective recomputation.
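A condensed sketch of a Step Functions definition with a Choice state and exponential-backoff retries, registered via boto3; every ARN, name, and threshold here is a placeholder.

import json
import boto3

definition = {
    "StartAt": "CheckPartitions",
    "States": {
        "CheckPartitions": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:check-partitions",
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 30,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,       # exponential backoff between attempts
            }],
            "Next": "AnyNewData"
        },
        "AnyNewData": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.partitionCount",
                         "NumericGreaterThan": 0, "Next": "RunSparkStep"}],
            "Default": "NothingToDo"
        },
        "RunSparkStep": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Parameters": {"ClusterId": "j-EXAMPLE",
                           "Step": {"Name": "daily-merge", "ActionOnFailure": "CONTINUE",
                                    "HadoopJarStep": {"Jar": "command-runner.jar",
                                                      "Args": ["spark-submit", "merge.py"]}}},
            "End": True
        },
        "NothingToDo": {"Type": "Succeed"}
    }
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(name="daily-events-merge",
                         definition=json.dumps(definition),
                         roleArn="arn:aws:iam::123456789012:role/example-sfn-role")

An EventBridge schedule rule would then target this state machine for the daily run.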
Soft Skills:
- Strong analytical and problem-solving capabilities.
- Excellent communication for client engagement and stakeholder presentations.
- Proven ability to work flexibly with global teams, especially US-based customers.
- Team-oriented, proactive, and adaptable in fast-paced environments.
Preferred Qualifications:
- Experience with MLOps and end-to-end AI/ML deployment pipelines.
- Knowledge of NLP and Computer Vision.
- Certifications in AI/ML, AWS, Azure, or GCP.
Benefits:
- Competitive salary and performance-based incentives.
- Opportunity to work on cutting-edge AI/ML projects.
- Exposure to global clients and international project delivery.
- Continuous learning and professional development opportunities.
Posted in: Data Engineering
Functional Area: Data Engineering
Job Code: 1621548