**Team:** Data Platform / Engineering
**Experience:** 5-7 years
At Apna, data is central to how we build products, understand users, improve employer outcomes, power recommendations, and scale decision-making. This role gives you the opportunity to build the backbone of Apna’s data platform and influence how data is used across the company.
You will work on real-world, high-scale problems across jobs, users, employers, communities, matching, growth, and AI-driven systems.
Apna is looking for a **Lead / Staff Data Engineer** to build and scale our core data platform. This role will work on large-scale data pipelines, lakehouse architecture, query platforms, workflow orchestration, and data reliability systems that power analytics, product intelligence, machine learning, business dashboards, experimentation, and operational decision-making across Apna.
We are looking for someone who can think deeply about **data architecture**, design reliable pipelines, improve data quality, and help build a platform that can scale with Apna’s growth.
You will be responsible for designing, building, and operating critical parts of Apna’s data platform, including:
- Building scalable batch and near-real-time data pipelines across product, business, growth, and ML use cases.
- Designing and improving our lakehouse architecture using technologies like **Apache Hudi**.
- Working with query engines such as **Presto / Trino** for large-scale analytical workloads.
- Building and maintaining orchestration workflows using **Apache Airflow**.
- Creating reusable data models, curated datasets, and reliable data marts for analytics and product teams.
- Improving data platform reliability, observability, SLA tracking, lineage, and data quality checks.
- Optimizing storage, compute, query performance, and pipeline costs.
- Partnering with product, analytics, ML, and backend engineering teams to understand data needs and convert them into scalable platform solutions.
- Driving engineering standards around data modeling, schema evolution, partitioning, deduplication, backfills, replayability, and pipeline ownership.
- Mentoring data engineers and influencing architecture decisions across teams.
**What We’re Looking For**
- Strong experience in **data engineering**, preferably at scale.
- Hands-on experience with **Apache Airflow** or similar orchestration systems.
- Strong knowledge of **Presto / Trino** or other distributed query engines.
- Good understanding of **Apache Hudi** concepts such as:
  + Copy-on-write vs merge-on-read
  + Upserts and deletes
  + Incremental reads
  + Compaction
  + Clustering
  + Timeline and commits
  + Schema evolution
  + Partitioning strategy
- Strong knowledge of distributed data processing and storage systems.
- Ability to design and build reliable ETL / ELT pipelines.
- Strong SQL skills and ability to debug complex data issues.
- Good understanding of different data architectures, including:
  + Data warehouse
  + Data lake
  + Lakehouse
  + Lambda architecture
  + Kappa architecture
  + Medallion architecture
  + Event-driven data architecture
- Experience with data modeling for analytics and reporting.
- Strong programming skills in at least one language such as **Python, Java, or Scala**.
- Ability to reason about trade-offs between freshness, cost, reliability, latency, and complexity.
- Strong debugging and production ownership mindset.
- Experience with Kafka, Spark, Flink, Hive, Iceberg, Delta Lake, or BigQuery.
- Experience building internal data platforms or self-serve data infrastructure.
- Experience with data quality frameworks such as Great Expectations, Deequ, Soda, or custom validation systems.
- Exposure to ML feature pipelines or feature stores.
- Experience with metadata management, data catalogs, lineage, and governance.
- Experience with cloud infrastructure such as AWS, GCP, or Azure.
- Understanding of privacy, compliance, PII handling, and access control in data systems.
**What Success Looks Like**
**In this role, success means:**
- Critical business and product datasets are reliable, discoverable, and trusted.
- Pipelines are observable, recoverable, and have clear SLAs.
- Query performance improves across major analytical workloads.
- Data freshness and quality issues decrease significantly.
- Teams can build on top of the data platform faster without reinventing pipelines.
- The platform can scale with Apna’s user, job, employer, and engagement data.