DEV Community: DataDriven

Data Engineer Salaries in 2026: The Numbers Are Lying

DataDriven — Thu, 25 Jun 2026 10:07:39 +0000

Last year I was helping a friend prep for senior data engineer interviews. He'd been building pipelines at a Series B for four years, solid production experience, and wanted to know what number to put in the salary field. So he did what everyone does: checked Glassdoor, Indeed, PayScale, and Levels.fyi.

He got four numbers. They disagreed by almost $120,000.

Glassdoor said $133K. PayScale said $100K. Indeed's senior number sat at $216K. Levels.fyi split the difference at $157K. Same title, same country, same year; four answers that can't all be right. And here's the thing: none of them are lying. They're just counting different people, over different time windows, with different biases baked in. The result is that candidates trying to benchmark their data engineer salary are pricing themselves against a number that doesn't represent their actual market.

This is a problem. In a hiring environment where 52,050 tech workers got laid off in Q1 2026 alone, where senior roles take 60 to 90 days to fill, and where title inflation has made "data engineer" mean three different jobs depending on who's posting, getting your number wrong has real career cost. You either leave $30K on the table or you overshoot and get ghosted. Both outcomes trace back to the same root cause: the data you're benchmarking against is broken.

Why Every Salary Site Disagrees by $120K

Each major compensation source has its own rot problem. Understanding the bias is more useful than trusting the number.

Glassdoor reports $133,484 average across 32,984 submissions. The issue: it's entirely self-reported, and higher earners submit more frequently. The person who just got a $180K offer is more motivated to log it than the person who accepted $115K and moved on. The sample skews up.

PayScale reports roughly $100K. That sounds low because it is; 71% of their data engineering respondents are mid-level or junior. PayScale validates every data point and refreshes on sub-90-day cycles, which makes it the most accurate floor for what actually clears at offer stage. But candidates see $100K and panic. They shouldn't. They're looking at a junior-weighted average.

Indeed sits at $216K for senior roles. The problem here is temporal: Indeed averages job postings going back 36 months. Their June 2026 number includes postings from June 2023, before the layoff waves, before signing bonus compression, before the market shifted. You're benchmarking against fossil data.

Levels.fyi pegs the median at $157,450, but this population skews heavily toward top-tier tech companies and excludes non-tech firms where data engineers earn 20 to 35% less. Google's median is $278K. Capital One's is $130K. That's a $148K spread for the same title on the same platform.

The salary data isn't wrong. It's measuring different populations, different time windows, and different definitions of the job. Once you know which population you're in, the number becomes useful. Until then, it's noise.

The practical damage is real. A mid-market data engineer sees Glassdoor's $133K, anchors there, and never learns that the number includes FAANG outliers pulling the average up. Or worse, they see Indeed's $216K senior figure and counter-offer at a number that makes the hiring manager close the tab.

Role Title Chaos Is Pricing You Against the Wrong Pool

Here's the less obvious problem: the job you're benchmarking might not even be the job you're doing.

Analytics engineers earn $155K to $195K median in 2026. ML engineers command a 38% salary premium over data engineers at mid-career. Data scientists occupy yet another band. These are different roles with different compensation structures. But companies routinely mislabel them.

Analytics engineer postings grew 114% from 2023 to 2024, yet dbt Labs openly admits the title boundaries are blurring. Analysts drift into dbt modeling. Data engineers adopt dbt as standard tooling. The result: a "$150K dbt role" could be transformation work (analytics engineer) or pipeline infrastructure (data engineer), and the salary sites have no idea which one they're counting.

37,000 data engineering jobs post monthly on average, but a significant portion of those are mislabeled analytics engineer, ML engineer, or data scientist roles. When a company posts "Senior Data Engineer" but the job is really dbt plus Snowflake plus stakeholder dashboards, that's an analytics engineer role at data engineer pricing. The candidate benchmarks against infrastructure DE salaries ($115K to $160K) when they should be benchmarking against analytics engineer salaries ($155K to $195K). That's a $35K to $40K miss at either end of the range.

The reverse kills you too. An analytics engineer who sees ML engineer salary data and anchors at $190K gets rejected as "overpriced" for the actual scope.

The litmus test isn't the title. It's the job description. If it says dbt, Snowflake, and "stakeholder reporting," you're an analytics engineer regardless of what the posting calls you. Benchmark accordingly.

2023 Job Ads Are Still Haunting Your 2026 Number

Indeed's 36-month lookback window deserves its own section because the implications are worse than they look.

In 2023, the median data engineer salary was $117,446. By June 2026, Indeed reports $136,776. That looks like 16.5% growth over three years, which isn't terrible. But the number is being held down by every posting from 2023 and 2024 that's still sitting in the average.
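
The growth figure is easy to sanity-check. Here's a minimal sketch that uses only the two medians quoted above; the function names are mine, and the three-year span is an assumption (2023 to mid-2026, rounded).

```python
def pct_growth(old: float, new: float) -> float:
    """Total percent change from old to new."""
    return (new - old) / old * 100


def annualized_growth(old: float, new: float, years: float) -> float:
    """Compound annual growth rate, as a percent."""
    return ((new / old) ** (1 / years) - 1) * 100


# Medians quoted above: 2023 vs. June 2026.
MEDIAN_2023 = 117_446
MEDIAN_2026 = 136_776

total = pct_growth(MEDIAN_2023, MEDIAN_2026)
per_year = annualized_growth(MEDIAN_2023, MEDIAN_2026, years=3)
print(f"total: {total:.1f}%, annualized: {per_year:.1f}%")
```

Annualized, that's only about 5% a year, which is why the headline growth looks respectable and still feels flat.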

Here's what makes this especially misleading: 68% of tech job postings included explicit salary ranges in 2025, up from 45% in 2023. Pay transparency laws made the data more granular. But Indeed weights all 36 months equally. A vague salary range guess from a 2023 pre-transparency posting counts the same as a precise, legally mandated range from 2026. Higher sample size, stale composition.

Then there's the ghost job problem. One-third of employers admit to posting inactive roles. Greenhouse data found 18 to 22% of listings are never filled. Stale 2023 postings are more likely to be dormant, and they're inflating the denominator. You're benchmarking against jobs that don't exist anymore.

The senior role divergence tells the real story. The $42K gap between mid-level ($133K) and senior ($175K) data engineers in 2026 suggests the market has repriced for experience. But the aggregate average is anchored by fossils. If you're mid-career, the number you see is artificially low.

Layoffs Created a Tier, Not a Glut

52,050 tech workers laid off in Q1 2026. A 40% jump over Q1 2025. Oracle cut 21,000. Amazon cut 16,000. Dell cut 11,000. Sounds like a buyer's market.

It's not. Or at least, not uniformly.

Those 52K cuts coexist with 67,000 active software engineering job postings in the same quarter, the highest posting volume in three years. Companies are cutting commoditized roles while hoarding data engineers, ML engineers, and security specialists. A junior full-stack engineer is in a buyer's market; a senior data engineer with Airflow and Spark experience is not. The "layoff market" narrative breaks down completely by skill tier.

But here's the asymmetry that actually matters: companies take 60 to 90 days to fill senior roles because they're running multiple candidates in parallel. Individual candidates spend 3 to 9 months searching. The employer can wait. The candidate runs out of severance. That's where negotiation leverage shifts; not because the market is soft, but because one side has a deadline and the other doesn't.

The data on negotiation is striking. Data engineers who negotiate earn $24,479 more annually, an 18.83% increase. 85% of counter-offers get at least partial acceptance. 70% of hiring managers expect you to negotiate. Only 44% of candidates actually do it. The $120K gap between salary sources is partly a measurement problem, sure. But it's also partly behavioral. The spread between 25th and 75th percentile reflects negotiation winners vs. passive accepters, not just market fragmentation.
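
Those two negotiation figures also tell you what baseline they were measured against. A quick sketch, assuming the dollar gain and the percentage describe the same population (the source stat doesn't say so explicitly):

```python
RAISE_DOLLARS = 24_479  # quoted annual gain from negotiating
RAISE_PCT = 18.83       # quoted percentage increase


def implied_baseline(raise_dollars: float, raise_pct: float) -> float:
    """Offer size before negotiation, backed out of the two quoted figures."""
    return raise_dollars / (raise_pct / 100)


baseline = implied_baseline(RAISE_DOLLARS, RAISE_PCT)
print(f"implied pre-negotiation offer:  ${baseline:,.0f}")
print(f"implied post-negotiation offer: ${baseline + RAISE_DOLLARS:,.0f}")
```

The implied starting offer lands around $130K, close to the Glassdoor average. If your offer is well above or below that, scale the percentage, not the dollar figure.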

Engineers with current cloud and security skills close offers in 2 to 4 weeks. Everyone else faces the full timeline. Skill specificity determines leverage more than market conditions.

What Number to Actually Put in the Field

Stop averaging the averages. Here's the hierarchy of sources, from most to least useful for your career planning:

Levels.fyi is best for FAANG and top-tier tech. Filter by company, level, and location. The by-company variance is massive ($278K at Google vs. $130K at Capital One), so the aggregate median is useless. You need the company-specific number.

Glassdoor is useful for the 25th to 75th percentile range at your target company, if they have enough submissions. The $141K to $219K senior DE range tells you more than the $175K mean.

PayScale is the most accurate floor. If you're at a non-tech company or early in your career, this is closer to your reality.

Indeed is the least useful for current benchmarking. The 36-month window buries the signal.

The actual number you put in the field should be the Levels.fyi or Glassdoor 75th percentile for the specific company, then negotiate. If 70% of hiring managers expect negotiation, pricing yourself at the median is pricing yourself to get negotiated down.

And one more thing the salary sites never show you: base salary is only part of what a hire costs the employer. That $200K base offer costs the company $240K to $290K in the first year when you add payroll tax, benefits, recruiting fees (18 to 25% of first-year base), and onboarding ramp, so base is roughly 70 to 85% of the total. They have more room than you think. The question is whether you know enough about your own market to ask for it.
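
To see how much of the employer's spend is actually base, here's a rough sketch using only the figures quoted above ($200K base, $240K to $290K total, 18 to 25% recruiting fee). The helper names are hypothetical.

```python
BASE = 200_000
TOTAL_COST_RANGE = (240_000, 290_000)  # quoted employer cost range
RECRUITING_FEE_PCT = (18, 25)          # quoted fee range, percent of first-year base


def base_share_pct(base: int, total_cost: int) -> float:
    """Base salary as a percentage of total employer cost."""
    return base / total_cost * 100


def recruiting_fee(base: int, fee_pct: int) -> int:
    """Recruiting fee in dollars for a whole-number percentage."""
    return base * fee_pct // 100


for total in TOTAL_COST_RANGE:
    print(f"total ${total:,}: base is {base_share_pct(BASE, total):.0f}% of cost")
for pct in RECRUITING_FEE_PCT:
    print(f"{pct}% recruiting fee: ${recruiting_fee(BASE, pct):,}")
```

The recruiting fee alone is $36K to $50K on a $200K base, so treat the low end of the total-cost range as a floor, not a typical case.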

If you're prepping for the senior and staff loops where compensation actually diverges, drop the "system design for software engineers" mentality. At DataDriven, we built our system design practice for data engineers around pipeline architecture problems, not the load-balancer trivia that SWE prep loves and DEs never face on the job.

The salary data is broken. The titles are broken. The timelines are longer. None of that changes the fact that data engineering compensation is strong and growing for engineers who know what they're actually worth. The trick is figuring out which population you belong to, not which average to believe.

What's the biggest gap you've seen between what a salary site reported and what you actually earned or were offered?

Your Data Engineering Take-Home Is Now 20 Hours of Free Work

DataDriven — Tue, 23 Jun 2026 21:15:44 +0000

I got a take-home assignment last year from a company I was genuinely excited about. "Should take about four hours," the recruiter said. Build an ingestion pipeline, model the data, write tests, document your design decisions, and prepare a 15-minute presentation walkthrough for the panel. Four hours. I laughed, closed my laptop, and started on it the next morning like it was a sprint. Sixteen hours later I had something I was proud of. Clean pipeline, solid tests, real documentation. I submitted it on a Sunday night. Monday I got a form rejection. No notes. No feedback. Not even which stage I failed. Just "we've decided to move forward with other candidates" and a link to their Glassdoor page.

That was the moment I stopped pretending take-homes are assessments. They're consulting gigs. Unpaid ones.

The Scope Creep Nobody Talks About

Five years ago, a data engineering take-home was a focused exercise. Model this dataset into a star schema. Write a few SQL transforms. Maybe a short README. Two to four hours, tops. Bounded, reasonable, and actually useful for evaluating how someone thinks about data.

That version is dead.

Today, 68% of companies use take-home tests, up 12% year over year. And the scope has quietly ballooned into something unrecognizable. Full pipeline implementations. Test suites with coverage thresholds. Documentation that reads like a design doc. A presentation follow-up where you defend your architecture to a panel. We're talking 10 to 20 hours of work, routinely, for a role you haven't been offered.

Industry best practice caps take-homes at 90 minutes of expected effort. The reality? Candidates consistently take 2x longer than company estimates to reach submission quality. That "four-hour" assignment is an eight-hour assignment. That "weekend project" is a week of evenings. And 25% of companies are still handing these out like they're reasonable asks.

Here's the part that makes my eye twitch: 71% of engineering leaders openly say take-homes no longer generate useful signal. AI has degraded the format so completely that leaders themselves rate take-home signal as "degrading fastest" among all assessment types. They know it's broken. They keep doing it anyway.

The attempted fix is even worse. Companies panicked about AI usage and responded by inflating scope. The logic, if you can call it that: make the assignment so large that AI can't do it alone. Except longer assessments don't defeat AI; they defeat candidates. Candidates with kids. Candidates working full-time jobs. Candidates from non-traditional backgrounds who can't burn 20 hours on a maybe. One candidate documented spending 32 hours on a single assignment, then got rejected for omitting a feature that was never mentioned in the requirements. Another was asked to build a learning module that would've billed at $2,800 as freelance work.

A four-hour take-home is a fair test. A 20-hour take-home is free consulting dressed up as an interview.

59% of job seekers now say unpaid take-home assignments are the number one reason they won't apply. Not comp, not culture, not location. The assessment itself is the dealbreaker.

AI Banned, Rubrics Unchanged

Two thirds of companies ban AI use in their interview process. Sounds decisive. Except fewer than 30% of those companies have actually updated their assessments or retrained their interviewers. They slapped a "no AI" sticker on a 2015-era take-home and called it policy.

The enforcement gap is almost comical. One company measured 80% of candidates using LLMs on take-home tests despite an explicit prohibition. AI cheating on take-homes doubled from 15% to 35% between June and December 2025. In purely technical roles, 48% of candidates show signs of unauthorized AI use. The ban is a suggestion, not a guardrail.

Meanwhile, the rubrics these companies grade against were built to evaluate raw coding speed and syntax accuracy. Those signals collapsed the moment Claude could produce a clean solution in seconds. But nobody rewrote the rubric. Nobody redefined what "good" looks like when the baseline output quality shifted. Hiring managers score problem-solving and architecture judgment, but the assessment they hand out measures code-from-scratch, a skill that's now commodity.

The split in the industry tells you everything. Meta and Shopify openly invite AI tools into their assessments. They've decided to test "can you use AI well" rather than "can you code without it." Goldman Sachs and Amazon maintain hard bans for candidates while investing heavily in internal AI tools for their own engineers. The hypocrisy is so blatant it's almost impressive. You can't use AI to get hired here, but once you're in, you'd better use it or you're slow.

Banning AI in interviews creates a discontinuity between evaluation and production. In 2026, writing code without AI assistance is the exception, not the norm. You're testing candidates in an environment that doesn't reflect the environment they'll work in. That's not assessment; that's theater.

70% of You Will Never Hear Why

Here's the stat that should make every hiring manager uncomfortable: 69.7% of candidates receive zero feedback after rejection. Not "insufficient feedback." Zero. Nothing. A form email and silence.

61% of candidates report being ghosted entirely after interviews. No rejection, no closure. Just silence from a company that asked them to spend a weekend building a pipeline.

Companies hide behind legal risk. "We can't give feedback because candidates might sue." This is, to put it plainly, nonsense. Employment law distinguishes between subjective rejection reasons ("you seemed low-energy") and factual, role-specific feedback ("your schema migration approach didn't handle the edge case we were testing for"). The second type is almost litigation-proof. No engineer has successfully sued a company over constructive technical feedback. The legal defense is a myth that compliance teams perpetuate because "say nothing" is the lowest-variance strategy. It's organizational laziness wearing a legal costume.

The business case against silence is overwhelming. 79% of candidates would reapply to a company if they'd received feedback. Recruiters who share feedback see a 126% increase in candidate referrals. Companies withholding feedback aren't just being rude; they're burning bridges they'll need to cross again in 18 months when they're hiring for the same role.

But here's the real cruelty. When the assessment demands 10 to 20 hours, and the rejection carries zero feedback, you've extracted labor and returned nothing. Not compensation, not signal, not even a paragraph explaining what to work on. The candidate can't even reuse the learning because there is no learning. It's labor arbitrage dressed up as a career opportunity.

Only 17% of external candidates receive feedback, compared to 65% of internal candidates. If you already work there, you get a debrief. If you're on the outside spending your weekend on their assignment, you get a template. The double standard is institutional.

What Actually Works

The good news: some companies figured this out. The better news: it's not complicated.

Live debugging interviews, running 60 to 90 minutes, are replacing puzzles at companies like Cloudflare, Datadog, and GitHub. Candidates get a broken system. They debug it. Interviewers watch the process: how do you form a hypothesis, how do you narrow the search space, do you narrate your thinking. You're evaluated on engineering judgment, not memorization speed. A candidate who thinks aloud and corrects wrong hypotheses scores higher than one who guesses fast but can't explain why.

For senior and staff roles, pair programming on a debugging or refactoring task is the highest-signal round you can run. Forty-five minutes, real code, real collaboration. It surfaces the kind of judgment that 20-hour take-homes never could, because judgment shows up in conversation, not in a solo sprint nobody watches.

Uber runs a two-hour on-site schema critique instead of toy problems. Stripe bounds their take-homes to one to three hours with clear scope. Both companies report higher completion rates and better signal than the bloated formats they replaced.

The pattern is obvious: bounded time, realistic work, human interaction. If you want to know how someone debugs a broken DAG, hand them a broken DAG and watch. Don't ask them to build one from scratch over a weekend and then ghost them.

If you're a candidate stuck grinding through these loops, focus your prep on the concepts that transfer across every format: data modeling, pipeline architecture, query optimization. I've found that a resource like datadriven.io is good for ETL interview questions if you want structured reps on the technical fundamentals without wading through another generic course. The game is arbitrary, but the concepts compound regardless of which format a company throws at you.

The System Knows It's Broken

72% of job seekers report negative mental health impacts from lengthy hiring processes. Candidate ghosting hit a three-year high in 2026. The market has 2.2 million fake openings monthly, candidates respond with AI-powered mass applications, companies respond by banning AI, and the entire system spirals further from producing any useful signal for anyone.

The profession acknowledges the assessment is unreliable while refusing to stop using it. This isn't a transitional phase. It's institutional paralysis. Companies would rather extract 20 hours of free work from someone they'll reject silently than spend 90 minutes on a live session that actually reveals how an engineer thinks.

I've been through enough of these loops to know the system doesn't reform itself. It changes when candidates refuse to participate and when hiring managers with enough authority say "this is stupid, let's stop." If you're in a position to design an interview process, bound the time, provide feedback, and evaluate how people think, not how much free labor they'll tolerate.

If you've done one of these 20-hour take-homes recently: what was the assignment, and did you hear anything back?

The 12 Data Modeling Interview Questions that Matter

DataDriven — Wed, 17 Jun 2026 10:04:25 +0000

I've watched candidates with 8 years of experience go blank when asked to define the grain of a fact table. Not because they're bad engineers; because nobody told them that data modeling is the actual filter. SQL problems test syntax. System design tests memorization. Data modeling tests whether you can think. That's why it's the section that separates senior from staff, and why interviewers keep leaning on it harder every cycle. AI can spit out a medium LeetCode solution in seconds; it still can't explain why your grain decision breaks downstream aggregates.

These 12 problems are the ones I've seen repeatedly across FAANG and late-stage startup loops. They cover star schema design, dimensional modeling tradeoffs, SCDs, late-arriving data, and the classification calls that trip up even experienced candidates.

1. Define the Grain of a Fact Table

The question: You're building an analytics warehouse for a ride-sharing company. Before designing any tables, state the grain of the core fact table. What does one row represent?

The answer is: one row per completed trip. Not per driver. Not per day. One row per atomic trip event, keyed by trip_id, with foreign keys to dim_driver, dim_rider, dim_pickup_location, dim_dropoff_location, and dim_date. Measures include fare_amount, tip_amount, trip_duration_seconds, trip_distance_miles.

Why it matters: Grain is the single most important decision in dimensional modeling. Candidates who jump into drawing tables without stating "one row represents X" are already drifting. Undefined grain causes silent metric inflation, duplicate rows, and join explosions that don't throw errors; they just produce wrong numbers. Interviewers test this first because everything downstream depends on it. The follow-up is always: "What happens when a trip has multiple stops?" If your grain assumed single-destination trips, you just broke your own schema.
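
If you want to show the "silent inflation" point instead of just asserting it, a grain check is a few lines. This is a minimal sketch using Python's built-in sqlite3; the fact_trip table is a cut-down stand-in for the schema described above, and grain_violations is a hypothetical helper name.

```python
import sqlite3


def grain_violations(conn, table, grain_cols):
    """Return grain-key values that appear on more than one row.

    Table and column names are interpolated directly, so only pass
    trusted identifiers (this is a sketch, not a hardened utility).
    """
    cols = ", ".join(grain_cols)
    sql = (
        f"SELECT {cols}, COUNT(*) AS n FROM {table} "
        f"GROUP BY {cols} HAVING COUNT(*) > 1"
    )
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_trip (trip_id INTEGER, driver_sk INTEGER, fare_amount REAL)"
)
conn.executemany(
    "INSERT INTO fact_trip VALUES (?, ?, ?)",
    [(1, 10, 18.50), (2, 11, 9.25), (2, 11, 9.25)],  # trip 2 was loaded twice
)

dupes = grain_violations(conn, "fact_trip", ["trip_id"])
print(dupes)  # each tuple is (trip_id, row_count)
```

Nothing errors when trip 2 loads twice; the fare total is just wrong until a check like this runs.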

2. Star Schema vs. Snowflake Schema

The question: Your team is building a new warehouse on Snowflake (the product). A junior engineer proposes snowflake schema (the design pattern) to save storage. Do you agree? Why or why not?

You don't agree. Star schema is the default for modern columnar warehouses. Snowflake, BigQuery, and Redshift compress denormalized dimensions so efficiently that snowflaking (normalizing dimensions into sub-tables) rarely saves meaningful storage anymore. The engineering overhead of maintaining normalized dimension hierarchies exceeds the storage cost of duplication. Star is the safe opening position in any interview.

Why it matters: Picking snowflake first signals junior thinking. The economics killed the normalization argument around 2024. Interviewers aren't testing whether you know both patterns exist; they're testing whether you can reason about the tradeoff. The follow-up: "When would you normalize a dimension?" The answer is when the dimension is enormous and changes frequently (millions of rows, daily updates), making the redundant writes expensive. That's rare.

3. Design a Fact Table for E-Commerce Orders

The question: Design a star schema for an e-commerce platform. The business needs to track orders at the line-item level for revenue analysis.

CREATE TABLE fact_order_line_item (
    order_line_item_sk  BIGINT PRIMARY KEY,
    order_id            BIGINT,      -- degenerate dimension
    product_sk          BIGINT REFERENCES dim_product,
    customer_sk         BIGINT REFERENCES dim_customer,
    date_sk             INT    REFERENCES dim_date,
    quantity            INT,
    unit_price          DECIMAL(10,2),
    discount_amount     DECIMAL(10,2),
    line_total          DECIMAL(10,2)
);

Grain: one row per line item per order. order_id is a degenerate dimension; it lives in the fact table because it has no descriptive attributes worth a separate table.

Why it matters: This tests three things at once. Can you declare grain (line item, not order)? Do you know what a degenerate dimension is? And do you put the right measures in the fact table? Candidates who model at the order grain lose the ability to analyze product-level revenue without restructuring. You can always aggregate up from line items to orders; you can never disaggregate back down.
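
The "aggregate up, never down" point can be sketched with sqlite3. The table below is a simplified version of the fact table above; storing money as integer cents is my assumption to keep the demo free of float drift, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_order_line_item (
        order_line_item_sk INTEGER PRIMARY KEY,
        order_id           INTEGER,  -- degenerate dimension
        product_sk         INTEGER,
        quantity           INTEGER,
        line_total_cents   INTEGER   -- assumption: cents instead of DECIMAL
    )
""")
conn.executemany(
    "INSERT INTO fact_order_line_item VALUES (?, ?, ?, ?, ?)",
    [(1, 500, 7, 2, 3998), (2, 500, 9, 1, 1250), (3, 501, 7, 1, 1999)],
)

# Rolling up to the order grain is a plain GROUP BY.
order_totals = conn.execute("""
    SELECT order_id, COUNT(*) AS line_items, SUM(line_total_cents) AS total_cents
    FROM fact_order_line_item
    GROUP BY order_id
    ORDER BY order_id
""").fetchall()

# Product-level revenue is only answerable because the grain is the line item.
product_revenue = conn.execute("""
    SELECT product_sk, SUM(line_total_cents) AS revenue_cents
    FROM fact_order_line_item
    GROUP BY product_sk
    ORDER BY product_sk
""").fetchall()

print(order_totals)
print(product_revenue)
```

Had the table stored one row per order, the second query would have nothing to group on.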

4. SCD Type 2 Implementation

The question: A customer changes their address. How do you model dim_customer to preserve the old address for historical reporting?

SCD Type 2: insert a new row with a new surrogate key, set effective_date and expiration_date on both rows, and flag the current row with is_current = TRUE. The original row stays intact; historical fact rows still join to the old address via the old surrogate key.

-- Before change
-- customer_sk=101, name='Jane', city='Austin', effective_date='2024-01-01', expiration_date='9999-12-31', is_current=TRUE

-- After change: close the old row, then insert the new one (run both in one transaction)
UPDATE dim_customer SET expiration_date = '2026-06-16', is_current = FALSE WHERE customer_sk = 101;
INSERT INTO dim_customer (customer_sk, name, city, effective_date, expiration_date, is_current)
VALUES (102, 'Jane', 'Denver', '2026-06-17', '9999-12-31', TRUE);

Why it matters: SCD2 is a separator question. Juniors describe it from the textbook. Seniors bring up the trap: SCD2 row explosion. A dimension with 10M rows tracking frequently changing attributes can balloon to 150M rows in five years. The follow-up is always: "When would you use Type 1 instead?" Answer: when the business doesn't need the history. A corrected typo in a customer name doesn't warrant a new historical row. Type 1 overwrites are often correct, despite Type 2's prestige.
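
Here's a runnable sketch of the same close-and-insert step, plus the point-in-time lookup that makes the history useful. It uses sqlite3, and it adds a customer_id natural key that the example above doesn't show; that column and both function names are my assumptions.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_sk     INTEGER PRIMARY KEY,  -- surrogate key
        customer_id     TEXT,                 -- natural key (assumed)
        name            TEXT,
        city            TEXT,
        effective_date  TEXT,
        expiration_date TEXT,
        is_current      INTEGER
    )
""")
conn.execute(
    "INSERT INTO dim_customer VALUES "
    "(101, 'C-1', 'Jane', 'Austin', '2024-01-01', '9999-12-31', 1)"
)


def change_city(conn, customer_id, new_city, change_date):
    """Close the current row and open a new one in a single transaction."""
    prev_day = (date.fromisoformat(change_date) - timedelta(days=1)).isoformat()
    with conn:  # commits on success, rolls back on error
        (name,) = conn.execute(
            "SELECT name FROM dim_customer WHERE customer_id = ? AND is_current = 1",
            (customer_id,),
        ).fetchone()
        conn.execute(
            "UPDATE dim_customer SET expiration_date = ?, is_current = 0 "
            "WHERE customer_id = ? AND is_current = 1",
            (prev_day, customer_id),
        )
        conn.execute(
            "INSERT INTO dim_customer "
            "(customer_id, name, city, effective_date, expiration_date, is_current) "
            "VALUES (?, ?, ?, ?, '9999-12-31', 1)",
            (customer_id, name, new_city, change_date),
        )


def city_as_of(conn, customer_id, as_of):
    """City on a given ISO date; ISO strings compare correctly as text."""
    row = conn.execute(
        "SELECT city FROM dim_customer WHERE customer_id = ? "
        "AND ? BETWEEN effective_date AND expiration_date",
        (customer_id, as_of),
    ).fetchone()
    return row[0] if row else None


change_city(conn, "C-1", "Denver", "2026-06-17")
print(city_as_of(conn, "C-1", "2025-03-01"), city_as_of(conn, "C-1", "2026-06-17"))
```

The date ranges are inclusive and non-overlapping, so exactly one row matches any date. That's the property to defend when the interviewer asks about gaps.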

5. Late-Arriving Dimensions

The question: An order fact arrives, but the customer who placed it hasn't been loaded into dim_customer yet. What do you do?

Insert a placeholder row in dim_customer with a surrogate key and all descriptive columns set to "Unknown" or null. The fact row joins to this placeholder. When the real customer data arrives, you overwrite the placeholder via Type 1 update.

Why it matters: Late-arriving dimensions and late-arriving facts are entirely different problems, and mixing them up is an instant red flag. This tests whether you understand that the fact table can't wait; it needs a foreign key now. The alternative (dropping the fact row until the dimension arrives) loses data. The follow-up: "What if the dimension arrives with changes?" Then you might need to apply SCD2 logic to the placeholder row, which gets complex fast.
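
The placeholder flow can be sketched like this with sqlite3. The is_inferred flag and the customer_id natural key are my additions for illustration, and both function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY,
        customer_id TEXT UNIQUE,        -- natural key from the source system
        name        TEXT,
        city        TEXT,
        is_inferred INTEGER DEFAULT 0   -- 1 = placeholder awaiting real data
    );
    CREATE TABLE fact_order (
        order_id INTEGER, customer_sk INTEGER, amount_cents INTEGER
    );
""")


def customer_sk_for(conn, customer_id):
    """Return the surrogate key, creating an 'Unknown' placeholder if needed."""
    row = conn.execute(
        "SELECT customer_sk FROM dim_customer WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        "INSERT INTO dim_customer (customer_id, name, city, is_inferred) "
        "VALUES (?, 'Unknown', 'Unknown', 1)",
        (customer_id,),
    )
    return cur.lastrowid


def load_customer(conn, customer_id, name, city):
    """Type 1 overwrite when the real record arrives; the key never changes."""
    sk = customer_sk_for(conn, customer_id)
    conn.execute(
        "UPDATE dim_customer SET name = ?, city = ?, is_inferred = 0 "
        "WHERE customer_sk = ?",
        (name, city, sk),
    )
    return sk


# The fact arrives first and still gets a valid foreign key.
sk = customer_sk_for(conn, "C-42")
conn.execute("INSERT INTO fact_order VALUES (9001, ?, 4999)", (sk,))

# The dimension record shows up later and overwrites the placeholder.
late_sk = load_customer(conn, "C-42", "Jane", "Denver")
```

Because the surrogate key is assigned up front, the fact row never needs to be touched when the real customer lands.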

6. Bridge Tables for Many-to-Many Relationships

The question: A hospital system tracks patient diagnoses. One hospitalization can have multiple diagnoses, and one diagnosis applies to many hospitalizations. How do you model this?

You use a bridge table. Create a diagnosis_group_key that maps to a set of diagnoses in bridge_diagnosis. The fact table (fact_hospitalization) joins to diagnosis_group_key; the bridge table resolves each group to individual dim_diagnosis rows.

Why it matters: Many-to-many relationships in dimensional models are the source of the most dangerous bug in analytics: double-counting. Without a bridge table, a naive join between fact and dimension multiplies rows. Interviewers use this to test whether you understand the cardinality trap. The follow-up: "How do you handle weighting?" If a hospitalization has three diagnoses, does each get 1/3 of the revenue allocation? Bridge tables can carry a weight_factor column for exactly this.
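
To make the double-counting concrete, here's a small sqlite3 sketch with one hospitalization and three diagnoses. The charge_cents measure and the 50/25/25 weights are made up for the demo; the only requirement is that weights within a group sum to 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_hospitalization (
        hospitalization_id INTEGER, diagnosis_group_key INTEGER, charge_cents INTEGER
    );
    CREATE TABLE bridge_diagnosis (
        diagnosis_group_key INTEGER, diagnosis_sk INTEGER, weight_factor REAL
    );
    CREATE TABLE dim_diagnosis (diagnosis_sk INTEGER PRIMARY KEY, diagnosis_name TEXT);

    INSERT INTO fact_hospitalization VALUES (1, 77, 90000);
    INSERT INTO dim_diagnosis VALUES (1, 'Pneumonia'), (2, 'Sepsis'), (3, 'Dehydration');
    INSERT INTO bridge_diagnosis VALUES (77, 1, 0.5), (77, 2, 0.25), (77, 3, 0.25);
""")

JOINED = """
    FROM fact_hospitalization f
    JOIN bridge_diagnosis b ON b.diagnosis_group_key = f.diagnosis_group_key
    JOIN dim_diagnosis d ON d.diagnosis_sk = b.diagnosis_sk
"""

# Naive: the single charge is repeated once per diagnosis row.
naive_total = conn.execute(f"SELECT SUM(f.charge_cents) {JOINED}").fetchone()[0]

# Weighted: each diagnosis gets its share, and the shares add back to the charge.
weighted_total = conn.execute(
    f"SELECT SUM(f.charge_cents * b.weight_factor) {JOINED}"
).fetchone()[0]

per_diagnosis = conn.execute(
    f"SELECT d.diagnosis_name, SUM(f.charge_cents * b.weight_factor) {JOINED} "
    "GROUP BY d.diagnosis_name ORDER BY d.diagnosis_name"
).fetchall()

print(naive_total, weighted_total, per_diagnosis)
```

The naive sum is three times the real charge. The weighted sum matches it, and the per-diagnosis breakdown still adds up.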

7. Fact vs. Dimension Classification

The question: You have a column customer_lifetime_revenue. Is it a fact or a dimension attribute?

It's both, depending on usage. If you're summing it across rows, it's a fact. If you're banding it into ranges ("$0-$1K", "$1K-$10K") to filter or group by, it's a dimension attribute. Kimball calls this the aggregated-fact-as-attribute pattern.

If you would aggregate the column, it's a fact. If you would filter or group by it, it's a dimension. That's the whole test.

Why it matters: This exposes whether a candidate understands that the fact/dimension boundary isn't about data types. Numeric columns don't automatically belong in fact tables. The follow-up: "Where do you physically store it?" Usually in the dimension, banded into a descriptive range, with the raw number available as an additive fact if needed.
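
A tiny sketch of the same column playing both roles. The first two band labels come from the answer above; the "$10K+" band, the boundaries being upper-exclusive, and the customer values are all assumptions for illustration.

```python
from collections import Counter

# (exclusive upper bound, label), checked in order
BANDS = [
    (1_000, "$0-$1K"),
    (10_000, "$1K-$10K"),
]
TOP_BAND = "$10K+"  # assumed catch-all band


def revenue_band(lifetime_revenue: float) -> str:
    """Map a raw revenue number to its descriptive band (dimension attribute)."""
    for upper, label in BANDS:
        if lifetime_revenue < upper:
            return label
    return TOP_BAND


# Made-up customers: id -> customer_lifetime_revenue
customers = {"a": 250.0, "b": 4_800.0, "c": 999.99, "d": 52_000.0}

# Used as a dimension attribute: group and count by band.
by_band = Counter(revenue_band(v) for v in customers.values())

# Used as a fact: sum the raw number.
total_revenue = sum(customers.values())

print(dict(by_band), total_revenue)
```

Same number, two uses: you group by the band and you sum the raw value.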

8. Factless Fact Tables

The question: The business wants to know which products were NOT sold in each store last month. How do you model this?

A factless fact table (coverage table). One row per store per product per month, representing eligibility. To find products not sold, you subtract the sales fact table from the coverage table.

Why it matters: Most candidates have never heard of factless fact tables. The name sounds like a contradiction. But they solve a real problem: you can't report on the absence of an event without first modeling what could have happened. Student attendance, product availability, promotional eligibility; these all use the same pattern. The follow-up: "Isn't this just a cross join?" Yes, and that's the point. The cross join defines the universe; the anti-join finds the gaps.
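
Here's the coverage-minus-sales anti-join as a runnable sqlite3 sketch. Table names, the integer month key, and the sample rows are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Factless fact: one row per store/product/month that COULD have sold.
    CREATE TABLE fact_product_coverage (
        store_sk INTEGER, product_sk INTEGER, month_sk INTEGER
    );
    CREATE TABLE fact_sales (
        store_sk INTEGER, product_sk INTEGER, month_sk INTEGER, quantity INTEGER
    );

    INSERT INTO fact_product_coverage VALUES
        (1, 100, 202605), (1, 101, 202605), (2, 100, 202605), (2, 101, 202605);
    INSERT INTO fact_sales VALUES
        (1, 100, 202605, 3), (2, 100, 202605, 1), (2, 101, 202605, 4);
""")

# Anti-join: coverage rows with no matching sales row.
not_sold = conn.execute(
    """
    SELECT c.store_sk, c.product_sk
    FROM fact_product_coverage c
    LEFT JOIN fact_sales s
      ON  s.store_sk   = c.store_sk
      AND s.product_sk = c.product_sk
      AND s.month_sk   = c.month_sk
    WHERE c.month_sk = ? AND s.store_sk IS NULL
    ORDER BY c.store_sk, c.product_sk
    """,
    (202605,),
).fetchall()

print(not_sold)  # (store_sk, product_sk) pairs with coverage but no sales
```

Without the coverage table there is no row for store 1, product 101 anywhere in the warehouse, so there is nothing for a query to find.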

9. Accumulating Snapshot Fact Table

The question: Model an order fulfillment pipeline with stages: ordered, packed, shipped, delivered.

One row per order, with multiple date columns: order_date_sk, pack_date_sk, ship_date_sk, delivery_date_sk. The row gets updated as the order progresses through stages. Null dates indicate incomplete milestones.

Why it matters: This is the "advanced grain" question. Most candidates know transaction facts and periodic snapshots; accumulating snapshots trip them up because the row mutates. The fact table updates in place, which feels wrong if you've been taught that fact tables are append-only. Insurance claims, hiring workflows, procurement cycles; all use this pattern. The follow-up: "How do you handle an order that skips a stage?" That's a null in the milestone column, and your reporting logic needs to handle it.
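
The update-in-place behavior, including a skipped stage, can be sketched with sqlite3. The fact_order_fulfillment table name and the record_milestone helper are hypothetical; the date columns are the ones named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_order_fulfillment (
        order_id         INTEGER PRIMARY KEY,  -- grain: one row per order
        order_date_sk    INTEGER,
        pack_date_sk     INTEGER,
        ship_date_sk     INTEGER,
        delivery_date_sk INTEGER
    )
""")

# Stage name -> milestone column. The whitelist keeps the f-string below safe.
MILESTONES = {
    "packed": "pack_date_sk",
    "shipped": "ship_date_sk",
    "delivered": "delivery_date_sk",
}


def record_milestone(conn, order_id, stage, date_sk):
    """Fill in one milestone date on the order's existing row."""
    column = MILESTONES[stage]
    conn.execute(
        f"UPDATE fact_order_fulfillment SET {column} = ? WHERE order_id = ?",
        (date_sk, order_id),
    )


conn.execute(
    "INSERT INTO fact_order_fulfillment (order_id, order_date_sk) VALUES (1, 20260601)"
)
record_milestone(conn, 1, "shipped", 20260603)    # the packed stage was skipped
record_milestone(conn, 1, "delivered", 20260605)

row = conn.execute(
    "SELECT * FROM fact_order_fulfillment WHERE order_id = 1"
).fetchone()
row_count = conn.execute("SELECT COUNT(*) FROM fact_order_fulfillment").fetchone()[0]
print(row, row_count)
```

The order is still a single row, and the skipped stage stays NULL, so any "days from pack to ship" metric has to decide what to do with it.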

10. Conformed Dimensions

The question: Sales and marketing each have their own dim_customer table with different definitions. What's the risk, and how do you fix it?

The risk is the CEO gets two different customer counts. Conformed dimensions are shared across fact tables and business units, with identical keys, attributes, and definitions. You build one dim_customer, owned by a central data team, and both domains join to it.

Why it matters: This tests organizational thinking, not just schema design. Split-brain dimensions are how companies end up with "which number is right?" meetings. The follow-up: "What if the two teams need different attributes?" Add them to the same dimension. A wide dimension with 50 columns that both teams trust is better than two narrow dimensions that contradict each other.

11. Normalization vs. Denormalization for Analytics

The question: When would you choose a normalized (3NF) model over a denormalized star schema in an analytics warehouse?

Almost never for the presentation layer. Denormalized schemas achieve 20 to 100x faster query performance on complex analytics workloads by eliminating joins. BigQuery benchmarks show 49% average improvement with fully denormalized tables compared to star schemas. But the staging layer should stay normalized. 3NF in staging preserves flexibility; when requirements change, you can rematerialize the presentation layer without remodeling the entire pipeline.

Why it matters: The real answer is "both, in different layers." Organizations run 3NF in source systems, normalize in staging for integrity, and denormalize in the presentation layer for speed. Candidates who pick one paradigm for the entire warehouse reveal they've never dealt with a schema migration. The follow-up: "What about high-cardinality many-to-many relationships?" Don't denormalize those. A customer/orders/products grain creates explosive row multiplication.
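
To make the row-multiplication follow-up concrete, here's a toy sketch in plain Python. The customer, order, and product values (and the credit_limit attribute) are made up; it only illustrates how flattening to a customer/order/product grain repeats customer-level values on every row.

```python
customers = [{"customer_id": 1, "credit_limit": 1000}]
orders = [{"order_id": o, "customer_id": 1} for o in (10, 11, 12)]
order_products = [
    {"order_id": o["order_id"], "product_id": p}
    for o in orders
    for p in ("A", "B", "C", "D")
]

# Flatten everything into one wide table at customer/order/product grain.
flat = [
    {**c, **o, **op}
    for c in customers
    for o in orders if o["customer_id"] == c["customer_id"]
    for op in order_products if op["order_id"] == o["order_id"]
]

row_count = len(flat)  # 1 customer x 3 orders x 4 products
naive_credit_total = sum(r["credit_limit"] for r in flat)
true_credit_total = sum(c["credit_limit"] for c in customers)
```

One customer turns into twelve rows, and any aggregate over a customer-level column in the flat table is inflated by the same factor unless the query deduplicates first.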

12. Late-Arriving Facts and Backfills

The question: Your daily pipeline processes orders by processing_date. An order from 10 days ago arrives today. How does your pipeline handle it?

Partition by event_time (when the order was placed), not processing_time (when it arrived). Keep a rolling recompute window open; reprocess the last 14 days on every run. This auto-reconciles normal late arrivals without manual intervention. For data outside the window, run an explicit backfill job.
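
A small sketch of the routing decision, assuming daily partitions keyed by event date and a 14-day window that includes the run date itself (that inclusive boundary is my assumption; the answer above only says "the last 14 days"). The function names are hypothetical.

```python
from datetime import date, timedelta

RECOMPUTE_WINDOW_DAYS = 14  # window size from the answer above


def partitions_to_recompute(
    run_date: date, window_days: int = RECOMPUTE_WINDOW_DAYS
) -> list[date]:
    """Event-date partitions the daily run rebuilds, oldest first."""
    return [
        run_date - timedelta(days=offset)
        for offset in range(window_days - 1, -1, -1)
    ]


def route_late_order(event_date: date, run_date: date) -> str:
    """Decide whether a record is absorbed by the rolling window or needs a backfill."""
    oldest = partitions_to_recompute(run_date)[0]
    return "recompute_window" if event_date >= oldest else "explicit_backfill"


run_date = date(2026, 6, 25)
window = partitions_to_recompute(run_date)
ten_days_late = route_late_order(run_date - timedelta(days=10), run_date)
thirty_days_late = route_late_order(run_date - timedelta(days=30), run_date)
```

The order from 10 days ago lands in a partition the run was going to rebuild anyway; the 30-day-old one falls outside the window and gets handed to the backfill path.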

Why it matters: Late data isn't a failure mode; it's the normal case. Most production systems expect 10 to 20% of daily volume to arrive delayed. Candidates who say "drop anything older than 7 days" have never worked on a pipeline that finance depends on. The follow-up: "What if the late fact needs to join to a dimension that has since changed (SCD2)?" You join to the dimension version that was active at event time, not processing time. That's the whole point of surrogate keys and effective dates.
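
The SCD2 follow-up can be sketched as an as-of lookup. Everything here is illustrative: the dimension rows, the tier attribute, the half-open effective_from/effective_to convention, and the far-future end date for the current row are assumptions, not something the question specifies.

```python
from datetime import date

# Hypothetical SCD2 rows for one customer: each version has its own surrogate key.
dim_customer = [
    {"customer_sk": 101, "customer_id": 7, "tier": "silver",
     "effective_from": date(2026, 1, 1), "effective_to": date(2026, 6, 20)},
    {"customer_sk": 102, "customer_id": 7, "tier": "gold",
     "effective_from": date(2026, 6, 20), "effective_to": date(9999, 12, 31)},
]


def surrogate_key_as_of(customer_id: int, as_of: date) -> int:
    """Return the surrogate key of the version active on `as_of` (half-open interval)."""
    for row in dim_customer:
        if (row["customer_id"] == customer_id
                and row["effective_from"] <= as_of < row["effective_to"]):
            return row["customer_sk"]
    raise LookupError(f"no dimension version for customer {customer_id} on {as_of}")


event_time = date(2026, 6, 15)       # when the order was placed
processing_time = date(2026, 6, 25)  # when it finally arrived

correct_sk = surrogate_key_as_of(7, event_time)
wrong_sk = surrogate_key_as_of(7, processing_time)
```

Looking up by event time stamps the late order with the silver-tier version that was active when it was placed; looking up by processing time would quietly attribute it to the gold-tier version.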


Data modeling questions keep showing up because they're the one thing AI can't fake for you. An LLM will produce a schema. It won't explain why that grain breaks when requirements shift, or defend the denormalization when the interviewer pushes back. If you want structured reps on these exact patterns, I used DataDriven for data modeling interview questions and it was the most efficient prep I found for this category.

Which data modeling interview question would you add to this list? I'm curious what y'all are seeing in loops right now.

80% of DE Candidates Use AI on Take-Homes. Companies Can't Stop It.

DataDriven — Tue, 09 Jun 2026 10:05:55 +0000

I've been on both sides of the hiring table for data engineering roles. I've given take-homes, graded take-homes, argued with other panelists about take-homes, and done my share of them as a candidate. So when I tell you the entire system is broken in a way nobody wants to talk about honestly, I'm not theorizing. I watched it happen in real time.

Here's the situation: 64% of companies now prohibit AI tools in technical interviews. Meanwhile, 35% of candidates are using LLMs anyway, up from 15% just six months prior. In purely technical roles, that number climbs to 48%. And 61% of those candidates pass the approval threshold and advance without anyone noticing. The ban exists on paper. In practice, it's a suggestion that penalizes the people who follow it.

The Honest Candidate Tax

This is the part that actually pisses me off. If you're a data engineering candidate who follows the rules, who sits down with your take-home and writes your own SQL, builds your own pipeline, tests your own edge cases, you are now competing against people whose submissions were polished by an LLM in a fraction of the time. And the hiring team cannot tell the difference.

Cheaters have a roughly 3:1 pass rate advantage. That's not a guess; that's from Fabric's analysis of 19,368 interviews between July 2025 and January 2026. Candidates using AI tools scored above the 7.0 approval threshold 61% of the time. The honest candidates? They're producing slower, rougher, less polished work. Because that's what real human output looks like when you're solving an unfamiliar problem under time pressure.

It gets worse. Take-home assignments have ballooned. What used to be a 2 to 3 hour exercise is now routinely 10 to 20 hours of unpaid work. Full pipeline implementations, data modeling, documentation, testing, presentations. At that scope, using AI isn't just tempting; it's economically rational. You're asking someone to do a part-time job for free and then punishing them for using the most efficient tool available.

The 20-hour take-home created the cheating incentive. Companies shifted from live coding to extended take-homes to "reduce bias" and inadvertently built the perfect environment for undetectable AI assistance.

83% of candidates say they would use AI if they could get away with it. I'm honestly surprised the number is that low. The game theory here is a textbook prisoner's dilemma: if you assume your competition is cheating (and statistically, they are), following the rules is the losing move. Genuine candidates report feeling forced to cheat because they assume everyone else already is.

And the detection? It's theater. Some platforms claim 93% accuracy analyzing keystroke patterns and tab-switching behavior. But invisible overlay tools like Cluely and Interview Coder now render answers using DirectX and Metal at the OS level, completely invisible to screen sharing. A second device listening to interview audio works just as well. The detection arms race is over before it started.

The Ban That Nobody Can Enforce

Here's the double standard that makes this whole thing absurd: 64% of organizations using AI in HR apply it to recruiting and interviewing on their end. They're screening your resume with AI, generating interview questions with AI, scoring your responses with AI. But you, the candidate? You're banned from using AI. Because integrity.

Amazon explicitly disqualifies candidates caught using AI. Goldman Sachs told campus recruits they "must not use ChatGPT, Google, or any external AI assistance." Noble policies. Zero enforcement mechanism. Neither company has a reliable way to detect it. Enforcement depends on candidates self-reporting or failing live follow-ups.

71% of engineering leaders admit AI makes technical skills harder to assess. And yet 62% still prohibit it despite acknowledging they cannot detect violations. This isn't a policy; it's a prayer.

The detection tools themselves are worse than useless. AI detectors bundled into platforms like Turnitin and GPTZero are, by multiple 2026 analyses, "increasingly wrong" because candidates can prompt an LLM to generate novel solutions that plagiarism software flags as original work (because they are). False positive rates range from 1% to 30% depending on the tool. So you've got honest candidates getting flagged for coincidental code similarity while actual cheaters using invisible overlays sail through. The system protects liars better than truth-tellers.

The core problem isn't that AI is too good. It's that detection is unsolvable at scale. A candidate can prompt GPT-4 to generate novel, non-plagiarized code for any assignment, and no static analysis can distinguish it from original work without access to the candidate's reasoning process. The only reliable detection is process visibility: pair programming, timestamped drafts, in-person walkthroughs. And companies resist all of those because they don't scale cheaply.

One company's response, when shown data that 80% of their take-home submissions used LLMs? They decided to ignore the cheating and just move top performers to the next round. That's not a hiring process. That's capitulation.

Three Companies, Three Opposite Bets on the Future

The industry hasn't converged on a solution. It's fractured into at least three incompatible approaches, and if you're job hunting in data engineering right now, you need to understand all of them.

The AI-required camp. Meta launched AI-enabled interviews in October 2025. Candidates work in CoderPad with access to GPT-4o, Claude, Gemini, or Llama. They're evaluated on AI fluency, prompt engineering, output validation, and debugging. The company plans to expand this to all backend and ops roles in 2026. Canva went further: they replaced their entire "Computer Science Fundamentals" interview with "AI-Assisted Coding" for backend, ML, and frontend roles. Candidates must use Copilot, Cursor, or Claude. The problems are designed so they can't be solved with a single prompt; they require iterative thinking and judgment.

The signal these companies are hiring for isn't "can you code without help." It's "can you direct AI correctly, catch its mistakes, and defend every architectural decision." Candidates who passed these rounds weren't better prompters. They knew what to build, caught what the AI got wrong, and could explain why.

The ban-and-hope camp. Amazon and Goldman Sachs sit here. Explicit prohibition, no reliable detection, trust-based enforcement. Fewer than 30% of companies that ban AI have actually retrained their interviewers to spot it. The policy exists to provide legal cover, not to change outcomes.

The hybrid camp. 41% of companies now pair a take-home with a synchronous defense session. You do the work at home (with whatever tools you actually use), then you sit down with an engineer for 30 minutes and explain it. This is where LLM help evaporates. If you can't walk through your own solution, modify it on the fly, and handle edge cases in conversation, the take-home score doesn't matter. It's spreading as the unspoken standard because it's the only format that actually tests what companies care about.

The red flags interviewers are learning to spot in those defense sessions: explanation-code mismatch (your spoken reasoning contradicts what you wrote), terminology beyond your demonstrated level (a junior suddenly discussing architectural patterns they can't elaborate on), and the tell-tale 3 to 5 second delay before every answer that suggests an overlay is generating responses in real time.

The Career Implications Nobody's Saying Out Loud

Entry-level data engineering roles are getting hammered the hardest. Junior candidate cheating nearly tripled, from 15% to 40% year over year. And junior candidates have the lowest detection risk because interviewers expect less fluency from them. A senior engineer dropping suspiciously polished system design answers raises eyebrows. A junior producing clean code? That just looks like a strong candidate.

This inverts the hiring funnel in a way that should terrify everyone. The most junior, least skilled cohort has the highest incentive to cheat and the best chance of getting away with it. They get hired. They can't do the job. The team absorbs the cost. And six months later, the same team posts the same req, runs the same broken process, and wonders why their pipeline keeps breaking.

Here's where it lands for your career. If you're a candidate: the interview is a game. It has always been a game. AI didn't make it arbitrary; it was already arbitrary. DS&A has always been a mechanism to rank candidates, not an indicator of data engineering experience. What changed is the rules of the game, and right now nobody agrees on what the rules are. So you need to prepare for all three formats. Know your fundamentals cold; not because a take-home requires it, but because the 30-minute live defense does. That's where the real hiring decision happens now.

If you're on a hiring panel: stop pretending your take-home ban is enforceable. It isn't. Either redesign around it (hybrid format, live defense, AI-collaborative sessions) or accept that you're selecting for candidates who are good at hiding AI use. That's a skill, sure. It's just not the one you think you're testing for.

82% of data professionals now use AI tools daily. We're banning in interviews the exact workflow we expect on the job. At some point, the industry has to reconcile those two facts.

The companies that figure this out first will hire the best engineers. The ones clinging to unenforceable bans will hire the best cheaters. Same resume, same score, very different outcome six months in.

What's your read? If you're interviewing right now, are you using AI on take-homes, and do you think the hybrid format (take-home plus live defense) actually solves the problem, or just moves it?

Your Data Engineering Take-Home Is Free Labor

DataDriven — Thu, 04 Jun 2026 10:07:59 +0000

I got sent a take-home assignment last year that asked me to build an end-to-end pipeline: ingest from three APIs, transform in Python, load to a warehouse, write tests, document my design decisions, and prepare a 15-minute presentation for "the team." The recruiter said it should take "about four hours." I timed myself. It took fourteen. I didn't get the job. I didn't get feedback. I got a form rejection three weeks later.

That's not an interview. That's a consulting engagement with a 0% billing rate.

The Scope Creep Nobody Talks About

Take-homes weren't always like this. The original pitch was reasonable: instead of whiteboard hazing where you reverse a linked list while someone watches you sweat, you get to work in your own environment, at your own pace, on something resembling real work. That was the deal. A couple hours, a focused problem, maybe a short discussion afterward.

Then companies got greedy.

The recommended best practice is still 2 to 4 hours. Over 80% of survey respondents believe take-homes should cap at four hours. But candidates consistently report spending 5x the stated estimate. Companies will write "don't spend more than 3 hours on this" at the top of a prompt that includes building a working MVP, writing a README with architecture docs, recording a demo, adding unit and integration tests, and documenting your trade-offs. That's not a 3-hour task. That's a small freelance project.

The scope expectations in 2026 are indistinguishable from paid contract work. One candidate reported being asked to create a 30-minute learning module with video, graphics, voiceover, and interactive elements. The estimated freelance market value? $2,800. For an "assessment."

Here's what's actually happening: data engineering take-homes have quietly evolved from "show us you can write SQL" to "build us a proof of concept we might actually use." And the line between those two things is the line between an interview and unpaid labor.

58% of engineers believe they deserve payment for take-homes. Only 4% receive it. Read those numbers again.

If a candidate invests 15 hours with a 10% chance of advancing, nine times out of ten those hours return nothing. That's not an interview process; that's a lottery where you pay with your weekend.

Free Consulting in Disguise

Let's talk about the part nobody wants to say out loud: some companies are using candidate submissions.

Indeed's own hiring research flags the concern directly: "Companies may even steal the ideas of candidates, use them, and not give credit or compensate the candidate." That's not a fringe take from a disgruntled Reddit poster. That's on a major job platform's hiring guide.

The structural problem is simple. When you ask a data engineer to build a pipeline that ingests your actual data format, transforms it according to your actual business logic, and loads it into your actual warehouse schema, you've crossed the line from evaluation to extraction. The candidate doesn't know if their code will ship or be discarded. The company doesn't disclose what happens to submissions. The information asymmetry is total.

And it gets worse. About 50% of job seekers strongly dislike take-homes and drop out entirely. But here's the paradox: Dropbox found that 20% of candidates abandoned their process before completing assignments, and the ones who dropped out were often the strongest candidates. They had competing offers. They had leverage. They didn't need to grind 15 hours for a maybe.

So who finishes? Candidates without alternatives. The desperate. The junior engineers without competing offers. The people who can't afford to say no. The take-home isn't filtering for talent; it's filtering for availability.

This is especially brutal for marginalized candidates. If you're working a second job, handling caregiving, or don't have reliable internet access, a 15-hour unpaid assignment isn't a minor inconvenience. It's a gate that has nothing to do with whether you can do the work. The equity fracture is real: unpaid take-homes create a class-based filter where financial stability determines who gets to compete.

76% of recruiters say take-homes improve hiring quality. Nobody surveyed the candidates to ask if they felt fairly assessed. Funny how that works.

Red Flags Before You Even Open the Repo

After doing somewhere around 20 interview loops in a single job search (some went well, some went laughably poorly), I've developed a pretty reliable radar for take-homes that are going to waste my time. Here's what I look for now.

No time estimate at all. If the prompt doesn't tell you how long they expect it to take, they either don't know or don't care. Both are bad. A company that can't scope a 3-hour exercise is telling you something about how they scope projects internally.

The deliverables list is longer than the problem statement. When the requirements section says "build a pipeline" but the deliverables section says "working code, tests, documentation, architecture diagram, trade-off analysis, recorded demo, and a 15-minute presentation," you're not being evaluated. You're being outsourced.

The data looks suspiciously like their actual data. Generic datasets (public APIs, sample CSVs) are fine. When the schema matches their product domain a little too closely, when the transformations feel like real business logic, that's not a coincidence. That's a proof of concept with plausible deniability.

No compensation and no timeline. The gold standard for a fair take-home is 90 minutes of focused work plus a 30-minute walkthrough. If they're asking for more than 4 hours and offering nothing in return, the math doesn't work in your favor. Labor experts agree: unpaid assignments exceeding 2 to 3 hours cross into territory where compensation is ethically and legally justified.

They ghost after submission. 8 in 10 hiring managers admit to ghosting candidates. If you're investing a full weekend into something, you deserve feedback. A company that can't write two paragraphs about why they passed on you after you wrote two thousand lines for them is telling you exactly how they'll treat you as an employee.

The AI angle makes all of this worse, not better. Cheating rates on take-homes jumped from 15% to 35% in six months. Companies are responding not by making the process fairer, but by making it harder and longer. Classic arms race. The format's legitimacy is collapsing in real time.

How to Push Back Without Burning the Bridge

Here's what I've learned the hard way about career management in the interview game: asking for scope clarity isn't weakness. It's the move that separates you from candidates who'll silently overcommit and resent it.

Before you open your IDE, send this: "I want to make sure I'm aligned with your expectations. I estimate this will take X hours based on the requirements. Is that consistent with what you've seen from other candidates?" That's it. Professional, direct, and it forces them to put a number on the record.

If the number they come back with is wildly different from your estimate, you have information. Either the scope is genuinely smaller than it looks (great, clarify what's optional) or they're lowballing the time estimate to avoid scaring you off (red flag).

If the assignment exceeds 4 hours, it's completely reasonable to ask about compensation. Buffer runs 45-day paid trial projects. Webflow uses 3 to 5 day paid contract follow-ups. These aren't charity; a paid engagement costs the company less than a bad hire. If a company balks at paying for 8 hours of your time, they're telling you the ROI math doesn't work, which means they're sending this to dozens of candidates and hoping one sticks.

And here's the contrarian take that took me years to internalize: if a company demands 20 hours for an initial screen and you walk away, you didn't lose an offer. You dodged a scope-creep culture. The real risk isn't declining; it's spending 20 hours and still losing to someone who had better chemistry in the 30-minute walkthrough. That's not career management. That's gambling with your time as the chips.

What Fair Actually Looks Like

Fair take-homes exist. They're just rare.

The format works when companies respect it: 90 minutes of focused work, a clear rubric, a 30-minute walkthrough to verify you wrote it and can explain it. Well-designed take-homes show 35% higher correlation with job performance compared to whiteboard interviews. The signal is real. The abuse is the problem, not the concept.

Companies that compensate candidates and set clear requirements see 85%+ completion rates. The ones that don't? 60 to 70%. The data is screaming the answer. Pay people for their time, scope the work honestly, and the format is better than every alternative.

900+ companies are listed on the Hiring Without Whiteboards repository, explicitly committing to fair evaluation. Some use pair programming on real codebases. Some use portfolio reviews. Some use short, focused take-homes with hard time caps. The alternatives exist. The industry just hasn't decided to care yet.

The uncomfortable truth is that when hiring fundamentals are correct (clear rubrics, structured evaluation, diverse interviewers), the format almost doesn't matter. Take-home, live coding, pair programming; they all work when the process respects the candidate. They all fail when it doesn't.

I've been on both sides of the interview table. I've been the candidate grinding through a 14-hour take-home for a form rejection. I've been on hiring panels where we evaluated submissions in under 10 minutes that took candidates an entire weekend. Both of those experiences made me angry for the same reason: the asymmetry is the point, not the bug.

So here's my question for anyone who's been through this recently: what's the most egregious take-home you've been asked to do, and did you finish it or walk away?

Tech Lost 150K Jobs in 2026. Data Engineering Gained 414%.

DataDriven — Tue, 19 May 2026 10:05:18 +0000

I got laid off once. Not from a data engineering role; from an analytics-adjacent contracting gig that evaporated when budgets got cut. I spent exactly one week feeling sorry for myself, then I started grinding. That was years ago. Since then I've watched three separate waves of "tech is over" panic, sat through two recessions worth of hiring freezes, and somehow ended up at staff level building pipelines at companies you've definitely used. The pattern is always the same: broad panic, selective survival, and a very small number of people who read the room correctly and came out the other side making more money than before.

2026 is that pattern again, except the signal is louder than it's ever been.

150,000 Jobs Gone. Data Engineering Didn't Flinch.

The numbers are ugly. Over 150,000 tech jobs cut across 500+ companies in 2026. Q1 alone saw 52,050 layoffs, a 40% jump over Q1 2025 and the worst first quarter since 2023. That's roughly 973 people per day losing their jobs. If you're in tech and you don't know someone who got hit, you're not paying attention.

But here's the part nobody's talking about at happy hour: data engineering is projected to grow 414% through 2030. Over 150,000 data engineers are currently employed, with 20,000+ new jobs created in the past year alone. The global data engineering services market hit $105 billion in 2026 and is growing at 15% annually.

These two facts exist simultaneously. Massive contraction and massive expansion, in the same industry, at the same time.

The displacement isn't random. Data analytics postings dropped 15.2% year over year. Broader tech postings fell 36%. But data engineering grew. Not "held steady." Grew. This isn't the whole boat rising; it's one lifeboat pulling away while the ship lists.

40% of data teams expanded headcount in 2025 (up from 14% the year before), even as 41% reported negative budget impacts from economic pressures. They're not adding headcount for fun. They're replacing other roles with engineers who build infrastructure.

That's the substitution nobody wants to name. Companies aren't growing data teams out of optimism. They're swapping analysts and BI developers for engineers who can build the plumbing that AI systems need to function. It's not growth; it's triage.

The GM Playbook: Fire IT, Hire Data Engineers

If you want to see the pattern in action, look at GM. In May 2026, they laid off 600 salaried IT workers, roughly 10% of their IT department. Identity access management, platform security, software engineering teams. Gone.

Then they immediately opened positions for data engineering, analytics, AI-native development, and cloud-based engineering.

This isn't a contradiction. It's a skills swap. GM didn't cut costs and call it a day. They cut roles they decided AI could handle or that weren't generating direct value, then reinvested in the roles they believe are load-bearing for the next five years. Data engineers made that list. Traditional IT didn't.

And GM isn't unique. The same pattern is playing out across the industry. Companies are discovering that 88% of their agentic AI pilots fail to reach production, not because the models are bad, but because the data infrastructure underneath them is a mess. Disconnected metadata catalogs, fragmented pipelines, schemas that nobody documented, cost optimization that nobody owns. Every failed AI pilot is a job posting for a data engineer.

The quote I keep seeing in industry reports: "Most teams are hiring data engineers to rebuild the plumbing: cleaner pipelines, faster ingestion, better monitoring, and datasets that can be trusted in production." That's the job. It's always been the job. Now there's a $105 billion market saying it out loud.

What AI Actually Automates (and What It Can't Touch)

Here's where most people get the career math wrong. They hear "AI is automating data engineering" and assume it's a uniform threat. It's not. The automation is extremely specific, and knowing which side of the line your skills sit on is the difference between a 414% growth curve and a pink slip.

The numbers on automation rates tell the story clearly. Data quality checks: 70% automatable. ETL pipeline generation: 65%. Database optimization: 58%. Data warehouse and lake architecture: 38%.

See the pattern? The further you move from "write this query" toward "design this system," the less AI can do. Boilerplate SQL generation? Gone. Figuring out why your pipeline silently dropped 2 million rows last Tuesday because an upstream team changed a schema without telling anyone? That's a human problem. It requires business context, institutional knowledge, and the ability to yell at the right Slack channel at 2am.

Python appears in 70% of 2026 data engineer postings. SQL dropped to 69%, down from 79% in 2025. That's not a typo. SQL, the language that defined data work for decades, is now less common in job postings than Python. The shift tells you exactly what companies are buying: less query-writing, more infrastructure architecture, more orchestration, more systems thinking.

Gartner projects AI will reduce manual intervention in data engineering by 60%. Which sounds terrifying until you realize the 40% that's left is all the hard stuff. Capacity planning across regions. Schema migrations that touch compliance rules. Cost optimization decisions where the CFO doesn't accept "AI said so" as justification. The comfortable middle of data engineering is getting automated. What's left is the stuff that actually requires judgment.

I've been through three waves of "data engineering is getting automated away." Still here. Still employed. Still debugging the same categories of problems. The tools change every 18 months. Schema drift, late-arriving data, upstream teams breaking contracts without telling you; these are eternal.

The Two-Tier Market Is Already Here

The split isn't coming. It's here. And it's creating two very different career trajectories.

Tier one: Entry-level ETL work, boilerplate transformations, basic pipeline assembly. This is automating at 65-70%. If your daily work is writing the same dbt models and Airflow DAGs without understanding why the pipeline exists or what business decision it feeds, you're on the wrong side of this line.

Tier two: Architecture, cost optimization, governance, production debugging, ML infrastructure. This is growing. Fast. Data engineers now spend 37% of their work hours on AI-related projects, up from 19% in 2023, projected to hit 61% by 2027. The role isn't shrinking; it's shifting upward.

And the hiring market reflects this perfectly. 45% of data engineering postings now contain AI-related terms. CI/CD and DevOps appear in one out of every six postings. 26% of postings skip education requirements entirely; they don't care about your degree, they care about your production code samples.

Here's what that means for your hiring prospects. The companies with unfilled data engineering reqs sitting for 12-18 months aren't struggling because there aren't enough data engineers. They're struggling because there aren't enough data engineers with the right skills. It's a mismatch, not a shortage.

The most underrated part of this: analytics engineers earn a median of $189,000 versus data engineers at $131,000, yet analytics engineering isn't projected for the same growth. Companies are overpaying for the title they think they need while undertesting for the skills they actually need. I've been on hiring panels where we tested pipeline architecture for an analytics engineer role and business context for a data engineering role. Backwards. Every time.

What the Survivors Are Doing Differently

The people who come out of this cycle making $148,000 to $186,000 (the San Francisco range for data engineers right now) aren't the ones who learned one more tool. They're the ones who understood which problems compound.

Concepts transfer across tools; tool knowledge doesn't transfer across concepts. I've been saying this for years and it's never been more true. The engineer who understands data modeling, query optimization, and why things break will learn whatever orchestrator the company uses in a week. The engineer who memorized Airflow's API but can't explain why a star schema might not be the right choice anymore (hint: the economics killed it) is going to have a harder time.

The skill stack that's actually getting people hired: Python and SQL as baseline (still non-negotiable, even as SQL's dominance fades). Spark at 38.7% of postings. Cloud fluency, with AWS at 32% market share. And increasingly, AI literacy; not "build a transformer from scratch" but "understand how your pipelines feed ML systems and how to optimize that relationship."

The real career insight hiding in all of this data: production infrastructure beats research. Every time. Data engineers earning $130K-$180K while data scientists struggle for roles reflects a truth the industry doesn't like admitting. The CFO cares about the pipeline that feeds the board deck, not the model that got 2% better accuracy on a benchmark nobody uses.

Junior engineers worry about which tool to learn. Senior engineers worry about which problems to solve. Staff engineers worry about which problems to prevent.

That hierarchy maps directly onto the automation curve. Tools get automated. Problems don't. Prevention definitely doesn't.

I've watched people with 10 years of experience get downleveled because they couldn't articulate system design decisions under pressure. I've also watched people with non-traditional backgrounds land staff roles because they could explain exactly why their pipeline was designed the way it was and what it would cost to change it. The interview is a different skill than the job, but both skills reward the same thing: understanding the "why" behind the architecture, not just the "how" of the implementation.

The 150,000 jobs that vanished in 2026 aren't coming back. The 414% growth curve in data engineering isn't slowing down. The gap between those two numbers is the entire story of tech employment right now. The question is just which side of that gap you're standing on.

So: what's the one skill in your current stack that you're most worried AI is about to eat? And what are you replacing it with?

DSA Is Dying in DE Interviews. Nobody Agrees on What's Next.

DataDriven — Thu, 14 May 2026 10:05:33 +0000

I did somewhere around 20 interview loops in a single job search. Some went well. Some went so poorly I still think about them in the shower. But here's the thing: at least I knew what I was prepping for. LeetCode mediums, maybe a SQL round, maybe a system design conversation. The format was predictable, even if it was stupid. That era is over, and what replaced it is somehow worse.

The data engineering community has been screaming for years that DSA doesn't belong in DE interviews. Binary tree traversals, dynamic programming, graph algorithms; none of this maps to the actual job. The actual job is debugging why a pipeline silently dropped 2M rows last Tuesday, not implementing Dijkstra's algorithm on a whiteboard. Reddit finally agreed. r/dataengineering blew up over it. The "NoMoreBigONotations" thread went viral. Companies listened. They dropped the algorithmic rounds.

And then they replaced them with absolute chaos.

Why DSA Never Fit Data Engineering in the First Place

Let's be clear about something: LeetCode was never a valid proxy for data engineering skill. It was a borrowed ritual from software engineering interviews that nobody bothered to adapt. Data engineers are rarely expected to write complex algorithms from scratch. We use pre-built libraries and frameworks. The daily work is SQL, pipeline architecture, data modeling, debugging, cost optimization, and dealing with upstream teams who break contracts without telling you.

The best data engineers I've worked with would struggle on a LeetCode hard. And the engineers who ace competitive programming challenges? They frequently struggle with data modeling, pipeline design, and the kind of real-world optimization that actually matters. It's an inverse correlation, and it's been staring us in the face for years.

DSA is a mechanism to rank candidates; not an indicator of data engineering experience. Accept it for the arbitrary IQ measuring stick that it is.

26% of data engineering job ads in 2026 don't even mention education requirements anymore. The industry is finally pivoting toward practical skill assessment. Hiring timelines now exceed 60 to 90 days for complex enterprise roles. Interview loops run 5 to 7 rounds. And yet, the most important question remains unanswered: what are we actually testing for?

Most candidates don't fail data engineering interviews because of SQL or Python. They fail because they can't connect everything together under pressure and communicate it clearly. That's a completely different skill than reversing a linked list.

The Replacement: Three Interviews, Zero Consensus

Here's where it gets ugly. Companies dropped DSA and replaced it with whatever their hiring manager felt like that quarter. There is no standard. There is no consensus. There is barely even a pattern.

Company A wants you to do a 60-minute Cursor-based live build where you implement a feature in a real codebase. Company B wants pure system design: vague, open-ended, no single correct answer, and every interviewer weights trade-offs differently. Company C sends you the interview rules 24 hours before the onsite, and those rules contradict what the recruiter told you two weeks ago. Company D gives you an 8-hour take-home that's definitely 15 hours of work and pays you nothing for it.

If you're running parallel loops (and you should be; it's the only sane strategy), you are now simultaneously prepping for three completely different skill sets with zero overlap. One company allows Cursor, one bans it, one grades on "cleverness," one grades on "correctness." This isn't a hiring process. It's a lottery where you don't know which ticket you bought.

Startups compress everything into 2 to 3 rounds focused on "can you ship on day one." Big Tech runs 4 to 6 standardized rounds emphasizing system design and scale. Mid-market companies? They interview data engineers like they're software engineers, because nobody told them not to. Candidates get blindsided. You prep like it's a data role and walk into SWE-level production-grade coding requirements with full test suites.

For the architecture-style rounds, datadriven.io lets you work through the pipeline-design and data-modeling drills end-to-end instead of just reading about them. That matters, because system design is actually harder to prepare for than LeetCode. At least with DSA, there's consensus on what a good answer looks like. System design? No rubric. No "correct" answer. And every interviewer has a different opinion on whether you should optimize for cost, latency, or data freshness. You're training for a ghost target.

AI Made It Worse, Not Better

Here's the part nobody wants to say out loud: AI didn't lower the interview bar. It raised it invisibly.

Canva replaced its "Computer Science Fundamentals" round with "AI-Assisted Coding" in mid-2025. Candidates now face vague, open-ended challenges like "design an aircraft takeoff and landing control system." 64% of companies still ban AI in interviews, but 80% of candidates use LLMs anyway on take-homes. Meanwhile, 67% of startups explicitly allow AI. Meta, Rippling, Google, Canva, and Shopify all permit AI use in live technical sessions. The policy landscape is a mess.

One CTO told a candidate mid-interview to leave Cursor on. "We want to see how you solve this with AI." The problems got harder. When AI handles the boilerplate, the interviewer's expectations shift from "can you code?" to "can you architect while AI codes for you?" That's a completely different evaluation, and most candidates aren't ready for it.

The goal has evolved: interviewers want to understand how you evaluate, modify, and trust AI-generated answers. Seniors use AI to compress tedious work while maintaining design control. Staff engineers direct AI through complex tasks while monitoring quality. But here's the problem; nobody tells you which version of this test you're walking into. One company wants to see you pair-program with Cursor like it's a junior engineer on your team. The next company will disqualify you for opening ChatGPT.

Companies publicly mandate AI usage daily in production, then secretly ban it in interviews. That's not a hiring process. That's a credibility gap.

What Hiring Managers Say They Want (When They Bother to Say Anything)

I've been on hiring panels where we passed on strong candidates for the dumbest reasons. So let me tell you what actually separates the hires from the passes, at least at companies that have thought about it for more than five minutes.

They want problem-solving mindset over tool knowledge. If you walk into an architecture round and start listing tools instead of describing the problem you're solving, that's a concern. Concepts transfer across tools; tool knowledge doesn't transfer across concepts. This has always been true, and it's finally becoming the interview thesis at companies that are paying attention.

They want business literacy. A query that runs in 3 seconds instead of 30 might save a downstream BI team hours of waiting. Does the candidate connect technical decisions to business outcomes? If your pipeline is technically perfect but ignores downstream consumers or compliance, you're not a hire. You're a liability.

They want you to reason about boundaries. Don't propose a single-pattern solution. Describe the boundary between patterns and the contracts that flow across it. That's the senior signal. At staff level, they want to see you prevent problems, not just solve them.

The irony is thick: these are all reasonable things to test for. But only about a third of interview loops include a dedicated data modeling round. A third. The single most important skill in data engineering, and two-thirds of companies don't even have a round for it. They'll spend 45 minutes on a LeetCode medium (or its chaotic replacement) and zero minutes on whether you understand grain, slowly changing dimensions, or why wide denormalized tables with complex types are eating star schema alive.

Cloud cost efficiency is now one of the highest-scored interview categories. Companies are tying bonus incentives to cloud cost optimizations. This makes sense. Storage is 2 cents per GB per month. Engineer time is $100 an hour. The economics killed star schema, and now they're killing the interview formats that don't test for economic reasoning.
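To make the storage-versus-time argument concrete, here is a rough back-of-the-envelope sketch. The two rates come from the paragraph above; the 500 GB of redundant data is a hypothetical figure, and the sketch deliberately ignores compute and egress costs, which a real comparison would have to include.

```python
# Rates quoted in the text; the scenario below is an illustrative assumption.
STORAGE_USD_PER_GB_MONTH = 0.02
ENGINEER_USD_PER_HOUR = 100.0


def monthly_storage_cost(extra_gb: float) -> float:
    """Monthly cost of storing extra_gb of redundant, denormalized data."""
    return extra_gb * STORAGE_USD_PER_GB_MONTH


def breakeven_hours_per_month(extra_gb: float) -> float:
    """Engineer hours per month the wide table must save to pay for itself."""
    return monthly_storage_cost(extra_gb) / ENGINEER_USD_PER_HOUR


# Hypothetical case: denormalizing duplicates 500 GB of dimension data.
extra_gb = 500
print(f"storage: ${monthly_storage_cost(extra_gb):.2f} per month")
print(f"breakeven: {breakeven_hours_per_month(extra_gb) * 60:.0f} engineer-minutes per month")
```

Under those assumptions the wide table only has to save a few minutes of engineer time a month to come out ahead, which is the kind of reasoning a cost-focused interviewer is probing for.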

The Real Problem Nobody Wants to Admit

The inconsistency isn't accidental. It's evidence that the role itself transformed faster than hiring practices could keep up.

Between 2023 and 2026, data engineering moved from "batch ETL plumber" to a role that combines real-time architecture, cloud cost optimization, metadata governance, platform engineering, and AI integration. Companies testing SQL plus system design plus Cursor builds aren't being random. They're testing for three different versions of the job simultaneously because they don't yet know which version matters most.

That's not an excuse. It's a diagnosis.

The community is furious not because DSA is gone, but because at least DSA was consistent. You could grind 50 mediums and be solid. Now? 97% of data engineers report burnout. 70% are likely to leave their jobs within 12 months. Hiring timelines stretch past 90 days. And at the end of that timeline, you might get an offer, be told it was sent, never receive it, do four more rounds, pass again, and have the headcount closed. I'm not making that up. That happened to me.

The interview process isn't designed for candidates. It's designed for companies to feel thorough. The data engineering community won the argument against DSA, and the prize was chaos.

I've been through three waves of "data engineering is getting automated away." Still here. Still employed. Still debugging the same categories of problems. The tools change every 18 months. The problems don't change. Schema drift, late-arriving data, upstream teams breaking contracts without telling you. These are eternal. The interview formats will eventually stabilize around testing for these eternal problems.
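Schema drift is the easiest of those eternal problems to make concrete. A minimal sketch, assuming a schema can be represented as a plain dict of column name to type string; the function name and the sample columns are hypothetical and not taken from any particular tool:

```python
def diff_schema(expected: dict[str, str], observed: dict[str, str]) -> dict[str, list[str]]:
    """Compare an agreed schema against what the upstream producer actually sent."""
    shared = set(expected) & set(observed)
    return {
        # Columns the contract promised but the producer stopped sending.
        "missing": sorted(set(expected) - set(observed)),
        # Columns that appeared without anyone telling you.
        "added": sorted(set(observed) - set(expected)),
        # Columns still present but with a different type.
        "retyped": sorted(c for c in shared if expected[c] != observed[c]),
    }


# Hypothetical contract versus a hypothetical Tuesday-morning payload.
contract = {"order_id": "bigint", "amount": "decimal", "created_at": "timestamp"}
payload = {"order_id": "string", "amount": "decimal", "currency": "string"}

drift = diff_schema(contract, payload)
if any(drift.values()):
    print("schema drift detected:", drift)
```

The check itself is trivial. The judgment call, and the part an interviewer tends to care about, is what the pipeline does next: fail loudly, quarantine the batch, or evolve the table.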

Until then? Treat prep like a job. Accept that every loop will be different. Ask recruiters what types of questions to expect; and if you don't get good answers, look online and at the job description. Prep for system design, SQL fluency, data modeling, and yes, basic Python. Cover the surface area because nobody else is going to narrow it down for you.

What's the worst interview format you've encountered since companies started dropping DSA rounds? I genuinely want to know, because I thought my eight-round saga was bad, and I keep hearing stories that make it look quaint.

Junior Data Engineers Are Getting Wiped Out. Seniors Are Thriving.

DataDriven — Tue, 12 May 2026 10:05:09 +0000

Three years ago, a company I was at hired eight junior data engineers in a single quarter. Boilerplate ETL, basic SQL transforms, test scaffolding, docs. The standard apprenticeship pipeline. Last month, that same company posted two senior DE roles and zero junior ones. The eight seats are gone. Not frozen; gone. The work those engineers did still gets done. An LLM and two staff engineers handle it now.

This isn't a hot take. It's Q1 2026 by the numbers: 52,050 tech layoffs announced in the first three months of the year, a 40% jump over Q1 2025. Nearly half of those cuts were attributed to AI-driven automation. And the people getting cut aren't the ones designing pipeline architectures or negotiating data contracts with upstream teams. They're the ones writing the boilerplate that AI now generates on demand.

The seniority bifurcation in data engineering is real, it's accelerating, and if you're early in your career, you need to understand the mechanics of it before you can do anything about it.

The Junior Toolkit Got Automated First

Here's what a typical junior data engineer did two years ago: wrote basic ETL scripts, generated dbt models from specs, built simple Airflow DAGs, ran data quality checks, documented schemas. Useful work. Necessary work. Also, as it turns out, exactly the kind of work that LLMs are terrifyingly good at.

The numbers are brutal. 70% of data quality checks are now automated. 65% of ETL/ELT pipeline design can be generated by AI code assistants. SQL generation tools hit 90% accuracy on first pass. Developers report 88% productivity increases with AI, spending 60% less time on boilerplate code, database schemas, and API creation.

That's not "AI is coming for your job" fear-mongering. That's the specific, measurable erosion of the tasks that justified hiring someone at $72K to sit in a seat and learn.

The work isn't gone. The justification for hiring someone cheap to do it is.

Companies that used to bring on cohorts of 5 to 10 junior engineers now handle the same workload with 2 to 3 seniors plus AI tooling. Entry-level data engineer positions dropped 20 to 35% globally over the past 12 months. Recently hired workers (42%) and entry-level employees (41%) face disproportionate layoff risk compared to senior cohorts. The apprenticeship ladder that built every senior engineer reading this article is being pulled up behind us.

And here's the part that should make you uncomfortable if you're a senior who benefited from that ladder: this isn't a technology readiness problem. There's a fascinating gap in the data. Data engineers show 75% theoretical AI exposure but only 37% observed exposure. Companies know AI can automate junior work. Many just haven't pulled the trigger yet because complex data systems break in unexpected ways and they'd rather keep a human in the loop than risk a silent pipeline failure from auto-generated code.

That gap is closing. Fast.

Seniors Aren't Just Surviving; They're Getting Promoted

While junior roles contract, the senior market is doing something counterintuitive: growing. Senior data engineer compensation is up 12 to 18% year over year. Base salaries hold at $147K to $179K nationally, with top talent in SF commanding $233K. Engineers with Databricks or Snowflake certifications see a 10 to 15% premium on top of that. Roles with demonstrated AI skills command another 15 to 30% salary premium.

40% of data teams actually grew in 2025, up from 14% the year before, and budgets increased 30%. Read that again. Layoffs and growth are happening simultaneously. That's not contradictory; it's compositional. Companies are cutting junior headcount and reinvesting in senior hires who can own broader scope with AI leverage.

The global data engineering market hit $105 billion in 2026 and is projected to reach $213 billion by 2031. The Bureau of Labor Statistics projects 36% job growth through 2034. Data engineering is not dying. It's not shrinking. It's getting more expensive and more senior.
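For a sense of scale, the two market figures above imply a compound annual growth rate. A quick sketch of the arithmetic, assuming five compounding years between the 2026 and 2031 figures (the helper name is hypothetical):

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start value, an end value, and a span."""
    return (end_value / start_value) ** (1 / years) - 1


# $105B in 2026 growing to $213B in 2031, treated as five annual periods.
rate = implied_cagr(105, 213, 5)
print(f"implied growth: {rate:.1%} per year")  # roughly 15% per year
```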

I've been through three waves of "data engineering is getting automated away." Still here. Still employed. Still debugging the same categories of problems. Schema drift, late-arriving data, upstream teams breaking contracts without telling you. These are eternal. AI doesn't fix them because they're not code problems; they're judgment problems, communication problems, business context problems. The kind of problems you can only solve after years of getting burned by them.

The role is shifting from pipeline plumber to system architect. Senior DEs are moving up the stack while entry-level boilerplate gets consumed by tools. The engineers who thrive won't write the most SQL; they'll design the frameworks that let AI write SQL safely.

The Skills That Actually Matter Now

The bar for what counts as "data engineering skills" moved. A few years ago, you could be a strong DE focused mainly on batch ETL and warehousing. Now teams expect you to support ML workflows, real-time data needs, governance, and cost optimization, all under the same job title.

Streaming infrastructure went from "nice to have" to competitive moat. Uber launched IngestionNext in March 2026, cutting data latency from hours to minutes and reducing compute costs 25% with Kafka, Flink, and Hudi. I still maintain that most companies don't need streaming (most of y'all don't), but the companies that do need it are the ones paying $250K+ for the engineers who can build it.

Cloud proficiency is non-negotiable; over 94% of enterprises have adopted cloud. AI skill requirements appear in 71% of U.S. tech job postings, up 181% year over year. And the real shortage isn't data engineers; it's governance experts wearing data engineer hats. Companies that used to treat governance as a separate function now embed it in every DE hire. If you can articulate data lineage, PII handling, and audit trails, you command a premium. If you can only write Spark jobs, you're becoming a commodity.

The concept still holds: learn data modeling, query optimization, understanding why things break. Those transfer across every tool. But the floor has risen. The minimum viable senior DE in 2026 needs architecture thinking, AI fluency, governance awareness, and cloud-native platform skills. For the architecture and data modeling side of interview prep, datadriven.io lets you work through pipeline-design and modeling drills end-to-end instead of just reading about them; that kind of hands-on practice is what actually builds the muscle.

Hiring timelines for senior roles have stretched to 60 to 90 days in enterprise settings. That's not bureaucracy; that's scarcity. Companies can't find enough people who combine architecture, AI integration, governance, and platform engineering in a single candidate. The 250,000-person shortage in AI/ML skillsets compounds everything.

Can Juniors Still Break In?

Yes. But not the way it used to work.

The direct path into data engineering is mostly gone. "Data engineer" is not an entry-level position. It combines business context, analytics insight, infrastructure, software engineering, and SRE. The industry consensus now expects 2 to 6 years of prior experience, not a first career jump.

The realistic path looks like this: start as a SQL-heavy data analyst, analytics engineer, DBA, or backend engineer. Spend 18 to 24 months building production experience and domain knowledge. Then transition to DE internally or through a targeted job search. This detour is becoming standard, not exceptional.

If you're 3 years into an adjacent role running pipelines in production, that's not "close to being ready." You're doing the job. Stop discounting what you've already built.

Portfolio projects help demonstrate skills but rarely replace production experience. That's the catch-22. You can't get production experience without the role, and you can't get the role without production experience. The way through is the adjacent role. Analyst to analytics engineer to data engineer. It's longer. It works.

IBM tripled entry-level hiring in 2026, explicitly stating that AI still needs a human touch. That's an outlier, but it proves the path isn't completely closed. Some enterprises still see juniors as necessary friction-catchers. The BLS projects data engineering as one of the fastest-growing roles through 2030. The demand is there; it's just shifted upward in seniority.

Here's what I'd tell anyone trying to break in right now: stop learning tools. Learn concepts. Data modeling is the core skill. Getting the model wrong upstream means everything downstream is pain. Pick one orchestration tool, build something small that forces you to deal with failures, retries, and alerting. Then pick the next one. Treat the job search like a job. I did somewhere around 20 interview loops in a single search. Some went well. Some went laughably poorly. The grind is the strategy.
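As a starting point for that "something small," here is a minimal sketch of the failure-handling piece, written in plain Python rather than any specific orchestrator. The function name, parameters, and the `alert` hook are all hypothetical; real orchestrators expose their own retry and alerting settings, and this only illustrates the idea of exponential backoff with an alert on final failure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    alert: Callable[[str], None] = print,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying with exponential backoff; alert and re-raise on final failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                # Out of retries: tell a human, then let the failure surface.
                alert(f"task failed after {attempt} attempts: {exc!r}")
                raise
            # Wait 1x, 2x, 4x ... the base delay before the next attempt.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Hypothetical flaky extract: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_extract() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("upstream timed out")
    return "42 rows loaded"


print(run_with_retries(flaky_extract, base_delay_s=0.0))
```

Passing `sleep` and `alert` in as parameters is a deliberate choice: it keeps the retry logic testable without real waiting or a real paging system.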

The Ladder Problem

The uncomfortable truth behind all of this is structural. AI creates more high-leverage work for seniors while erasing the stepping stones juniors traditionally used to become seniors. The boilerplate ETL, the basic SQL, the test generation; that was the apprenticeship. That was how you learned why pipelines break, how schemas drift, what happens when upstream teams push breaking changes at 2am. If AI handles all of that, where do future senior engineers come from?

Nobody's talking about this enough. The industry is celebrating productivity gains without asking what the pipeline (the human one) looks like in five years. Junior engineers who never debug a failed DAG because AI handles it won't develop the foundational understanding necessary to debug complex systems when the AI fails. And AI will fail. It always does, usually at 2am, usually on the pipeline that finance depends on for board decks.

The data engineering career isn't dying. It's bifurcating. Senior roles are growing, compensation is climbing, and the problems are getting harder and more strategic. Junior roles are contracting, the bar for entry is rising, and the old apprenticeship model is breaking down. Both of these things are true simultaneously.

I'm not a doomer about this. The field is healthy, expanding, and full of hard problems worth solving. But the path in looks nothing like it did three years ago, and pretending otherwise is a disservice to every bootcamp grad refreshing LinkedIn right now.

If you're senior: you're in a strong position. Use the leverage. Learn the AI tooling. Move up the stack.

If you're junior: the path is longer and harder than it was. That's not your fault. It's the industry being the industry. Start adjacent, build real production experience, focus on concepts over tools, and grind.

What's your read on the junior pipeline problem? Are we building a generation of seniors who never went through the apprenticeship, or will the path just look different? Genuinely curious what people on both sides are seeing.

DEV Community: DataDriven

Data Engineer Salaries in 2026: The Numbers Are Lying

Why Every Salary Site Disagrees by $120K

Role Title Chaos Is Pricing You Against the Wrong Pool

2023 Job Ads Are Still Haunting Your 2026 Number

Layoffs Created a Tier, Not a Glut

What Number to Actually Put in the Field

Your Data Engineering Take-Home Is Now 20 Hours of Free Work

The Scope Creep Nobody Talks About

AI Banned, Rubrics Unchanged

70% of You Will Never Hear Why

What Actually Works

The System Knows It's Broken

Top 12 Pipeline Architecture Interview Questions, With Answers

1. Design an idempotent ingestion pipeline for a high-volume event stream

2. How would you handle a schema change from an upstream producer you don't control?

3. Batch or streaming: how do you decide?

4. Design a backfill strategy that won't corrupt live data

5. Where do you place data quality checks in a pipeline?

6. How do you handle late-arriving data?

7. Design a pipeline with fan-out/fan-in dependencies

8. Explain the tradeoffs between Lambda and Kappa architecture

9. How do you prevent alert fatigue in pipeline monitoring?

10. How do you enforce schema contracts across teams?

11. Design for exactly-once semantics in a payment processing pipeline

12. Your pipeline silently dropped 40% of records for six months. How do you find out, and how do you prevent it?

The 12 Data Modeling Interview Questions that Matter

1. Define the Grain of a Fact Table

2. Star Schema vs. Snowflake Schema

3. Design a Fact Table for E-Commerce Orders

4. SCD Type 2 Implementation

5. Late-Arriving Dimensions

6. Bridge Tables for Many-to-Many Relationships

7. Fact vs. Dimension Classification

8. Factless Fact Tables

9. Accumulating Snapshot Fact Table

10. Conformed Dimensions

11. Normalization vs. Denormalization for Analytics

12. Late-Arriving Facts and Backfills

Top 12 Spark Interview Problems for Data Engineers, With Answers

1. Identify the Shuffle

2. Tune Shuffle Partitions

3. Force a Broadcast Join

4. Handle Data Skew with Salting

5. Predict AQE Behavior

6. Read a Catalyst Plan

7. Window Function Partition Skew

8. Cache Storage Level Trade-off

9. Diagnose an Executor OOM

10. Join Type Selection Under Constraints

11. Eliminate an Unnecessary Shuffle

12. AQE Partition Coalescing

Top 12 SQL Interview Problems for Data Engineers, With Answers

1. Customers Who Spent Over $500

2. The Double-Counting Trap

3. Latest Record Per Customer

4. Deterministic Deduplication With Tie-Breaking

5. Customers With No Orders (Anti-Join)

6. NOT IN vs NOT EXISTS With NULLs

7. Recursive CTE Org Chart

8. Self-Join for Consecutive-Day Logins

9. ROW_NUMBER vs RANK vs DENSE_RANK

10. Running Total With Correct Window Frame

11. Find the Gap in a Sequence

12. Sessionize an Event Stream

Top 12 Python Interview Problems for Data Engineers, With Answers

1. Parse Log Lines Into Structured Records

2. Stream a Large File With a Generator

3. Count Event Frequencies

4. Deduplicate Records by Composite Key

5. Top K Most Frequent User IDs

6. Multi-Key Sort

7. Sliding Window: Maximum Throughput

8. Group and Aggregate With Plain Python

9. Flatten Nested JSON

10. Merge Two Sorted Iterators

11. Validate and Quarantine Bad Rows

12. Chunk a File for Parallel Processing

80% of DE Candidates Use AI on Take-Homes. Companies Can't Stop It.

The Honest Candidate Tax