<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DataDriven</title>
    <description>The latest articles on DEV Community by DataDriven (@datadriven).</description>
    <link>https://dev.clauneck.workers.dev/datadriven</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864671%2F923e8540-fa96-491d-adb6-0e01c42ec26a.png</url>
      <title>DEV Community: DataDriven</title>
      <link>https://dev.clauneck.workers.dev/datadriven</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.clauneck.workers.dev/feed/datadriven"/>
    <language>en</language>
    <item>
      <title>Data Engineer Salaries in 2026: The Numbers Are Lying</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Thu, 25 Jun 2026 10:07:39 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/datadriven/data-engineer-salaries-in-2026-the-numbers-are-lying-5877</link>
      <guid>https://dev.clauneck.workers.dev/datadriven/data-engineer-salaries-in-2026-the-numbers-are-lying-5877</guid>
      <description>&lt;p&gt;Last year I was helping a friend prep for senior data engineer interviews. He'd been building pipelines at a Series B for four years, solid production experience, and wanted to know what number to put in the salary field. So he did what everyone does: checked Glassdoor, Indeed, PayScale, and Levels.fyi.&lt;/p&gt;

&lt;p&gt;He got four numbers. They disagreed by over $120,000.&lt;/p&gt;

&lt;p&gt;Glassdoor said $133K. PayScale said $100K. Indeed's senior number sat at $216K. Levels.fyi split the difference at $157K. Same title, same country, same year; four answers that can't all be right. And here's the thing: none of them are lying. They're just counting different people, over different time windows, with different biases baked in. The result is that candidates trying to benchmark their &lt;strong&gt;data engineer salary&lt;/strong&gt; are pricing themselves against a number that doesn't represent their actual market.&lt;/p&gt;

&lt;p&gt;This is a problem. In a hiring environment where 52,050 tech workers got laid off in Q1 2026 alone, where senior roles take 60 to 90 days to fill, and where title inflation has made "data engineer" mean three different jobs depending on who's posting, getting your number wrong has real &lt;strong&gt;career&lt;/strong&gt; cost. You either leave $30K on the table or you overshoot and get ghosted. Both outcomes trace back to the same root cause: the data you're benchmarking against is broken.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Every Salary Site Disagrees by $120K
&lt;/h2&gt;

&lt;p&gt;Each major &lt;strong&gt;compensation&lt;/strong&gt; source has its own rot problem. Understanding the bias is more useful than trusting the number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Glassdoor&lt;/strong&gt; reports $133,484 average across 32,984 submissions. The issue: it's entirely self-reported, and higher earners submit more frequently. The person who just got a $180K offer is more motivated to log it than the person who accepted $115K and moved on. The sample skews up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PayScale&lt;/strong&gt; reports roughly $100K. That sounds low because it is; 71% of their data engineering respondents are mid-level or junior. PayScale validates every data point and refreshes on sub-90-day cycles, which makes it the most accurate floor for what actually clears at offer stage. But candidates see $100K and panic. They shouldn't. They're looking at a junior-weighted average.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Indeed&lt;/strong&gt; sits at $216K for senior roles. The problem here is temporal: Indeed averages job postings going back 36 months. Their June 2026 number includes postings from June 2023, before the layoff waves, before signing bonus compression, before the market shifted. You're benchmarking against fossil data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Levels.fyi&lt;/strong&gt; pegs the median at $157,450, but this population skews heavily toward top-tier tech companies and excludes non-tech firms where data engineers earn 20 to 35% less. Google's median is $278K. Capital One's is $130K. That's a $148K spread for the same title on the same platform.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The salary data isn't wrong. It's measuring different populations, different time windows, and different definitions of the job. Once you know which population you're in, the number becomes useful. Until then, it's noise.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The practical damage is real. A mid-market data engineer sees Glassdoor's $133K, anchors there, and never learns that the number includes FAANG outliers pulling the average up. Or worse, they see Indeed's $216K senior figure and counter-offer at a number that makes the hiring manager close the tab.&lt;/p&gt;

&lt;h2&gt;
  
  
  Role Title Chaos Is Pricing You Against the Wrong Pool
&lt;/h2&gt;

&lt;p&gt;Here's the less obvious problem: the job you're benchmarking might not even be the job you're doing.&lt;/p&gt;

&lt;p&gt;Analytics engineers earn $155K to $195K median in 2026. ML engineers command a 38% &lt;strong&gt;salary&lt;/strong&gt; premium over data engineers at mid-career. Data scientists occupy yet another band. These are different roles with different compensation structures. But companies routinely mislabel them.&lt;/p&gt;

&lt;p&gt;Analytics engineer postings grew 114% from 2023 to 2024, yet dbt Labs openly admits the title boundaries are blurring. Analysts drift into dbt modeling. Data engineers adopt dbt as standard tooling. The result: a "$150K dbt role" could be transformation work (analytics engineer) or pipeline infrastructure (data engineer), and the salary sites have no idea which one they're counting.&lt;/p&gt;

&lt;p&gt;37,000 &lt;strong&gt;data engineering&lt;/strong&gt; jobs post monthly on average, but a significant portion of those are mislabeled analytics engineer, ML engineer, or data scientist roles. When a company posts "Senior Data Engineer" but the job is really dbt plus Snowflake plus stakeholder dashboards, that's an analytics engineer role at data engineer pricing. The candidate benchmarks against infrastructure DE salaries ($115K to $160K) when they should be benchmarking against analytics engineer salaries ($155K to $195K). That's a $30K to $40K miss.&lt;/p&gt;

&lt;p&gt;The reverse kills you too. An analytics engineer who sees ML engineer salary data and anchors at $190K gets rejected as "overpriced" for the actual scope.&lt;/p&gt;

&lt;p&gt;The litmus test isn't the title. It's the job description. If it says dbt, Snowflake, and "stakeholder reporting," you're an analytics engineer regardless of what the posting calls you. Benchmark accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  2023 Job Ads Are Still Haunting Your 2026 Number
&lt;/h2&gt;

&lt;p&gt;Indeed's 36-month lookback window deserves its own section because the implications are worse than they look.&lt;/p&gt;

&lt;p&gt;In 2023, the median data engineer salary was $117,446. By June 2026, Indeed reports $136,776. That looks like 16.5% growth over three years, which isn't terrible. But the number is being held down by every posting from 2023 and 2024 that's still sitting in the average.&lt;/p&gt;

&lt;p&gt;Here's what makes this especially misleading: 68% of tech job postings included explicit salary ranges in 2025, up from 45% in 2023. Pay transparency laws made the data more granular. But Indeed weights all 36 months equally. A vague salary range guess from a 2023 pre-transparency posting counts the same as a precise, legally mandated range from 2026. Higher sample size, stale composition.&lt;/p&gt;

&lt;p&gt;Then there's the ghost job problem. One-third of employers admit to posting inactive roles. Greenhouse data found 18 to 22% of listings are never filled. Stale 2023 postings are more likely to be dormant, and they're inflating the denominator. You're benchmarking against jobs that don't exist anymore.&lt;/p&gt;

&lt;p&gt;The senior role divergence tells the real story. The $60K gap between mid-level ($133K) and senior ($175K) data engineers in 2026 suggests the market has repriced for experience. But the aggregate average is anchored by fossils. If you're mid-career, the number you see is artificially low.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layoffs Created a Tier, Not a Glut
&lt;/h2&gt;

&lt;p&gt;52,050 tech workers laid off in Q1 2026. A 40% jump over Q1 2025. Oracle cut 21,000. Amazon cut 16,000. Dell cut 11,000. Sounds like a buyer's market.&lt;/p&gt;

&lt;p&gt;It's not. Or at least, not uniformly.&lt;/p&gt;

&lt;p&gt;Those 52K cuts coexist with 67,000 active software engineering job postings in the same quarter, the highest posting volume in three years. Companies are cutting commoditized roles while hoarding data engineers, ML engineers, and security specialists. A junior full-stack engineer is in a buyer's market; a senior data engineer with Airflow and Spark experience is not. The "layoff market" narrative breaks down completely by skill tier.&lt;/p&gt;

&lt;p&gt;But here's the asymmetry that actually matters: companies take 60 to 90 days to fill senior roles because they're running multiple candidates in parallel. Individual candidates spend 3 to 9 months searching. The employer can wait. The candidate runs out of severance. That's where negotiation leverage shifts; not because the market is soft, but because one side has a deadline and the other doesn't.&lt;/p&gt;

&lt;p&gt;The data on negotiation is striking. Data engineers who negotiate earn $24,479 more annually, an 18.83% increase. 85% of counter-offers get at least partial acceptance. 70% of hiring managers expect you to negotiate. Only 44% of candidates actually do it. The $120K gap between salary sources is partly a measurement problem, sure. But it's also partly behavioral. The spread between 25th and 75th percentile reflects negotiation winners vs. passive accepters, not just market fragmentation.&lt;/p&gt;

&lt;p&gt;Engineers with current cloud and security skills close offers in 2 to 4 weeks. Everyone else faces the full timeline. Skill specificity determines leverage more than market conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Number to Actually Put in the Field
&lt;/h2&gt;

&lt;p&gt;Stop averaging the averages. Here's the hierarchy of sources, from most to least useful for your &lt;strong&gt;career&lt;/strong&gt; planning:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Levels.fyi&lt;/strong&gt; is best for FAANG and top-tier tech. Filter by company, level, and location. The by-company variance is massive ($278K at Google vs. $130K at Capital One), so the aggregate median is useless. You need the company-specific number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Glassdoor&lt;/strong&gt; is useful for the 25th to 75th percentile range at your target company, if they have enough submissions. The $141K to $219K senior DE range tells you more than the $175K mean.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PayScale&lt;/strong&gt; is the most accurate floor. If you're at a non-tech company or early in your career, this is closer to your reality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Indeed&lt;/strong&gt; is the least useful for current benchmarking. The 36-month window buries the signal.&lt;/p&gt;

&lt;p&gt;The actual number you put in the field should be the Levels.fyi or Glassdoor 75th percentile for the specific company, then negotiate. If 70% of hiring managers expect negotiation, pricing yourself at the median is pricing yourself to get negotiated down.&lt;/p&gt;

&lt;p&gt;And one more thing the salary sites never show you: base salary is barely over half the employer's total cost to hire. That $200K base offer costs the company $240K to $290K when you add payroll tax, benefits, recruiting fees (18 to 25% of first-year base), and onboarding ramp. They have more room than you think. The question is whether you know enough about your own market to ask for it.&lt;/p&gt;

&lt;p&gt;If you're prepping for the senior and staff loops where compensation actually diverges, strip back the "system design for software engineers" mentality; we built system design for data engineers with datadriven around pipeline architecture problems, not the load-balancer trivia that SWE prep loves and DEs never face on the job.&lt;/p&gt;

&lt;p&gt;The salary data is broken. The titles are broken. The timelines are longer. None of that changes the fact that data engineering compensation is strong and growing for engineers who know what they're actually worth. The trick is figuring out which population you belong to, not which average to believe.&lt;/p&gt;

&lt;p&gt;What's the biggest gap you've seen between what a salary site reported and what you actually earned or were offered?&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>career</category>
      <category>interview</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Your Data Engineering Take-Home Is Now 20 Hours of Free Work</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:15:44 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/datadriven/your-data-engineering-take-home-is-now-20-hours-of-free-work-10l4</link>
      <guid>https://dev.clauneck.workers.dev/datadriven/your-data-engineering-take-home-is-now-20-hours-of-free-work-10l4</guid>
      <description>&lt;p&gt;I got a &lt;strong&gt;take-home&lt;/strong&gt; assignment last year from a company I was genuinely excited about. "Should take about four hours," the recruiter said. Build an ingestion pipeline, model the data, write tests, document your design decisions, and prepare a 15-minute presentation walkthrough for the panel. Four hours. I laughed, closed my laptop, and started on it the next morning like it was a sprint. Sixteen hours later I had something I was proud of. Clean pipeline, solid tests, real documentation. I submitted it on a Sunday night. Monday I got a form rejection. No notes. No feedback. Not even which stage I failed. Just "we've decided to move forward with other candidates" and a link to their Glassdoor page.&lt;/p&gt;

&lt;p&gt;That was the moment I stopped pretending take-homes are assessments. They're consulting gigs. Unpaid ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scope Creep Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Five years ago, a &lt;strong&gt;data engineering&lt;/strong&gt; take-home was a focused exercise. Model this dataset into a star schema. Write a few SQL transforms. Maybe a short README. Two to four hours, tops. Bounded, reasonable, and actually useful for evaluating how someone thinks about data.&lt;/p&gt;

&lt;p&gt;That version is dead.&lt;/p&gt;

&lt;p&gt;Today, 68% of companies use take-home tests, up 12% year over year. And the scope has quietly ballooned into something unrecognizable. Full pipeline implementations. Test suites with coverage thresholds. Documentation that reads like a design doc. A presentation follow-up where you defend your architecture to a panel. We're talking 10 to 20 hours of work, routinely, for a role you haven't been offered.&lt;/p&gt;

&lt;p&gt;Industry best practice caps take-homes at 90 minutes of expected effort. The reality? Candidates consistently take 2x longer than company estimates to reach submission quality. That "four-hour" assignment is an eight-hour assignment. That "weekend project" is a week of evenings. And 25% of companies are still handing these out like they're reasonable asks.&lt;/p&gt;

&lt;p&gt;Here's the part that makes my eye twitch: 71% of engineering leaders openly say take-homes no longer generate useful signal. AI has degraded the format so completely that leaders themselves rate take-home signal as "degrading fastest" among all assessment types. They know it's broken. They keep doing it anyway.&lt;/p&gt;

&lt;p&gt;The attempted fix is even worse. Companies panicked about AI usage and responded by inflating scope. The logic, if you can call it that: make the assignment so large that AI can't do it alone. Except longer assessments don't defeat AI; they defeat candidates. Candidates with kids. Candidates working full-time jobs. Candidates from non-traditional backgrounds who can't burn 20 hours on a maybe. One candidate documented spending 32 hours on a single assignment, then got rejected for omitting a feature that was never mentioned in the requirements. Another was asked to build a learning module that would've billed at $2,800 as freelance work.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A four-hour take-home is a fair test. A 20-hour take-home is free consulting dressed up as an interview.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;59% of job seekers now say unpaid take-home assignments are the number one reason they won't apply. Not comp, not culture, not location. The assessment itself is the dealbreaker.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Banned, Rubrics Unchanged
&lt;/h2&gt;

&lt;p&gt;Two thirds of companies ban AI use in their &lt;strong&gt;interview&lt;/strong&gt; process. Sounds decisive. Except fewer than 30% of those companies have actually updated their assessments or retrained their interviewers. They slapped a "no AI" sticker on a 2015-era take-home and called it policy.&lt;/p&gt;

&lt;p&gt;The enforcement gap is almost comical. One company measured 80% of candidates using LLMs on take-home tests despite an explicit prohibition. AI cheating on take-homes doubled from 15% to 35% between June and December 2025. In purely technical roles, 48% of candidates show signs of unauthorized AI use. The ban is a suggestion, not a guardrail.&lt;/p&gt;

&lt;p&gt;Meanwhile, the rubrics these companies grade against were built to evaluate raw coding speed and syntax accuracy. Those signals collapsed the moment Claude could produce a clean solution in seconds. But nobody rewrote the rubric. Nobody redefined what "good" looks like when the baseline output quality shifted. Hiring managers score problem-solving and architecture judgment, but the assessment they hand out measures code-from-scratch, a skill that's now commodity.&lt;/p&gt;

&lt;p&gt;The split in the industry tells you everything. Meta and Shopify openly invite AI tools into their assessments. They've decided to test "can you use AI well" rather than "can you code without it." Goldman Sachs and Amazon maintain hard bans for candidates while investing heavily in internal AI tools for their own engineers. The hypocrisy is so blatant it's almost impressive. You can't use AI to get hired here, but once you're in, you'd better use it or you're slow.&lt;/p&gt;

&lt;p&gt;Banning AI in interviews creates a discontinuity between evaluation and production. In 2026, writing code without AI assistance is the exception, not the norm. You're testing candidates in an environment that doesn't reflect the environment they'll work in. That's not assessment; that's theater.&lt;/p&gt;

&lt;h2&gt;
  
  
  70% of You Will Never Hear Why
&lt;/h2&gt;

&lt;p&gt;Here's the stat that should make every &lt;strong&gt;hiring&lt;/strong&gt; manager uncomfortable: 69.7% of candidates receive zero feedback after rejection. Not "insufficient feedback." Zero. Nothing. A form email and silence.&lt;/p&gt;

&lt;p&gt;61% of candidates report being ghosted entirely after interviews. No rejection, no closure. Just silence from a company that asked them to spend a weekend building a pipeline.&lt;/p&gt;

&lt;p&gt;Companies hide behind legal risk. "We can't give feedback because candidates might sue." This is, to put it plainly, nonsense. Employment law distinguishes between subjective rejection reasons ("you seemed low-energy") and factual, role-specific feedback ("your schema migration approach didn't handle the edge case we were testing for"). The second type is almost litigation-proof. No engineer has successfully sued a company over constructive technical feedback. The legal defense is a myth that compliance teams perpetuate because "say nothing" is the lowest-variance strategy. It's organizational laziness wearing a legal costume.&lt;/p&gt;

&lt;p&gt;The business case against silence is overwhelming. 79% of candidates would reapply to a company if they'd received feedback. Recruiters who share feedback see a 126% increase in candidate referrals. Companies withholding feedback aren't just being rude; they're burning bridges they'll need to cross again in 18 months when they're hiring for the same role.&lt;/p&gt;

&lt;p&gt;But here's the real cruelty. When the assessment demands 10 to 20 hours, and the rejection carries zero feedback, you've extracted labor and returned nothing. Not compensation, not signal, not even a paragraph explaining what to work on. The candidate can't even reuse the learning because there is no learning. It's labor arbitrage dressed up as a &lt;strong&gt;career&lt;/strong&gt; opportunity.&lt;/p&gt;

&lt;p&gt;Only 17% of external candidates receive feedback, compared to 65% of internal candidates. If you already work there, you get a debrief. If you're on the outside spending your weekend on their assignment, you get a template. The double standard is institutional.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;The good news: some companies figured this out. The better news: it's not complicated.&lt;/p&gt;

&lt;p&gt;Live debugging interviews, running 60 to 90 minutes, are replacing puzzles at companies like Cloudflare, Datadog, and GitHub. Candidates get a broken system. They debug it. Interviewers watch the process: how do you form a hypothesis, how do you narrow the search space, do you narrate your thinking. You're evaluated on engineering judgment, not memorization speed. A candidate who thinks aloud and corrects wrong hypotheses scores higher than one who guesses fast but can't explain why.&lt;/p&gt;

&lt;p&gt;For senior and staff roles, pair programming on a debugging or refactoring task is the highest-signal round you can run. Forty-five minutes, real code, real collaboration. It surfaces the kind of judgment that 20-hour take-homes never could, because judgment shows up in conversation, not in a solo sprint nobody watches.&lt;/p&gt;

&lt;p&gt;Uber runs a two-hour on-site schema critique instead of toy problems. Stripe bounds their take-homes to one to three hours with clear scope. Both companies report higher completion rates and better signal than the bloated formats they replaced.&lt;/p&gt;

&lt;p&gt;The pattern is obvious: bounded time, realistic work, human interaction. If you want to know how someone debugs a broken DAG, hand them a broken DAG and watch. Don't ask them to build one from scratch over a weekend and then ghost them.&lt;/p&gt;

&lt;p&gt;If you're a candidate stuck grinding through these loops, focus your prep on the concepts that transfer across every format: data modeling, pipeline architecture, query optimization. I've found that a resource like datadriven.io is good for etl interview questions if you want structured reps on the technical fundamentals without wading through another generic course. The game is arbitrary, but the concepts compound regardless of which format a company throws at you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The System Knows It's Broken
&lt;/h2&gt;

&lt;p&gt;72% of job seekers report negative mental health impacts from lengthy hiring processes. Candidate ghosting hit a three-year high in 2026. The market has 2.2 million fake openings monthly, candidates respond with AI-powered mass applications, companies respond by banning AI, and the entire system spirals further from producing any useful signal for anyone.&lt;/p&gt;

&lt;p&gt;The profession acknowledges the assessment is unreliable while refusing to stop using it. This isn't a transitional phase. It's institutional paralysis. Companies would rather extract 20 hours of free work from someone they'll reject silently than spend 90 minutes on a live session that actually reveals how an engineer thinks.&lt;/p&gt;

&lt;p&gt;I've been through enough of these loops to know the system doesn't reform itself. It changes when candidates refuse to participate and when hiring managers with enough authority say "this is stupid, let's stop." If you're in a position to design an interview process, bound the time, provide feedback, and evaluate how people think, not how much free labor they'll tolerate.&lt;/p&gt;

&lt;p&gt;If you've done one of these 20-hour take-homes recently: what was the assignment, and did you hear anything back?&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>interview</category>
      <category>career</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Top 12 Pipeline Architecture Interview Questions, With Answers</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Thu, 18 Jun 2026 10:04:24 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/datadriven/top-12-pipeline-architecture-interview-questions-with-answers-cf7</link>
      <guid>https://dev.clauneck.workers.dev/datadriven/top-12-pipeline-architecture-interview-questions-with-answers-cf7</guid>
      <description>&lt;p&gt;I've sat on both sides of the &lt;strong&gt;system design&lt;/strong&gt; table enough times to know what separates a hire from a no-hire at the senior level. It's not SQL syntax. It's not whether you know the Spark API. It's whether you can talk about &lt;strong&gt;pipeline architecture&lt;/strong&gt; like someone who's been paged at 2am because a pipeline silently dropped 2M rows and nobody noticed for six days. These 12 &lt;strong&gt;data pipeline&lt;/strong&gt; interview questions are the ones I've seen decide loops at companies everyone's heard of. They don't have one right answer; they have a constellation of tradeoffs, and your job is to show you've navigated those tradeoffs in production, not just read about them on a blog.&lt;/p&gt;

&lt;p&gt;The signal between senior and mid-level candidates is whether you bring up the failure modes before the interviewer prompts you. Strong answers surface at least two failure modes unprompted: partial writes, schema drift, dedup edge cases, exactly-once semantics. If you're prepping for these, i used datadriven for data pipeline interview prep and it's the best resource I've found for this category. Want to practice these for real? &lt;a href="https://datadriven.io/data-pipeline-interview-questions" rel="noopener noreferrer"&gt;Solve these problems live here&lt;/a&gt; with a real editor and graded solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Design an idempotent ingestion pipeline for a high-volume event stream
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; You're receiving 500K events per minute from a Kafka topic into a data warehouse. Events can be delivered more than once. Design an ingestion pipeline that guarantees no double-counting in downstream analytics, even after retries or partial failures.&lt;/p&gt;

&lt;p&gt;The answer starts with the word &lt;strong&gt;idempotent&lt;/strong&gt; before anything else. Your sink must produce the same result whether a record is written once or five times. The standard pattern is partition-level overwrites: each run targets a specific time partition, deletes existing data for that partition, and writes the full replacement set. Alternatively, upsert (INSERT ... ON CONFLICT) keyed on a natural business key or a deterministic hash of the event payload. Never key your dedup on &lt;code&gt;processed_at&lt;/code&gt; timestamps; the same source record reprocessed at different times creates different target records, and you've silently violated idempotency while every validation check passes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This is the most common opener in &lt;strong&gt;data engineering&lt;/strong&gt; interviews because it immediately reveals depth. Mid-level candidates say "use Kafka's exactly-once." Senior candidates explain that Kafka producer idempotence only holds within a single connection/partition pair and breaks across restarts or partition reassignment. The real answer is: at-least-once delivery with idempotent sinks is operationally equivalent to exactly-once &lt;em&gt;without the coordination overhead&lt;/em&gt;. Exactly-once semantics costs 2-5ms latency and a 10-20% throughput reduction. For analytics pipelines where dashboards recalculate on query, that's wasted money. Reserve exactly-once for financial transactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. How would you handle a schema change from an upstream producer you don't control?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; An upstream team adds a new field, renames an existing column, and deprecates another. Your pipeline consumes this data. Walk through your approach.&lt;/p&gt;

&lt;p&gt;You need to distinguish three types of changes. Additive changes (new fields) are backward-compatible; your pipeline should ignore unknown fields and not break. Renaming is a breaking change, full stop. Column removal is breaking. The answer is &lt;strong&gt;schema contracts&lt;/strong&gt; enforced at write time, not read time. Register schemas in a schema registry (Avro with Confluent Schema Registry or Protobuf with field numbers). Enforce backward compatibility checks in CI before the producer can publish. On the consumer side, pin to a known schema version and fail loudly on incompatible changes rather than silently ingesting garbage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Snowflake's own December 2025 outage was caused by a backward-incompatible schema change that took down 10 of 23 global regions for 13 hours. If Snowflake can't get this right internally, your upstream team definitely can't. The interviewer is checking whether you design for upstream chaos. The trap: candidates who say "just use mergeSchema=true" get dinged. Blindly enabling mergeSchema leads to 200-column tables nobody trusts and downstream chaos disguised as data quality. The difference between "hire" and "strong hire" is knowing when NOT to use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Batch or streaming: how do you decide?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Your stakeholder says they want "real-time" data. Walk through how you'd determine whether to build a batch or streaming pipeline.&lt;/p&gt;

&lt;p&gt;First question back to the interviewer: what's the actual latency requirement? If the answer is "we want dashboards updated every morning," that's batch. A daily batch job running 20 minutes for $5 beats a streaming pipeline costing $500/day with a dedicated on-call engineer. Default to batch unless there's a clear latency requirement under 5 minutes. Batch is simpler to build, cheaper to run, easier to debug, and produces deterministic outputs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Most companies don't have real-time data needs. They have real-time data &lt;em&gt;wants&lt;/em&gt; and hourly-batch data &lt;em&gt;needs&lt;/em&gt;. Your job is to figure out which one you're actually solving.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Premature streaming is the new premature optimization. Interviewers increasingly ask "why streaming?" first, not "how would you build it?" 54% of enterprises now run both batch and streaming simultaneously, but the senior move is knowing which workload belongs where. The follow-up question is always about reprocessing: "If you find a bug in your streaming pipeline, how do you reprocess the last 3 months?" If you built Kappa, that replay might be 10-100x slower than a batch Spark job over Parquet files. Mentioning that tradeoff unprompted is the signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Design a backfill strategy that won't corrupt live data
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; You discover a bug that corrupted 3 days of data in a production table. Design a reprocessing strategy that fixes the historical data without impacting current pipeline runs or downstream consumers.&lt;/p&gt;

&lt;p&gt;Partition isolation. Backfill writes to a staging partition or shadow table, validated independently, then atomically swaps into production. Each backfill run must be idempotent; same inputs, same outputs, same partition target. The critical detail: your backfill code path must be the &lt;em&gt;same&lt;/em&gt; code path as live ingestion. If reprocessing uses a different path, you risk double-counting or schema mismatches between the two.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Backfill is the maturity test, not an afterthought. Bad backfills have corrupted months of analytics and destroyed user trust. Airflow has a documented race condition in HA mode where &lt;code&gt;max_active_runs=1&lt;/code&gt; can still allow concurrent DAG runs when run count exceeds 500. The interviewer is also checking resource awareness: a backfill DAG that consumes all pool slots starves your critical production DAGs. That's not misconfiguration; that's Airflow's FIFO scheduling by design.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Where do you place data quality checks in a pipeline?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; You have a four-stage pipeline: ingest, transform, aggregate, publish. Where do you put quality assertions, and what happens when they fail?&lt;/p&gt;

&lt;p&gt;Blocking checks on critical columns (primary keys, join keys, non-null constraints) at the ingest layer. Stop bad data at the door. Non-blocking warnings on distribution anomalies (row counts, value ranges, cardinality shifts) at the transform and aggregate layers. Never block a pipeline on a soft anomaly; log it, alert on it, investigate later. Hard failures stop publish; soft warnings don't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Organizations average 67 data incidents per month, with 68% requiring 4+ hours to detect. The conventional answer is "check at every layer," but interviews now penalize over-instrumentation. 66% of teams can't keep pace with alert volume, and engagement drops 15% once a channel receives more than 50 alerts per week. The senior answer is fewer, higher-confidence alerts. Target less than 10% false positive rate. Rules above 50% false positive rate are candidates for deletion, even if it means temporary blind spots.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. How do you handle late-arriving data?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Events arrive 2-48 hours after their event timestamp. Your aggregation pipeline runs daily. How do you ensure late arrivals are reflected accurately?&lt;/p&gt;

&lt;p&gt;Separate ingestion windows from processing windows. Write late-arriving records into the partition matching their &lt;em&gt;event time&lt;/em&gt;, not their &lt;em&gt;arrival time&lt;/em&gt;. Use a watermark (a threshold for how late you'll accept data) and reprocess affected partitions when late data lands. The reprocessing must be idempotent: overwrite the partition, don't append.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This exposes whether you understand the difference between event time and processing time at an architectural level. The follow-up is always: "What if your watermark is 48 hours but a record arrives 72 hours late?" The answer isn't "drop it"; it's "route it to a late-arrival queue, reprocess the affected partition on the next run, and alert if the volume is anomalous."&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Design a pipeline with fan-out/fan-in dependencies
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; You have one source that feeds 8 independent transformations, and all 8 must complete before a final aggregation step runs. One of the 8 is consistently 5x slower than the others. How do you design this?&lt;/p&gt;

&lt;p&gt;The slow branch determines your pipeline's clock. Options: optimize the slow task, break it into parallelizable sub-tasks, or (if the slow branch's output is independent enough) decouple it into a separate pipeline with its own SLA. The aggregation step either waits for all 8 or publishes a partial result with a flag indicating incomplete data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Many candidates optimize individual task latency and miss the parallel fan-out bottleneck entirely. One slow downstream branch blocks the fan-in. This is where &lt;strong&gt;architecture&lt;/strong&gt; discipline beats framework knowledge. The follow-up: "What if the slow branch fails? Do you retry, skip, or block?" Each answer reveals different assumptions about data completeness guarantees.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Explain the tradeoffs between Lambda and Kappa architecture
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; When would you choose Lambda over Kappa, and vice versa?&lt;/p&gt;

&lt;p&gt;Kappa (single streaming codebase, replay from the log) is simpler to maintain but brutal for large-scale reprocessing. Lambda (separate batch and speed layers) duplicates code but gives you a batch safety net. Kappa is the mainstream default now; Uber, LinkedIn, Shopify, and Disney run Kappa-style architectures. But Lambda's safety guarantees outweigh Kappa's elegance when reprocessing 2+ years of events is a regular need, because replaying through a streaming engine is often slower and more expensive than running a batch Spark job over Parquet files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Interviewers now assume you know Kappa and ask &lt;em&gt;when Lambda is justified&lt;/em&gt;, not the reverse. Candidates who cite Kappa as the "modern" choice without discussing reprocessing costs haven't run a terabyte-scale backfill. LinkedIn abandoned Lambda explicitly to reduce codebase duplication, but that tradeoff only makes sense if your replay path is fast enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. How do you prevent alert fatigue in pipeline monitoring?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Your team monitors 200 pipelines. Engineers are ignoring alerts. How do you fix this?&lt;/p&gt;

&lt;p&gt;Classify alerts into hard failures (stop publish, page someone) and soft warnings (investigate during business hours). Set a false positive target below 10% and an alert-to-incident conversion rate above 20%. Dynamic thresholds tuned on historical data reduce alert noise by 40-60% in the first month compared to static thresholds. Ruthlessly delete rules with a false positive rate above 50%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Alert fatigue kills more pipelines than lack of monitoring. The production reality is inverted from what textbooks teach: the problem isn't too few alerts, it's too many. Gartner benchmarks show $12.9M per year in organizational losses from poor data quality, and a huge chunk of that is real issues buried under noise that nobody looked at.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. How do you enforce schema contracts across teams?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Three producer teams send data to your pipeline. How do you prevent breaking changes from reaching production?&lt;/p&gt;

&lt;p&gt;Schema registry with compatibility mode (backward, forward, or full) enforced in CI. Producers can't merge a PR that breaks compatibility. Pair this with a deprecation window: fields marked deprecated get a 90-day sunset, consumers are notified, and removal only happens after the window closes. The contract isn't just schema; it's schema plus field semantics plus nullability plus ownership.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Most data contract tools don't actually enforce contracts; they validate them after the fact, with a 35-40% false-negative rate. The senior answer distinguishes detection from enforcement. Detection alerts you after bad data lands. Enforcement blocks the write before it happens. The Open Data Contract Standard (ODCS v3.1, shipped December 2025) is the reference spec.&lt;/p&gt;

&lt;h2&gt;
  
  
  11. Design for exactly-once semantics in a payment processing pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Payment events flow through Kafka into a ledger system. Duplicate charges are unacceptable. How do you guarantee exactly-once processing?&lt;/p&gt;

&lt;p&gt;This is the one case where at-least-once with idempotent sinks isn't enough. Use Kafka transactions (atomic read-process-write) combined with a database UNIQUE constraint on the idempotency key. The idempotency key must be derived from the business event (transaction ID), never from processing metadata. The critical edge case: two concurrent identical requests can both pass the dedup check if you're using a boolean flag. You need an atomic lock; database UNIQUE constraint or Redis SET NX with expiry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; The interviewer is testing whether you know &lt;em&gt;when&lt;/em&gt; exactly-once is worth the overhead. The answer: money, inventory, anything with legal or financial consequences. The follow-up is always the concurrent request race condition. Stripe uses atomic INSERT ... ON CONFLICT as the canonical pattern, and there's a reason: it's the only approach that handles concurrent duplicates correctly at the database level.&lt;/p&gt;

&lt;h2&gt;
  
  
  12. Your pipeline silently dropped 40% of records for six months. How do you find out, and how do you prevent it?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; No alerts fired. Dashboards still loaded. Stakeholders noticed the numbers "looked low" but didn't escalate. Walk through detection and prevention.&lt;/p&gt;

&lt;p&gt;Detection: row-count reconciliation between source and target at every stage, run on a schedule independent of the pipeline itself. Statistical anomaly detection on output volumes (not just "is it zero," but "is it within 2 standard deviations of the trailing 30-day average"). Prevention: publish data quality metrics as a first-class output of the pipeline, visible to stakeholders, not buried in engineering dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Non-idempotent pipelines fail loudly. Almost-idempotent pipelines fail silently. The dashboards still load, the counts look "reasonable," but the numbers are wrong. 73% of teams can detect pipeline failures but have zero visibility into root cause. This question tests whether you've been burned by the silent failure and built the guardrails afterward, or whether you're still designing for the happy path.&lt;/p&gt;

&lt;p&gt;, -&lt;/p&gt;

&lt;p&gt;These 12 &lt;strong&gt;data pipeline interview questions&lt;/strong&gt; cover the territory where senior DE loops are won and lost. The pattern across all of them: the interviewer isn't looking for the "correct" architecture. They're looking for evidence that you've shipped something, watched it break, and fixed it under pressure. The tools change every 18 months. Schema drift, late-arriving data, upstream teams breaking contracts without telling you: those are eternal.&lt;/p&gt;

&lt;p&gt;What's the pipeline architecture question you've been asked that isn't on this list? Drop it in the comments.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>architecture</category>
      <category>interview</category>
      <category>career</category>
    </item>
    <item>
      <title>The 12 Data Modeling Interview Questions that Matter</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Wed, 17 Jun 2026 10:04:25 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/datadriven/top-12-data-modeling-interview-questions-with-answers-9ja</link>
      <guid>https://dev.clauneck.workers.dev/datadriven/top-12-data-modeling-interview-questions-with-answers-9ja</guid>
      <description>&lt;p&gt;I've watched candidates with 8 years of experience go blank when asked to define the grain of a fact table. Not because they're bad engineers; because nobody told them that data modeling is the actual filter. SQL problems test syntax. System design tests memorization. Data modeling tests whether you can think. That's why it's the section that separates senior from staff, and why interviewers keep leaning on it harder every cycle. AI can spit out a medium LeetCode solution in seconds; it still can't explain why your grain decision breaks downstream aggregates.&lt;/p&gt;

&lt;p&gt;These 12 problems are the ones I've seen repeatedly across FAANG and late-stage startup loops. They cover &lt;strong&gt;star schema&lt;/strong&gt; design, &lt;strong&gt;dimensional modeling&lt;/strong&gt; tradeoffs, SCDs, late-arriving data, and the classification calls that trip up even experienced candidates.&lt;/p&gt;

&lt;p&gt;Want to practice these for real? &lt;a href="https://datadriven.io/data-modeling-interview-questions" rel="noopener noreferrer"&gt;Solve these problems live here&lt;/a&gt; with a real editor and graded solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Define the Grain of a Fact Table
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; You're building an analytics warehouse for a ride-sharing company. Before designing any tables, state the grain of the core fact table. What does one row represent?&lt;/p&gt;

&lt;p&gt;The answer is: one row per completed trip. Not per driver. Not per day. One row per atomic trip event, keyed by &lt;code&gt;trip_id&lt;/code&gt;, with foreign keys to &lt;code&gt;dim_driver&lt;/code&gt;, &lt;code&gt;dim_rider&lt;/code&gt;, &lt;code&gt;dim_pickup_location&lt;/code&gt;, &lt;code&gt;dim_dropoff_location&lt;/code&gt;, and &lt;code&gt;dim_date&lt;/code&gt;. Measures include &lt;code&gt;fare_amount&lt;/code&gt;, &lt;code&gt;tip_amount&lt;/code&gt;, &lt;code&gt;trip_duration_seconds&lt;/code&gt;, &lt;code&gt;trip_distance_miles&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Grain is the single most important decision in dimensional modeling. Candidates who jump into drawing tables without stating "one row represents X" are already drifting. Undefined grain causes silent metric inflation, duplicate rows, and join explosions that don't throw errors; they just produce wrong numbers. Interviewers test this first because everything downstream depends on it. The follow-up is always: "What happens when a trip has multiple stops?" If your grain assumed single-destination trips, you just broke your own schema.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Star Schema vs. Snowflake Schema
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Your team is building a new warehouse on Snowflake (the product). A junior engineer proposes snowflake schema (the design pattern) to save storage. Do you agree? Why or why not?&lt;/p&gt;

&lt;p&gt;You don't agree. &lt;strong&gt;Star schema&lt;/strong&gt; is the default for modern columnar warehouses. Snowflake, BigQuery, and Redshift compress denormalized dimensions so efficiently that snowflaking (normalizing dimensions into sub-tables) rarely saves meaningful storage anymore. The engineering overhead of maintaining normalized dimension hierarchies exceeds the storage cost of duplication. Star is the safe opening position in any interview.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Picking snowflake first signals junior thinking. The economics killed the normalization argument around 2024. Interviewers aren't testing whether you know both patterns exist; they're testing whether you can reason about the tradeoff. The follow-up: "When &lt;em&gt;would&lt;/em&gt; you normalize a dimension?" The answer is when the dimension is enormous and changes frequently (millions of rows, daily updates), making the redundant writes expensive. That's rare.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Design a Fact Table for E-Commerce Orders
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Design a star schema for an e-commerce platform. The business needs to track orders at the line-item level for revenue analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_order_line_item&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_line_item_sk&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;            &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;degenerate&lt;/span&gt; &lt;span class="n"&gt;dimension&lt;/span&gt;
    &lt;span class="n"&gt;product_sk&lt;/span&gt;          &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_product&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_sk&lt;/span&gt;         &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_sk&lt;/span&gt;             &lt;span class="nb"&gt;INT&lt;/span&gt;    &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt;            &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;unit_price&lt;/span&gt;          &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;discount_amount&lt;/span&gt;     &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;line_total&lt;/span&gt;          &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Grain: one row per line item per order. &lt;code&gt;order_id&lt;/code&gt; is a &lt;strong&gt;degenerate dimension&lt;/strong&gt;; it lives in the fact table because it has no descriptive attributes worth a separate table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This tests three things at once. Can you declare grain (line item, not order)? Do you know what a degenerate dimension is? And do you put the right measures in the fact table? Candidates who model at the order grain lose the ability to analyze product-level revenue without restructuring. You can always aggregate up from line items to orders; you can never disaggregate back down.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. SCD Type 2 Implementation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; A customer changes their address. How do you model &lt;code&gt;dim_customer&lt;/code&gt; to preserve the old address for historical reporting?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SCD Type 2&lt;/strong&gt;: insert a new row with a new surrogate key, set &lt;code&gt;effective_date&lt;/code&gt; and &lt;code&gt;expiration_date&lt;/code&gt; on both rows, and flag the current row with &lt;code&gt;is_current = TRUE&lt;/code&gt;. The original row stays intact; historical fact rows still join to the old address via the old surrogate key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="k"&gt;Before&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;
&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;customer_sk&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;101&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Jane'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Austin'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;effective&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expiration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'9999-12-31'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_current&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;TRUE&lt;/span&gt;

&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="k"&gt;After&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;close&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;expiration_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-06-16'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_sk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;101&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;102&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Jane'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Denver'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2026-06-17'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'9999-12-31'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; SCD2 is a separator question. Juniors describe it from the textbook. Seniors bring up the trap: &lt;strong&gt;SCD2 row explosion&lt;/strong&gt;. A dimension with 10M rows tracking frequently changing attributes can balloon to 150M rows in five years. The follow-up is always: "When would you use Type 1 instead?" Answer: when the business doesn't need the history. A corrected typo in a customer name doesn't warrant a new historical row. Type 1 overwrites are often correct, despite Type 2's prestige.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Late-Arriving Dimensions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; An order fact arrives, but the customer who placed it hasn't been loaded into &lt;code&gt;dim_customer&lt;/code&gt; yet. What do you do?&lt;/p&gt;

&lt;p&gt;Insert a placeholder row in &lt;code&gt;dim_customer&lt;/code&gt; with a surrogate key and all descriptive columns set to "Unknown" or null. The fact row joins to this placeholder. When the real customer data arrives, you overwrite the placeholder via Type 1 update.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Late-arriving dimensions and late-arriving facts are entirely different problems, and mixing them up is an instant red flag. This tests whether you understand that the fact table can't wait; it needs a foreign key now. The alternative (dropping the fact row until the dimension arrives) loses data. The follow-up: "What if the dimension arrives with changes?" Then you might need to apply SCD2 logic to the placeholder row, which gets complex fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Bridge Tables for Many-to-Many Relationships
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; A hospital system tracks patient diagnoses. One hospitalization can have multiple diagnoses, and one diagnosis applies to many hospitalizations. How do you model this?&lt;/p&gt;

&lt;p&gt;You use a &lt;strong&gt;bridge table&lt;/strong&gt;. Create a &lt;code&gt;diagnosis_group_key&lt;/code&gt; that maps to a set of diagnoses in &lt;code&gt;bridge_diagnosis&lt;/code&gt;. The fact table (&lt;code&gt;fact_hospitalization&lt;/code&gt;) joins to &lt;code&gt;diagnosis_group_key&lt;/code&gt;; the bridge table resolves each group to individual &lt;code&gt;dim_diagnosis&lt;/code&gt; rows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Many-to-many relationships in dimensional models are the source of the most dangerous bug in analytics: double-counting. Without a bridge table, a naive join between fact and dimension multiplies rows. Interviewers use this to test whether you understand the cardinality trap. The follow-up: "How do you handle weighting?" If a hospitalization has three diagnoses, does each get 1/3 of the revenue allocation? Bridge tables can carry a &lt;code&gt;weight_factor&lt;/code&gt; column for exactly this.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Fact vs. Dimension Classification
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; You have a column &lt;code&gt;customer_lifetime_revenue&lt;/code&gt;. Is it a fact or a dimension attribute?&lt;/p&gt;

&lt;p&gt;It's both, depending on usage. If you're summing it across rows, it's a fact. If you're banding it into ranges ("$0-$1K", "$1K-$10K") to filter or group by, it's a dimension attribute. Kimball calls this the &lt;strong&gt;aggregated-fact-as-attribute&lt;/strong&gt; pattern.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you would aggregate the column, it's a fact. If you would filter or group by it, it's a dimension. That's the whole test.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This exposes whether a candidate understands that the fact/dimension boundary isn't about data types. Numeric columns don't automatically belong in fact tables. The follow-up: "Where do you physically store it?" Usually in the dimension, banded into a descriptive range, with the raw number available as an additive fact if needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Factless Fact Tables
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; The business wants to know which products were NOT sold in each store last month. How do you model this?&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;factless fact table&lt;/strong&gt; (coverage table). One row per store per product per month, representing eligibility. To find products not sold, you subtract the sales fact table from the coverage table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Most candidates have never heard of factless fact tables. The name sounds like a contradiction. But they solve a real problem: you can't report on the absence of an event without first modeling what &lt;em&gt;could&lt;/em&gt; have happened. Student attendance, product availability, promotional eligibility; these all use the same pattern. The follow-up: "Isn't this just a cross join?" Yes, and that's the point. The cross join defines the universe; the anti-join finds the gaps.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Accumulating Snapshot Fact Table
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Model an order fulfillment pipeline with stages: ordered, packed, shipped, delivered.&lt;/p&gt;

&lt;p&gt;One row per order, with multiple date columns: &lt;code&gt;order_date_sk&lt;/code&gt;, &lt;code&gt;pack_date_sk&lt;/code&gt;, &lt;code&gt;ship_date_sk&lt;/code&gt;, &lt;code&gt;delivery_date_sk&lt;/code&gt;. The row gets updated as the order progresses through stages. Null dates indicate incomplete milestones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This is the "advanced grain" question. Most candidates know transaction facts and periodic snapshots; &lt;strong&gt;accumulating snapshots&lt;/strong&gt; trip them up because the row mutates. The fact table updates in place, which feels wrong if you've been taught that fact tables are append-only. Insurance claims, hiring workflows, procurement cycles; all use this pattern. The follow-up: "How do you handle an order that skips a stage?" That's a null in the milestone column, and your reporting logic needs to handle it.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Conformed Dimensions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Sales and marketing each have their own &lt;code&gt;dim_customer&lt;/code&gt; table with different definitions. What's the risk, and how do you fix it?&lt;/p&gt;

&lt;p&gt;The risk is the CEO gets two different customer counts. &lt;strong&gt;Conformed dimensions&lt;/strong&gt; are shared across fact tables and business units, with identical keys, attributes, and definitions. You build one &lt;code&gt;dim_customer&lt;/code&gt;, owned by a central data team, and both domains join to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This tests organizational thinking, not just schema design. Split-brain dimensions are how companies end up with "which number is right?" meetings. The follow-up: "What if the two teams need different attributes?" Add them to the same dimension. A wide dimension with 50 columns that both teams trust is better than two narrow dimensions that contradict each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  11. Normalization vs. Denormalization for Analytics
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; When would you choose a normalized (3NF) model over a denormalized star schema in an analytics warehouse?&lt;/p&gt;

&lt;p&gt;Almost never for the presentation layer. Denormalized schemas achieve 20 to 100x faster query performance on complex analytics workloads by eliminating joins. BigQuery benchmarks show 49% average improvement with fully denormalized tables compared to star schemas. But the staging layer should stay normalized. 3NF in staging preserves flexibility; when requirements change, you can rematerialize the presentation layer without remodeling the entire pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; The real answer is "both, in different layers." Organizations run 3NF in source systems, normalize in staging for integrity, and denormalize in the presentation layer for speed. Candidates who pick one paradigm for the entire warehouse reveal they've never dealt with a schema migration. The follow-up: "What about high-cardinality many-to-many relationships?" Don't denormalize those. A customer/orders/products grain creates explosive row multiplication.&lt;/p&gt;

&lt;h2&gt;
  
  
  12. Late-Arriving Facts and Backfills
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Your daily pipeline processes orders by &lt;code&gt;processing_date&lt;/code&gt;. An order from 10 days ago arrives today. How does your pipeline handle it?&lt;/p&gt;

&lt;p&gt;Partition by &lt;code&gt;event_time&lt;/code&gt; (when the order was placed), not &lt;code&gt;processing_time&lt;/code&gt; (when it arrived). Keep a rolling recompute window open; reprocess the last 14 days on every run. This auto-reconciles normal late arrivals without manual intervention. For data outside the window, run an explicit backfill job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Late data isn't a failure mode; it's the normal case. Most production systems expect 10 to 20% of daily volume to arrive delayed. Candidates who say "drop anything older than 7 days" have never worked on a pipeline that finance depends on. The follow-up: "What if the late fact needs to join to a dimension that has since changed (SCD2)?" You join to the dimension version that was active at event time, not processing time. That's the whole point of surrogate keys and effective dates.&lt;/p&gt;

&lt;p&gt;, -&lt;/p&gt;

&lt;p&gt;Data modeling questions keep showing up because they're the one thing AI can't fake for you. An LLM will produce a schema. It won't explain why that grain breaks when requirements shift, or defend the denormalization when the interviewer pushes back. If you want structured reps on these exact patterns, i used DataDriven for data modeling interview questions and it was the most efficient prep I found for this category.&lt;/p&gt;

&lt;p&gt;Which &lt;strong&gt;data modeling&lt;/strong&gt; interview question would you add to this list? I'm curious what y'all are seeing in loops right now.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>interview</category>
      <category>sql</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Top 12 Spark Interview Problems for Data Engineers, With Answers</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Tue, 16 Jun 2026 10:08:49 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/datadriven/top-12-spark-interview-problems-for-data-engineers-with-answers-4h0e</link>
      <guid>https://dev.clauneck.workers.dev/datadriven/top-12-spark-interview-problems-for-data-engineers-with-answers-4h0e</guid>
      <description>&lt;p&gt;I've been on both sides of the Spark interview table more times than I'd like to admit. The pattern is always the same: candidates can write a &lt;code&gt;groupBy().agg()&lt;/code&gt; in their sleep, but the moment you ask them &lt;em&gt;why&lt;/em&gt; their job spills to disk or &lt;em&gt;where&lt;/em&gt; the shuffle happens in a query plan, things fall apart. Spark interview questions aren't about syntax. They're about whether you understand execution. That's what separates a senior data engineer from someone who copy-pastes PySpark from Stack Overflow.&lt;/p&gt;

&lt;p&gt;These 12 problems are the ones I've seen surface repeatedly in senior DE loops. They cover shuffle, skew, joins, memory, caching, and the optimizer. If you can answer all 12 cold, you're ready. If you can't, now you know where to grind. datadriven.io is great for spark practice if you want reps beyond what's here.&lt;/p&gt;

&lt;p&gt;Want to practice these for real? &lt;a href="https://datadriven.io/tools/spark-interview-questions" rel="noopener noreferrer"&gt;Solve these problems live here&lt;/a&gt; with a real editor and graded solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Identify the Shuffle
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given the following PySpark code, identify which transformations trigger a shuffle (stage boundary) and which do not. Explain why.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://data/events/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1
&lt;/span&gt;&lt;span class="n"&gt;filtered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purchase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2
&lt;/span&gt;&lt;span class="n"&gt;projected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filtered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3
&lt;/span&gt;&lt;span class="n"&gt;grouped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;projected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4
&lt;/span&gt;&lt;span class="n"&gt;sorted_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sum(amount)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;sorted_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Steps 1 and 2 are &lt;strong&gt;narrow transformations&lt;/strong&gt;: filter and select each operate on one input partition with zero data movement. Steps 3 and 4 are &lt;strong&gt;wide transformations&lt;/strong&gt;: &lt;code&gt;groupBy&lt;/code&gt; redistributes rows by key across executors, and &lt;code&gt;orderBy&lt;/code&gt; requires a global sort, both crossing stage boundaries. The trap is that candidates say "joins are wide" as a blanket rule but forget that &lt;code&gt;orderBy&lt;/code&gt; is also wide. The follow-up interviewers push on: "How many stages does this job have?" Answer: three. One for the read+filter+select, one for the groupBy, one for the orderBy. Each shuffle is a stage boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Tune Shuffle Partitions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Your Spark job processes 50 GB of data. The default &lt;code&gt;spark.sql.shuffle.partitions&lt;/code&gt; is 200. What's wrong, and what value would you set instead? Show the config change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.shuffle.partitions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;400&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 50 GB / 400 partitions = 125 MB per partition (under the 128 MB target)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; The &lt;strong&gt;default 200 shuffle partitions&lt;/strong&gt; is cargo-cult tuning. It originated from empirical tests on specific cluster sizes and was never meant as a universal constant. At 200 partitions, 50 GB means 250 MB per partition, which exceeds the recommended 128 MB target and risks executor spill. The fix is simple division: target size under 128 MB, so 400 partitions puts you at 125 MB each. Retuning shuffle partitions to match actual data volume can deliver roughly 40% performance improvement without changing cluster size. Interviewers probe whether you can do this math on the spot or whether you just memorize "set it to 200."&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Force a Broadcast Join
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Table &lt;code&gt;orders&lt;/code&gt; has 500 million rows. Table &lt;code&gt;regions&lt;/code&gt; has 5,000 rows (2 MB). Write a join that avoids shuffle entirely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;broadcast&lt;/span&gt;

&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://data/orders/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;regions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://data/regions/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;regions&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;regions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; The &lt;strong&gt;broadcast join threshold&lt;/strong&gt; defaults to 10 MB (&lt;code&gt;spark.sql.autoBroadcastJoinThreshold&lt;/code&gt;). Since &lt;code&gt;regions&lt;/code&gt; is 2 MB, Spark would auto-broadcast it here. But the real interview signal is: do you know that a broadcast join is a &lt;em&gt;narrow&lt;/em&gt; transformation despite being a join? No shuffle occurs; Spark sends the small table to every executor. Most candidates conflate "join" with "wide transformation," and this is now a standard trick question on Databricks loops. The follow-up: "What happens if you force a broadcast hint on a 2 GB table?" Answer: it collects the data to the driver and likely causes an OOM error. The hard limit for broadcast is roughly 8 GB.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Handle Data Skew with Salting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Your join on &lt;code&gt;customer_id&lt;/code&gt; takes 4 hours because one customer has 50 million rows while the median is 500. AQE is disabled (legacy Spark 2.4 cluster). Fix the skew.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;explode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;concat&lt;/span&gt;

&lt;span class="n"&gt;SALT_BUCKETS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;

&lt;span class="c1"&gt;# Salt the large (skewed) side
&lt;/span&gt;&lt;span class="n"&gt;orders_salted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;SALT_BUCKETS&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;join_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Replicate the small side across all salt values
&lt;/span&gt;&lt;span class="n"&gt;salt_range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SALT_BUCKETS&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;customers_replicated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;crossJoin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salt_range&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;join_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders_salted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customers_replicated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;join_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; A single key with 50M+ rows causes &lt;code&gt;collect_list()&lt;/code&gt; or join aggregations to build massive data structures in one executor. Classic OOM. Salt-based skew handling (append random integer to the hot key, replicate the small side to match) is the standard production pattern tested in senior data engineering loops. The trap: candidates who jump straight to salting without first asking "is AQE enabled?" reveal they don't check what Spark already does for them. On Spark 3.2+, AQE's &lt;code&gt;spark.sql.adaptive.skewJoin.enabled&lt;/code&gt; splits skewed partitions automatically. The salting question specifically tests whether you understand the &lt;em&gt;mechanism&lt;/em&gt; AQE automates.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Predict AQE Behavior
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; AQE is enabled (Spark 3.2+). You have a sort-merge join where the left side is 200 GB and the right side ends up being 8 MB after filters. What does AQE do at runtime, and what config controls this?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.adaptive.enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.autoBroadcastJoinThreshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10485760&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 10 MB
&lt;/span&gt;
&lt;span class="c1"&gt;# AQE will convert the sort-merge join to a broadcast join at runtime
# because the right side (8 MB) is below the broadcast threshold
# after shuffle statistics reveal the actual size
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; &lt;strong&gt;Adaptive Query Execution&lt;/strong&gt; dynamically reoptimizes at runtime using shuffle statistics. It does three things: coalesces small post-shuffle partitions, switches sort-merge joins to broadcast when one side proves small, and splits skewed partitions. The key insight is that Catalyst's static plan might choose sort-merge because it can't predict the right side will shrink to 8 MB after filters. AQE waits for the shuffle output, sees 8 MB, and switches strategies mid-execution. The follow-up that catches candidates: "Does AQE work with Structured Streaming?" No. AQE requires materialization points (pausing after a shuffle to analyze stats), which streaming doesn't have.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Read a Catalyst Plan
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; You run &lt;code&gt;df.explain(True)&lt;/code&gt; and see that a filter on &lt;code&gt;status = 'active'&lt;/code&gt; appears &lt;em&gt;before&lt;/em&gt; the scan in the optimized logical plan, even though you wrote it after the join in your code. What happened?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Your code order:
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;active&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# But explain(True) shows filter pushed down before the join
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; &lt;strong&gt;Catalyst's 4-stage pipeline&lt;/strong&gt; (analysis, logical optimization, physical planning, code generation) applies &lt;strong&gt;predicate pushdown&lt;/strong&gt; during logical optimization. It moves filters as early as possible to reduce data volume before expensive operations like joins. Only stage 3 (physical planning) is cost-based; stages 1, 2, and 4 are rule-based. The trap: candidates who break transformation chains with intermediate actions (like &lt;code&gt;cache()&lt;/code&gt; or &lt;code&gt;count()&lt;/code&gt;) prevent Catalyst from seeing the full chain, which disables these optimizations. More transformations in a single chain give Spark more opportunities to optimize, not fewer. That's counterintuitive, and it's exactly what interviewers are testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Window Function Partition Skew
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Your window function OOMs on one executor while 63 others sit idle. Diagnose and fix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.window&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;

&lt;span class="c1"&gt;# BAD: user_id "bot_account" has 90% of rows
&lt;/span&gt;&lt;span class="n"&gt;window_spec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_spec&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# FIX: composite partition key spreads load
&lt;/span&gt;&lt;span class="n"&gt;window_spec_fixed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_ranked_fixed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_spec_fixed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Window functions don't intrinsically cause OOM. Data skew does. A &lt;code&gt;partitionBy("user_id")&lt;/code&gt; where one user has 90% of the rows concentrates that entire partition on one executor. The window function itself is blameless. This shifts the interview from "how to fix the window" to "how to detect and rebalance upstream data." The follow-up: &lt;code&gt;row_number()&lt;/code&gt; without a tiebreaker column in &lt;code&gt;orderBy&lt;/code&gt; produces nondeterministic results across executor restarts. If your tiebreaker isn't unique, the ranking is undefined. Silent data bugs.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Cache Storage Level Trade-off
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; You have a DataFrame used three times in your pipeline. It involves an expensive join. Choose the right storage level and explain why &lt;code&gt;cache()&lt;/code&gt; alone isn't enough.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StorageLevel&lt;/span&gt;

&lt;span class="n"&gt;expensive_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;regions&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# cache() is lazy; it does nothing until an action fires
&lt;/span&gt;&lt;span class="n"&gt;expensive_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StorageLevel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MEMORY_AND_DISK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;expensive_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# triggers the actual caching
&lt;/span&gt;
&lt;span class="c1"&gt;# ... use expensive_df three more times ...
&lt;/span&gt;
&lt;span class="n"&gt;expensive_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unpersist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# release memory when done
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; &lt;code&gt;cache()&lt;/code&gt; is a &lt;strong&gt;lazy transformation, not an action&lt;/strong&gt;. Calling &lt;code&gt;df.cache()&lt;/code&gt; does nothing until you trigger an action like &lt;code&gt;count()&lt;/code&gt;. This is the single most common Spark caching mistake. Beyond that: RDD &lt;code&gt;cache()&lt;/code&gt; defaults to &lt;code&gt;MEMORY_ONLY&lt;/code&gt;, but DataFrame &lt;code&gt;cache()&lt;/code&gt; defaults to &lt;code&gt;MEMORY_AND_DISK&lt;/code&gt;. When a partition doesn't fit in memory, disk I/O is often faster than recomputing an expensive join from scratch. &lt;code&gt;MEMORY_ONLY&lt;/code&gt; is only correct if you've confirmed the dataset fits entirely in executor memory. The follow-up that separates senior from staff: "When does caching hurt performance?" Answer: intermediate &lt;code&gt;cache()&lt;/code&gt; calls break the transformation chain, preventing Catalyst from reordering expensive operations. Cache only when the DataFrame is used multiple times &lt;em&gt;and&lt;/em&gt; the recomputation cost exceeds the serialization cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Diagnose an Executor OOM
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Your executor crashes with &lt;code&gt;java.lang.OutOfMemoryError: Java heap space&lt;/code&gt;. Executor memory is set to 8 GB. The job processes 100 GB with 800 shuffle partitions. Walk through your debugging steps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.executor.memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8g&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.executor.memoryOverhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2g&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;# 20% for off-heap
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.memory.fraction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# 60% for unified memory
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.memory.storageFraction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# 50/50 split
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.shuffle.partitions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;800&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# 125 MB target
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.executor.cores&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                   &lt;span class="c1"&gt;# 5 cores per executor
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Spark reserves a hardcoded 300 MB from executor heap for internal bookkeeping. Of the remaining 7.7 GB, 60% (4.6 GB) goes to &lt;strong&gt;Unified Memory&lt;/strong&gt;, split between execution (shuffles, joins, sorts) and storage (cached data). With 5 cores per executor, each task gets roughly 920 MB of unified memory. At 125 MB per partition, that's fine for most operations, but a skewed partition or a large broadcast variable blows past it. The counterintuitive trap: "just add more memory" is backwards. A 32 GB heap with routine GC can pause for 74 seconds out of 120 seconds of execution (61% pause time). The real lever is partition count and memory fraction tuning, not heap size. Interviewers love this because it inverts junior intuitions.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Join Type Selection Under Constraints
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Table A is 300 GB. Table B is 500 MB. You have 10 executors with 8 GB each. Which join strategy, and why not the other two?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 500 MB exceeds default 10 MB threshold, so increase it
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.autoBroadcastJoinThreshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;536870912&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 512 MB
&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table_a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_b&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Three options: broadcast, sort-merge, shuffle-hash. Sort-merge shuffles both sides (300 GB + 500 MB over the network). Shuffle-hash shuffles both sides and builds hash tables. Broadcast sends 500 MB to each executor (5 GB total network, no shuffle of the 300 GB side). With 8 GB per executor, 500 MB fits comfortably. The default 10 MB threshold is "extremely conservative" per production guidance; real workloads typically increase it to 100-500 MB. The follow-up that catches people: "What if Table B is 500 MB but has a type mismatch on the join key?" A &lt;code&gt;string&lt;/code&gt; joined to an &lt;code&gt;int&lt;/code&gt; silently produces empty results. Seasoned engineers check types before blaming join strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  11. Eliminate an Unnecessary Shuffle
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; You have two window functions over the same DataFrame with the same &lt;code&gt;partitionBy&lt;/code&gt; but different &lt;code&gt;orderBy&lt;/code&gt;. How many shuffles happen, and can you reduce it?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.window&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;spark_sum&lt;/span&gt;

&lt;span class="n"&gt;w1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dept_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;w2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dept_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hire_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Two windows, same partitionBy, different orderBy: one shuffle, two sorts
&lt;/span&gt;&lt;span class="n"&gt;df_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary_rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tenure_rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Spark only shuffles on &lt;code&gt;partitionBy&lt;/code&gt; boundaries. Since both windows partition by &lt;code&gt;dept_id&lt;/code&gt;, there's one shuffle (one repartition by &lt;code&gt;dept_id&lt;/code&gt;) followed by two local sorts within each partition. Candidates who say "two windows = two shuffles" reveal they don't understand stage boundaries. The follow-up: if you change w2 to &lt;code&gt;partitionBy("region_id")&lt;/code&gt;, now you get two shuffles because the partition keys differ. Reducing shuffle stages is underrated in interviews; a query with fewer stage boundaries can outperform a multi-stage query by orders of magnitude because stage synchronization and shuffle materialization are the true bottleneck, not the partition count.&lt;/p&gt;

&lt;h2&gt;
  
  
  12. AQE Partition Coalescing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; After a &lt;code&gt;groupBy&lt;/code&gt;, your job produces 200 shuffle partitions, but only 15 have data (the rest are empty). AQE is enabled. What happens, and what config controls it?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.adaptive.enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.adaptive.coalescePartitions.enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.adaptive.advisoryPartitionSizeInBytes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;134217728&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 128 MB
&lt;/span&gt;
&lt;span class="c1"&gt;# AQE merges 185 empty/tiny partitions into ~15 real ones
# reducing task scheduling overhead dramatically
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;spark_sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;revenue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; 185 empty partitions means 185 tasks that do nothing but consume scheduling overhead. AQE's partition coalescing merges small post-shuffle outputs into fewer, right-sized partitions. But here's the critical misconception: &lt;strong&gt;AQE can only reduce partitions, never increase them.&lt;/strong&gt; If you start with too few partitions (say, 10 for 100 GB of data), AQE sits idle while your executors OOM on 10 GB partitions. You still need to set &lt;code&gt;spark.sql.shuffle.partitions&lt;/code&gt; high enough for AQE to have material to coalesce. TPC-DS benchmarks showed up to 8x speedup on specific queries with AQE, but only when the initial partition count gave AQE room to work.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The tools change every 18 months. The problems don't change. Shuffle, skew, memory, upstream teams breaking contracts without telling you. These are eternal. Learn the concepts; the syntax is the easy part.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which spark interview problem would you add to this list? I'm always curious which scenario questions are showing up in loops right now.&lt;/p&gt;

</description>
      <category>spark</category>
      <category>bigdata</category>
      <category>dataengineering</category>
      <category>interview</category>
    </item>
    <item>
      <title>Top 12 SQL Interview Problems for Data Engineers, With Answers</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Sat, 13 Jun 2026 20:04:22 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/datadriven/top-12-sql-interview-problems-for-data-engineers-with-answers-4698</link>
      <guid>https://dev.clauneck.workers.dev/datadriven/top-12-sql-interview-problems-for-data-engineers-with-answers-4698</guid>
      <description>&lt;p&gt;I've been on both sides of the SQL interview table somewhere around 20 times in a single job search. The pattern is always the same: candidates know SELECT, they know JOIN, they can write a WHERE clause. Then the interviewer asks them to deduplicate a table, find missing rows, or sessionize an event stream, and the wheels come off. Not because the syntax is hard. Because candidates don't understand what each row represents before they start writing.&lt;/p&gt;

&lt;p&gt;32% of data engineering interview questions test GROUP BY. INNER JOIN sits at 29%, PARTITION BY at 21%, ROW_NUMBER at 15%. These twelve problems cover roughly 80% of what interviewers actually ask. Master them and you're ahead of most candidates walking into a live coding round.&lt;/p&gt;

&lt;p&gt;Want to practice these for real? &lt;a href="https://datadriven.io/sql-interview-questions" rel="noopener noreferrer"&gt;Solve these problems live here&lt;/a&gt; with a real editor and graded solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Customers Who Spent Over $500
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given an &lt;code&gt;orders&lt;/code&gt; table with columns &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;order_date&lt;/code&gt;, and &lt;code&gt;amount&lt;/code&gt;, find all customers whose total spend exceeds $500. Return &lt;code&gt;customer_id&lt;/code&gt; and &lt;code&gt;total_spent&lt;/code&gt;, sorted by total descending.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_spent&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_spent&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; 70% of candidates confuse WHERE and HAVING. SQL's execution order runs FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY. You cannot filter on an aggregate in WHERE because the aggregate doesn't exist yet. Putting &lt;code&gt;SUM(amount) &amp;gt; 500&lt;/code&gt; in WHERE is a parse error in PostgreSQL. The interviewer isn't testing whether you know HAVING exists; they're testing whether you understand &lt;em&gt;when&lt;/em&gt; each clause runs. If you can state the execution order aloud before writing, you'll never make this mistake.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The Double-Counting Trap
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given &lt;code&gt;orders(order_id, customer_id, order_date)&lt;/code&gt; and &lt;code&gt;order_items(item_id, order_id, product_id, quantity, price)&lt;/code&gt;, find each customer's total revenue and total number of orders. An order can have multiple items.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;order_totals&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_revenue&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;num_orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;order_totals&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; 80% of candidates fail the aggregation problem at Meta &lt;strong&gt;data engineering&lt;/strong&gt; interviews, typically due to double-counting. If you join orders to order_items and then do &lt;code&gt;COUNT(DISTINCT order_id)&lt;/code&gt; alongside &lt;code&gt;SUM(quantity * price)&lt;/code&gt; in one pass, the count might look right but you've still joined at the wrong grain. The fix is CTEs: aggregate at each grain level separately, then join the results. Strong candidates use CTEs defensively. Weaker ones skip them and wonder why the revenue number is 3x too high.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Latest Record Per Customer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given a &lt;code&gt;customer_updates&lt;/code&gt; table with &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;updated_at&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;, and &lt;code&gt;status&lt;/code&gt;, return only the most recent row per customer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
               &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
               &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
           &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customer_updates&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This pattern shows up in 80% of analyst and DE interview loops. The trap is candidates who reach for &lt;code&gt;GROUP BY customer_id&lt;/code&gt; with &lt;code&gt;MAX(updated_at)&lt;/code&gt;, then realize they can't retrieve the email and status columns without aggregating them too. &lt;strong&gt;Window functions&lt;/strong&gt; preserve the full row while computing per-group statistics, which is exactly why interviewers frame this as "try GROUP BY first, then reframe." The follow-up they'll push: "What happens when two rows have the same &lt;code&gt;updated_at&lt;/code&gt;?" Your ROW_NUMBER is non-deterministic without a tiebreaker. Add a secondary sort column. Always.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Deterministic Deduplication With Tie-Breaking
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Same table as above, but now two updates can arrive at the exact same timestamp. Write a query that always returns the same row for a given customer, regardless of physical storage order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
               &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
               &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;ctid&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Postgres&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;use&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;surrogate&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;sequence&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;
           &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customer_updates&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Most candidates write the ROW_NUMBER query from Problem 3 and stop. Real production deduplication adds a secondary sort: &lt;code&gt;ORDER BY updated_at DESC, source_sequence_number DESC&lt;/code&gt; to ensure the same row wins every run. Without it, two executions on different database instances can return different results. Interviewers probe this when your first answer looks clean. The lesson: ROW_NUMBER is deterministic in its assignment but non-deterministic in its &lt;em&gt;choice&lt;/em&gt; when ties exist in the ORDER BY. Name the tiebreaker or admit the query is unstable.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Customers With No Orders (Anti-Join)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given &lt;code&gt;customers(customer_id, name)&lt;/code&gt; and &lt;code&gt;orders(order_id, customer_id, order_date)&lt;/code&gt;, find all customers who have never placed an order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Anti-joins expose whether you understand missingness. The LEFT JOIN plus IS NULL pattern finds rows in A that have no match in B. Interviewers test this because funnel leaks hide in production analytics; "which users signed up but never converted?" is exactly this query. The follow-up is always: "Would NOT EXISTS work?" Yes, and it's semantically safer when join keys might be nullable. Modern query planners treat both identically, so the choice is about clarity, not performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. NOT IN vs NOT EXISTS With NULLs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given the same tables, explain why this query returns zero rows when the &lt;code&gt;orders&lt;/code&gt; table contains a NULL &lt;code&gt;customer_id&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;This&lt;/span&gt; &lt;span class="n"&gt;silently&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="n"&gt;ZERO&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;any&lt;/span&gt; &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="n"&gt;has&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;Safe&lt;/span&gt; &lt;span class="k"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;handles&lt;/span&gt; &lt;span class="n"&gt;NULLs&lt;/span&gt; &lt;span class="n"&gt;correctly&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This is a silent killer in &lt;strong&gt;sql interview&lt;/strong&gt; rounds. When a subquery contains even one NULL, NOT IN returns zero rows because every comparison against NULL evaluates to UNKNOWN. The query runs without error. It returns a result set. The result set is wrong. Candidates who default to NOT EXISTS signal they've been burned by this in production. Candidates who reach for NOT IN signal they haven't.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Recursive CTE Org Chart
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given &lt;code&gt;employees(employee_id, name, manager_id)&lt;/code&gt;, find all employees (direct and indirect reports) under employee_id = 1. Return employee_id, name, and depth level.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;employee_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;

    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; JPMorgan explicitly asks this. Recursive CTEs separate candidates who understand anchor members, recursive members, and termination conditions from those who guess. The recursive step stops when it returns zero rows; if your data has circular references (employee A reports to B, B reports to A), you get an infinite loop. Mention this unprompted and you signal production experience. Ironically, recursive CTEs are vanishingly rare in production code because most hierarchies are shallow (3 to 5 levels) and materialized paths are faster. Interviewers test it as a proxy for "can you think recursively?" not "will you ship this?"&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Self-Join for Consecutive-Day Logins
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given &lt;code&gt;logins(user_id, login_date)&lt;/code&gt; (deduplicated to one row per user per day), find users who logged in on at least 3 consecutive days. Return user_id and the start date of each streak.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;islands&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;login_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;login_date&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
               &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;login_date&lt;/span&gt;
           &lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;grp&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;logins&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;login_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;streak_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;streak_length&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;islands&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grp&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This is the classic &lt;strong&gt;gaps and islands&lt;/strong&gt; problem disguised as a product question. The ROW_NUMBER minus date trick works because consecutive dates produce the same constant when you subtract a sequential integer. Non-consecutive dates break the constant, creating a new group. Interviewers love it because candidates who haven't seen the pattern struggle to invent it under pressure, and candidates who have seen it still need to explain &lt;em&gt;why&lt;/em&gt; it works. The follow-up: "What if login_date has duplicates?" Your entire approach breaks. Deduplicate first or use DENSE_RANK instead of ROW_NUMBER.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. ROW_NUMBER vs RANK vs DENSE_RANK
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given &lt;code&gt;sales(salesperson_id, region, revenue)&lt;/code&gt;, rank each salesperson within their region by revenue. Show how results differ across all three ranking functions when two salespeople tie.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;salesperson_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;       &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rnk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;drnk&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;Ties&lt;/span&gt; &lt;span class="k"&gt;at&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt; &lt;span class="n"&gt;gives&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arbitrary&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;RANK&lt;/span&gt; &lt;span class="n"&gt;gives&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="n"&gt;skips&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt; &lt;span class="n"&gt;gives&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; 80% of &lt;strong&gt;data engineering&lt;/strong&gt; interviews include &lt;strong&gt;window functions&lt;/strong&gt; questions, yet candidates report feeling least prepared for them. The distinction is semantic, not performance; all three execute at similar speeds in modern query planners. ROW_NUMBER for arbitrary tie-breaking (deduplication). RANK when gaps after ties are acceptable (competition scoring). DENSE_RANK when you need continuous ranks. A real incident: a product ranking dashboard showed positions 1,2,3,4,5,6,6,8 because RANK skipped 7 after a tie. Leadership spent half a day troubleshooting before anyone realized the function was &lt;em&gt;correct&lt;/em&gt; but the business needed DENSE_RANK. The choice isn't about data; it's about what the business question actually needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Running Total With Correct Window Frame
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given &lt;code&gt;transactions(account_id, txn_date, amount)&lt;/code&gt;, compute a running balance per account ordered by transaction date. Then explain why LAST_VALUE often returns unexpected results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;txn_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
           &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;account_id&lt;/span&gt;
           &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;txn_date&lt;/span&gt;
           &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="n"&gt;UNBOUNDED&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CURRENT&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;
       &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;running_balance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; The default window frame when ORDER BY is present is &lt;code&gt;RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW&lt;/code&gt;, which treats ties as a group. ROWS treats each row individually. For running totals this distinction matters when two transactions share the same date. The buried edge case: LAST_VALUE without &lt;code&gt;ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING&lt;/code&gt; returns the current row, not the last row in the partition. 70% of first attempts get this wrong. Interviewers test frame specifications because they separate candidates who memorized &lt;code&gt;OVER(PARTITION BY...)&lt;/code&gt; from those who actually understand how window boundaries work.&lt;/p&gt;

&lt;h2&gt;
  
  
  11. Find the Gap in a Sequence
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given &lt;code&gt;seat_reservations(seat_number)&lt;/code&gt; containing occupied seats (integers, not necessarily consecutive), find all gaps (unreserved ranges). Return &lt;code&gt;gap_start&lt;/code&gt; and &lt;code&gt;gap_end&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;boundaries&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;seat_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;LEAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seat_number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;seat_number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;next_seat&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;seat_reservations&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;seat_number&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;gap_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;next_seat&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;    &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;gap_end&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;boundaries&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;next_seat&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;seat_number&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Gaps and islands questions slow candidates down because they don't look like standard "join and group by" prompts. The business story sounds custom; the right answer takes a couple of structured steps. LEAD() compares each row to its successor; any difference greater than 1 is a gap. The interviewer will push on edge cases: what about the gap before the first seat? What about the gap after the last? Those require knowing the full domain (e.g., seats 1 through 100) and adding boundary rows. Pattern recognition, not syntax, is the bottleneck.&lt;/p&gt;

&lt;h2&gt;
  
  
  12. Sessionize an Event Stream
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given &lt;code&gt;page_views(user_id, event_time, page_url)&lt;/code&gt;, assign a session_id to each event. A new session starts when the gap between consecutive events for the same user exceeds 30 minutes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;flagged&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;event_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
               &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event_time&lt;/span&gt;
           &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'30 minutes'&lt;/span&gt;
           &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;new_session&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;page_views&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;sessioned&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
               &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event_time&lt;/span&gt;
           &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;flagged&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sessioned&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This is the hardest pattern most &lt;strong&gt;sql interview&lt;/strong&gt; rounds test. It combines LAG for gap detection, CASE for flag generation, and cumulative SUM for group assignment; three window functions in one query. The first event per user has a NULL lag, which the CASE treats as 0 (no new session), which is correct because it belongs to session 0. If you don't think through that NULL explicitly, your session IDs are off by one. Interviewers also ask: "What if the 30-minute threshold should vary by user?" Now you need a join to a config table inside the CASE expression. That's where it gets real.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The tools change every 18 months. The problems don't change. Schema drift, late-arriving data, upstream teams breaking contracts without telling you. These are eternal. And so is SQL.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These twelve problems aren't a random sample. They're the patterns I've seen repeated across FAANG loops, fintech screens, and startup onsites for years. The syntax is the easy part; the hard part is understanding grain, execution order, and what happens when your data has NULLs, ties, or gaps you didn't expect. If you can solve all twelve cold, you're solid. If you want structured reps with graded feedback, datadriven.io is good for sql interview questions like these.&lt;/p&gt;

&lt;p&gt;What's the one SQL problem you'd add to this list? The pattern that tripped you up in a real interview that none of the prep guides cover?&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>interview</category>
      <category>career</category>
    </item>
    <item>
      <title>Top 12 Python Interview Problems for Data Engineers, With Answers</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Thu, 11 Jun 2026 10:04:22 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/datadriven/top-12-python-interview-problems-for-data-engineers-with-answers-54mh</link>
      <guid>https://dev.clauneck.workers.dev/datadriven/top-12-python-interview-problems-for-data-engineers-with-answers-54mh</guid>
      <description>&lt;p&gt;I've sat on both sides of the &lt;strong&gt;python interview&lt;/strong&gt; table more times than I can count. Somewhere around 20+ loops in a single job search. Here's what I've learned: the problems that actually separate candidates in &lt;strong&gt;data engineering&lt;/strong&gt; screens aren't LeetCode hards. They're medium-difficulty Python problems reframed around pipeline logic, messy data, and memory trade-offs.&lt;/p&gt;

&lt;p&gt;The Python portion of a DE loop is fundamentally different from a backend SWE loop. It's a pipeline-correctness loop. You're not being asked if you can write the code. You're being asked if you'd approve this code at 3 AM on a pipeline you own. These 12 &lt;strong&gt;coding interview&lt;/strong&gt; problems cover the patterns I've seen repeatedly across FAANG, Databricks, Stripe, and mid-market companies. Every one tests a concept that transfers across tools.&lt;/p&gt;

&lt;p&gt;Want to practice these for real? &lt;a href="https://datadriven.io/python-interview-questions" rel="noopener noreferrer"&gt;Solve these problems live here&lt;/a&gt; with a real editor and graded solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Parse Log Lines Into Structured Records
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given a list of log strings in the format &lt;code&gt;"2026-06-11 14:23:01 ERROR db_connection timeout after 30s"&lt;/code&gt;, parse each line into a dict with keys &lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;level&lt;/code&gt;, and &lt;code&gt;message&lt;/code&gt;. Some lines are malformed (missing fields, extra whitespace). Return only valid records.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;VALID_LEVELS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DEBUG&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;INFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;WARNING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ERROR&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRITICAL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;VALID_LEVELS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time_str&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;level&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; String parsing is the gatekeeper problem in DE screens. Interviewers hand you messy input on purpose. The trap is coding the happy path and ignoring malformed lines. Quarantining bad rows instead of crashing is the single most senior-signaling habit you can show. The follow-up is always: "What if this file is 10GB?" Answer: swap the list for a generator. Which brings us to...&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Stream a Large File With a Generator
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Rewrite the log parser above to handle a 10GB file using O(1) memory. Accept a file path, yield one parsed record at a time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DEBUG&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;INFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;WARNING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ERROR&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRITICAL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time_str&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;level&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; For 100 million integers, a list consumes ~411 MB while a generator uses 208 bytes. That's a 1.97 million-fold difference. Interviewers don't just want you to know that; they want you to reach for it instinctively. The gotcha they'll push: "What happens if you iterate this generator twice?" It's exhausted after one pass. If you need multiple passes, you need to recreate it or cache results. Knowing &lt;em&gt;when not&lt;/em&gt; to use a generator is what separates memorization from understanding.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Count Event Frequencies
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given a list of event strings &lt;code&gt;["click", "view", "click", "purchase", "view", "view"]&lt;/code&gt;, return the K most frequent events and their counts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;top_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;most_common&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This is three lines. Most candidates write fifteen. &lt;code&gt;Counter.most_common(k)&lt;/code&gt; uses a heap internally, giving you O(n + k log n) without building one yourself. The real test is whether you know the API exists. The follow-up: "What's the space complexity of your frequency map?" Most people say O(n). It's O(u) where u is unique elements. If your input is user click types, u is tiny; if it's session IDs, u could be enormous. That distinction matters in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Deduplicate Records by Composite Key
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given a list of event dicts with keys &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;event_type&lt;/code&gt;, and &lt;code&gt;timestamp&lt;/code&gt;, deduplicate by &lt;code&gt;(user_id, event_type)&lt;/code&gt;, keeping only the record with the latest timestamp.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dedup_latest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;latest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;latest&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This is one of the five recurring &lt;strong&gt;data engineering interview&lt;/strong&gt; patterns. It tests whether you think in terms of grain. Every pipeline you'll ever build has a grain, and dedup logic follows from it. The follow-up: "What if there are 100M events?" At that scale, storing all unique keys in a dict might blow memory. Mentioning a Bloom filter for approximate dedup or partitioning by key range signals systems thinking.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Top K Most Frequent User IDs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given a stream of event dicts, each with a &lt;code&gt;user_id&lt;/code&gt; field, return the K users with the most events. Do it in better than O(n log n).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;heapq&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;top_k_users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;heapq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nlargest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Naive sorting is O(n log n). A heap gives you O(n log k). When k is 10 and n is 50 million, that's not academic; it's the difference between your job finishing and your job timing out. But here's the real test: candidates who can recite "use a min-heap for top K" often can't explain &lt;em&gt;why&lt;/em&gt; it beats sorting when k is small. Pattern memorization is a liability if you can't reason through the trade-off when the interviewer changes k to n/2.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Multi-Key Sort
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Sort a list of task dicts by &lt;code&gt;priority&lt;/code&gt; (ascending) then by &lt;code&gt;created_at&lt;/code&gt; (descending) for ties.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sort_tasks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;priority&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Python's &lt;code&gt;sorted()&lt;/code&gt; is stable and accepts tuple keys. This is the standard pattern for rank aggregation in &lt;strong&gt;interview questions&lt;/strong&gt; and in production. The negative sign trick only works on numeric fields; for strings, you'd need a second &lt;code&gt;sorted()&lt;/code&gt; call leveraging stability. Interviewers push: "What if &lt;code&gt;created_at&lt;/code&gt; is an ISO string?" Then you parse it or reverse the sort with a two-pass approach. Knowing that Python's sort is stable isn't trivia; it's what makes multi-key sorting work without a custom comparator.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Sliding Window: Maximum Throughput
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given a list of timestamps (sorted integers representing seconds), find the maximum number of events in any 60-second window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;max_throughput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timestamps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;max_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timestamps&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;timestamps&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;timestamps&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;max_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;max_count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This is O(n) instead of the O(n^2) brute force. Sliding window isn't just an algorithm pattern; it's how you monitor pipeline throughput, detect burst traffic, and build alerting logic. The follow-up is usually about sessionization: "Now group events into sessions where any gap &amp;gt; 30 minutes starts a new session." Same pointer mechanics, different business logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Group and Aggregate With Plain Python
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given records with &lt;code&gt;department&lt;/code&gt; and &lt;code&gt;salary&lt;/code&gt;, compute the average salary per department without pandas.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;avg_salary_by_dept&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;totals&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;totals&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;totals&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dept&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dept&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;totals&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Candidates who reach for pandas in a vanilla Python round are telling the interviewer they can't think without a framework. The &lt;code&gt;defaultdict&lt;/code&gt; pattern here is the same logic under the hood of &lt;code&gt;GROUP BY&lt;/code&gt;. And the most common failure mode in pandas groupby? Using &lt;code&gt;apply()&lt;/code&gt; instead of &lt;code&gt;agg()&lt;/code&gt;, which is 10-100x slower. Interviewers want to see that you understand the mechanics, not just the API.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Flatten Nested JSON
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given a nested dict like &lt;code&gt;{"a": 1, "b": {"c": 2, "d": {"e": 3}}}&lt;/code&gt;, flatten it to &lt;code&gt;{"a": 1, "b.c": 2, "b.d.e": 3}&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;flatten_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;flatten_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Every data engineer has dealt with APIs that return arbitrarily nested JSON. Real Databricks scenarios demand handling inconsistent fields, shifting data types, and partial records. The recursive approach is clean but blows the stack on deeply nested docs. The follow-up: "Make this iterative." That's where you use a stack. Bonus points if you mention that production systems should schema-validate before flattening, not after.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Merge Two Sorted Iterators
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given two sorted iterators of integers, merge them into a single sorted iterator without loading either into memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;merge_sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iter_b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter_b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
            &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
            &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter_b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
        &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter_b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This is the merge step of merge sort, but it's also how you combine two sorted log files, two time-series feeds, or two partitions of a distributed pipeline. O(n + m) time, O(1) memory. The iterator-in, iterator-out pattern is exactly what coaching platforms mean when they say "rewrite every problem to accept an iterator and return an iterator." If your first instinct is &lt;code&gt;sorted(list(a) + list(b))&lt;/code&gt;, you just loaded both files into memory. That's a red flag.&lt;/p&gt;

&lt;h2&gt;
  
  
  11. Validate and Quarantine Bad Rows
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Given a CSV-like list of row strings with the header &lt;code&gt;"name,age,email"&lt;/code&gt;, parse each row, validate that &lt;code&gt;age&lt;/code&gt; is a positive integer and &lt;code&gt;email&lt;/code&gt; contains &lt;code&gt;@&lt;/code&gt;, and return two lists: valid records and quarantined rows with the reason.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_rows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quarantined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;quarantined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wrong field count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;AssertionError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;quarantined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;invalid age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;quarantined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;invalid email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quarantined&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This is the problem that exposes whether you've shipped production pipelines. A candidate who writes a five-line happy-path parser and says "done" hasn't touched real CSV files. Quoted commas, multi-line fields, non-standard date formats; these aren't gotchas, they're the job. The follow-up: "Should bad rows halt the pipeline or get logged?" The answer is always logged. Dead-letter queues exist for a reason.&lt;/p&gt;

&lt;h2&gt;
  
  
  12. Chunk a File for Parallel Processing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Write a generator that reads a file in chunks of N lines, yielding each chunk as a list. The last chunk may be smaller than N.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;
                &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This is &lt;code&gt;pandas.read_csv(chunksize=N)&lt;/code&gt; from scratch. Interviewers want to see that you understand the mechanics underneath the convenience function. It's also the foundation for fan-out parallelism: yield chunks, distribute to workers. The follow-up hits the real complexity: "What if records span multiple lines?" Now you need a stateful parser, and the problem stops being about chunking and starts being about protocol awareness.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The actual job is less "write a DAG" and more "figure out why this pipeline silently dropped 2M rows last Tuesday." Nobody interviews for that. They interview for Python and SQL. These 12 problems are the closest the &lt;strong&gt;coding interview&lt;/strong&gt; gets to the real thing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every one of these problems tests a concept, not a tool. Data modeling, memory trade-offs, grain awareness, error handling as architecture. The syntax is the easy part. If you can explain &lt;em&gt;why&lt;/em&gt; you chose a generator over a list, &lt;em&gt;why&lt;/em&gt; a heap beats a sort for small K, &lt;em&gt;why&lt;/em&gt; you quarantine instead of crash; you're demonstrating the thinking that gets you hired.&lt;/p&gt;

&lt;p&gt;What problem would you add to this list? I'm curious what patterns are showing up in your loops that aren't covered here.&lt;/p&gt;

</description>
      <category>python</category>
      <category>interview</category>
      <category>career</category>
      <category>programming</category>
    </item>
    <item>
      <title>80% of DE Candidates Use AI on Take-Homes. Companies Can't Stop It.</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Tue, 09 Jun 2026 10:05:55 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/datadriven/80-of-de-candidates-use-ai-on-take-homes-companies-cant-stop-it-4fo2</link>
      <guid>https://dev.clauneck.workers.dev/datadriven/80-of-de-candidates-use-ai-on-take-homes-companies-cant-stop-it-4fo2</guid>
      <description>&lt;p&gt;I've been on both sides of the hiring table for data engineering roles. I've given take-homes, graded take-homes, argued with other panelists about take-homes, and done my share of them as a candidate. So when I tell you the entire system is broken in a way nobody wants to talk about honestly, I'm not theorizing. I watched it happen in real time.&lt;/p&gt;

&lt;p&gt;Here's the situation: 64% of companies now prohibit AI tools in technical interviews. Meanwhile, 35% of candidates are using LLMs anyway, up from 15% just six months prior. In purely technical roles, that number climbs to 48%. And 61% of those candidates pass the approval threshold and advance without anyone noticing. The ban exists on paper. In practice, it's a suggestion that penalizes the people who follow it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Candidate Tax
&lt;/h2&gt;

&lt;p&gt;This is the part that actually pisses me off. If you're a data engineering candidate who follows the rules, who sits down with your take-home and writes your own SQL, builds your own pipeline, tests your own edge cases, you are now competing against people whose submissions were polished by an LLM in a fraction of the time. And the hiring team cannot tell the difference.&lt;/p&gt;

&lt;p&gt;Cheaters have a roughly 3:1 pass rate advantage. That's not a guess; that's from Fabric's analysis of 19,368 interviews between July 2025 and January 2026. Candidates using AI tools scored above the 7.0 approval threshold 61% of the time. The honest candidates? They're producing slower, rougher, less polished work. Because that's what real human output looks like when you're solving an unfamiliar problem under time pressure.&lt;/p&gt;

&lt;p&gt;It gets worse. Take-home assignments have ballooned. What used to be a 2 to 3 hour exercise is now routinely 10 to 20 hours of unpaid work. Full pipeline implementations, data modeling, documentation, testing, presentations. At that scope, using AI isn't just tempting; it's economically rational. You're asking someone to do a part-time job for free and then punishing them for using the most efficient tool available.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The 20-hour take-home created the cheating incentive. Companies shifted from live coding to extended take-homes to "reduce bias" and inadvertently built the perfect environment for undetectable AI assistance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;83% of candidates say they would use AI if they could get away with it. I'm honestly surprised the number is that low. The game theory here is a textbook prisoner's dilemma: if you assume your competition is cheating (and statistically, they are), following the rules is the losing move. Genuine candidates report feeling forced to cheat because they assume everyone else already is.&lt;/p&gt;

&lt;p&gt;And the detection? It's theater. Some platforms claim 93% accuracy analyzing keystroke patterns and tab-switching behavior. But invisible overlay tools like Cluely and Interview Coder now render answers using DirectX and Metal at the OS level, completely invisible to screen sharing. A second device listening to interview audio works just as well. The detection arms race is over before it started.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ban That Nobody Can Enforce
&lt;/h2&gt;

&lt;p&gt;Here's the double standard that makes this whole thing absurd: 64% of organizations using AI in HR apply it to recruiting and interviewing on their end. They're screening your resume with AI, generating interview questions with AI, scoring your responses with AI. But you, the candidate? You're banned from using AI. Because integrity.&lt;/p&gt;

&lt;p&gt;Amazon explicitly disqualifies candidates caught using AI. Goldman Sachs told campus recruits they "must not use ChatGPT, Google, or any external AI assistance." Noble policies. Zero enforcement mechanism. Neither company has a reliable way to detect it. Enforcement depends on candidates self-reporting or failing live follow-ups.&lt;/p&gt;

&lt;p&gt;71% of engineering leaders admit AI makes technical skills harder to assess. And yet 62% still prohibit it despite acknowledging they cannot detect violations. This isn't a policy; it's a prayer.&lt;/p&gt;

&lt;p&gt;The detection tools themselves are worse than useless. AI detectors bundled into platforms like Turnitin and GPTZero are, by multiple 2026 analyses, "increasingly wrong" because candidates can prompt an LLM to generate novel solutions that plagiarism software flags as original work (because they are). False positive rates range from 1% to 30% depending on the tool. So you've got honest candidates getting flagged for coincidental code similarity while actual cheaters using invisible overlays sail through. The system protects liars better than truth-tellers.&lt;/p&gt;

&lt;p&gt;The core problem isn't that AI is too good. It's that the problem is unsolvable at scale. A candidate can prompt GPT-4 to generate novel, non-plagiarized code for any assignment, and no static analysis can distinguish it from original work without access to the candidate's reasoning process. The only scalable detection is process visibility: pair programming, timestamped drafts, in-person walkthroughs. And companies resist all of those because they don't scale cheaply.&lt;/p&gt;

&lt;p&gt;One company's response, when shown data that 80% of their take-home submissions used LLMs? They decided to ignore the cheating and just move top performers to the next round. That's not a hiring process. That's capitulation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Companies, Three Opposite Bets on the Future
&lt;/h2&gt;

&lt;p&gt;The industry hasn't converged on a solution. It's fractured into at least three incompatible approaches, and if you're job hunting in data engineering right now, you need to understand all of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI-required camp.&lt;/strong&gt; Meta launched AI-enabled interviews in October 2025. Candidates work in CoderPad with access to GPT-4o, Claude, Gemini, or Llama. They're evaluated on AI fluency, prompt engineering, output validation, and debugging. The company plans to expand this to all backend and ops roles in 2026. Canva went further: they replaced their entire "Computer Science Fundamentals" interview with "AI-Assisted Coding" for backend, ML, and frontend roles. Candidates must use Copilot, Cursor, or Claude. The problems are designed so they can't be solved with a single prompt; they require iterative thinking and judgment.&lt;/p&gt;

&lt;p&gt;The signal these companies are hiring for isn't "can you code without help." It's "can you direct AI correctly, catch its mistakes, and defend every architectural decision." Candidates who passed these rounds weren't better prompters. They knew what to build, caught what the AI got wrong, and could explain why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ban-and-hope camp.&lt;/strong&gt; Amazon and Goldman Sachs sit here. Explicit prohibition, no reliable detection, trust-based enforcement. Less than 30% of companies that ban AI have actually retrained their interviewers to spot it. The policy exists to provide legal cover, not to change outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hybrid camp.&lt;/strong&gt; 41% of companies now pair a take-home with a synchronous defense session. You do the work at home (with whatever tools you actually use), then you sit down with an engineer for 30 minutes and explain it. This is where LLM help evaporates. If you can't walk through your own solution, modify it on the fly, and handle edge cases in conversation, the take-home score doesn't matter. It's spreading as the unspoken standard because it's the only format that actually tests what companies care about.&lt;/p&gt;

&lt;p&gt;The red flags interviewers are learning to spot in those defense sessions: explanation-code mismatch (your spoken reasoning contradicts what you wrote), terminology beyond your demonstrated level (a junior suddenly discussing architectural patterns they can't elaborate on), and the tell-tale 3 to 5 second delay before every answer that suggests an overlay is generating responses in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Career Implications Nobody's Saying Out Loud
&lt;/h2&gt;

&lt;p&gt;Entry-level data engineering roles are getting hammered the hardest. Junior candidate cheating nearly tripled, from 15% to 40% year over year. And junior candidates have the lowest detection risk because interviewers expect less fluency from them. A senior engineer dropping suspiciously polished system design answers raises eyebrows. A junior producing clean code? That just looks like a strong candidate.&lt;/p&gt;

&lt;p&gt;This inverts the hiring funnel in a way that should terrify everyone. The most junior, least skilled cohort has the highest incentive to cheat and the best chance of getting away with it. They get hired. They can't do the job. The team absorbs the cost. And six months later, the same team posts the same req, runs the same broken process, and wonders why their pipeline keeps breaking.&lt;/p&gt;

&lt;p&gt;Here's where it lands for your career. If you're a candidate: the interview is a game. It has always been a game. AI didn't make it arbitrary; it was already arbitrary. DS&amp;amp;A has always been a mechanism to rank candidates, not an indicator of data engineering experience. What changed is the rules of the game, and right now nobody agrees on what the rules are. So you need to prepare for all three formats. Know your fundamentals cold; not because a take-home requires it, but because the 30-minute live defense does. That's where the real hiring decision happens now.&lt;/p&gt;

&lt;p&gt;If you're on a hiring panel: stop pretending your take-home ban is enforceable. It isn't. Either redesign around it (hybrid format, live defense, AI-collaborative sessions) or accept that you're selecting for candidates who are good at hiding AI use. That's a skill, sure. It's just not the one you think you're testing for.&lt;/p&gt;

&lt;p&gt;82% of data professionals now use AI tools daily. We're banning in interviews the exact workflow we expect on the job. At some point, the industry has to reconcile those two facts.&lt;/p&gt;

&lt;p&gt;The companies that figure this out first will hire the best engineers. The ones clinging to unenforceable bans will hire the best cheaters. Same resume, same score, very different outcome six months in.&lt;/p&gt;

&lt;p&gt;What's your read? If you're interviewing right now, are you using AI on take-homes, and do you think the hybrid format (take-home plus live defense) actually solves the problem, or just moves it?&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>interview</category>
      <category>career</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Your Data Engineering Take-Home Is Free Labor</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Thu, 04 Jun 2026 10:07:59 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/datadriven/your-data-engineering-take-home-is-free-labor-2eip</link>
      <guid>https://dev.clauneck.workers.dev/datadriven/your-data-engineering-take-home-is-free-labor-2eip</guid>
      <description>&lt;p&gt;I got sent a &lt;strong&gt;take-home assignment&lt;/strong&gt; last year that asked me to build an end-to-end pipeline: ingest from three APIs, transform in Python, load to a warehouse, write tests, document my design decisions, and prepare a 15-minute presentation for "the team." The recruiter said it should take "about four hours." I timed myself. It took fourteen. I didn't get the job. I didn't get feedback. I got a form rejection three weeks later.&lt;/p&gt;

&lt;p&gt;That's not an interview. That's a consulting engagement with a 0% billing rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scope Creep Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Take-homes weren't always like this. The original pitch was reasonable: instead of whiteboard hazing where you reverse a linked list while someone watches you sweat, you get to work in your own environment, at your own pace, on something resembling real work. That was the deal. A couple hours, a focused problem, maybe a short discussion afterward.&lt;/p&gt;

&lt;p&gt;Then companies got greedy.&lt;/p&gt;

&lt;p&gt;The recommended best practice is still 2 to 4 hours. Over 80% of survey respondents believe take-homes should cap at four hours. But candidates consistently report spending 5x the stated estimate. Companies will write "don't spend more than 3 hours on this" at the top of a prompt that includes building a working MVP, writing a README with architecture docs, recording a demo, adding unit and integration tests, and documenting your trade-offs. That's not a 3-hour task. That's a small freelance project.&lt;/p&gt;

&lt;p&gt;The scope expectations in 2026 are indistinguishable from paid contract work. One candidate reported being asked to create a 30-minute learning module with video, graphics, voiceover, and interactive elements. The estimated freelance market value? $2,800. For an "assessment."&lt;/p&gt;

&lt;p&gt;Here's what's actually happening: &lt;strong&gt;data engineering&lt;/strong&gt; take-homes have quietly evolved from "show us you can write SQL" to "build us a proof of concept we might actually use." And the line between those two things is the line between an interview and unpaid labor.&lt;/p&gt;

&lt;p&gt;58% of engineers believe they deserve payment for take-homes. Only 4% receive it. Read those numbers again.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If a candidate invests 15 hours with a 10% chance of advancing, the expected return per hour is zero. That's not an interview process; that's a lottery where you pay with your weekend.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Free Consulting in Disguise
&lt;/h2&gt;

&lt;p&gt;Let's talk about the part nobody wants to say out loud: some companies are using candidate submissions.&lt;/p&gt;

&lt;p&gt;Indeed's own &lt;strong&gt;hiring&lt;/strong&gt; research flags the concern directly: "Companies may even steal the ideas of candidates, use them, and not give credit or compensate the candidate." That's not a fringe take from a disgruntled Reddit poster. That's on a major job platform's hiring guide.&lt;/p&gt;

&lt;p&gt;The structural problem is simple. When you ask a data engineer to build a pipeline that ingests your actual data format, transforms it according to your actual business logic, and loads it into your actual warehouse schema, you've crossed the line from evaluation to extraction. The candidate doesn't know if their code will ship or be discarded. The company doesn't disclose what happens to submissions. The information asymmetry is total.&lt;/p&gt;

&lt;p&gt;And it gets worse. About 50% of job seekers strongly dislike take-homes and drop out entirely. But here's the paradox: Dropbox found that 20% of candidates abandoned their process before completing assignments, and the ones who dropped out were often the strongest candidates. They had competing offers. They had leverage. They didn't need to grind 15 hours for a maybe.&lt;/p&gt;

&lt;p&gt;So who finishes? Candidates without alternatives. The desperate. The junior engineers without competing offers. The people who can't afford to say no. The take-home isn't filtering for talent; it's filtering for availability.&lt;/p&gt;

&lt;p&gt;This is especially brutal for marginalized candidates. If you're working a second job, handling caregiving, or don't have reliable internet access, a 15-hour unpaid assignment isn't a minor inconvenience. It's a gate that has nothing to do with whether you can do the work. The equity fracture is real: unpaid take-homes create a class-based filter where financial stability determines who gets to compete.&lt;/p&gt;

&lt;p&gt;76% of recruiters say take-homes improve &lt;strong&gt;hiring&lt;/strong&gt; quality. Nobody surveyed the candidates to ask if they felt fairly assessed. Funny how that works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Red Flags Before You Even Open the Repo
&lt;/h2&gt;

&lt;p&gt;After doing somewhere around 20 interview loops in a single job search (some went well, some went laughably poorly), I've developed a pretty reliable radar for take-homes that are going to waste my time. Here's what I look for now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No time estimate at all.&lt;/strong&gt; If the prompt doesn't tell you how long they expect it to take, they either don't know or don't care. Both are bad. A company that can't scope a 3-hour exercise is telling you something about how they scope projects internally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The deliverables list is longer than the problem statement.&lt;/strong&gt; When the requirements section says "build a pipeline" but the deliverables section says "working code, tests, documentation, architecture diagram, trade-off analysis, recorded demo, and a 15-minute presentation," you're not being evaluated. You're being outsourced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The data looks suspiciously like their actual data.&lt;/strong&gt; Generic datasets (public APIs, sample CSVs) are fine. When the schema matches their product domain a little too closely, when the transformations feel like real business logic, that's not a coincidence. That's a proof of concept with plausible deniability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No compensation and no timeline.&lt;/strong&gt; The gold standard for a fair take-home is 90 minutes of focused work plus a 30-minute walkthrough. If they're asking for more than 4 hours and offering nothing in return, the math doesn't work in your favor. Labor experts agree: unpaid assignments exceeding 2 to 3 hours cross into territory where compensation is ethically and legally justified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They ghost after submission.&lt;/strong&gt; 8 in 10 hiring managers admit to ghosting candidates. If you're investing a full weekend into something, you deserve feedback. A company that can't write two paragraphs about why they passed on you after you wrote two thousand lines for them is telling you exactly how they'll treat you as an employee.&lt;/p&gt;

&lt;p&gt;The AI angle makes all of this worse, not better. Cheating rates on take-homes jumped from 15% to 35% in six months. Companies are responding not by making the process fairer, but by making it harder and longer. Classic arms race. The format's legitimacy is collapsing in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Push Back Without Burning the Bridge
&lt;/h2&gt;

&lt;p&gt;Here's what I've learned the hard way about &lt;strong&gt;career&lt;/strong&gt; management in the interview game: asking for scope clarity isn't weakness. It's the move that separates you from candidates who'll silently overcommit and resent it.&lt;/p&gt;

&lt;p&gt;Before you open your IDE, send this: "I want to make sure I'm aligned with your expectations. I estimate this will take X hours based on the requirements. Is that consistent with what you've seen from other candidates?" That's it. Professional, direct, and it forces them to put a number on the record.&lt;/p&gt;

&lt;p&gt;If the number they come back with is wildly different from your estimate, you have information. Either the scope is genuinely smaller than it looks (great, clarify what's optional) or they're lowballing the time estimate to avoid scaring you off (red flag).&lt;/p&gt;

&lt;p&gt;If the assignment exceeds 4 hours, it's completely reasonable to ask about compensation. Buffer runs 45-day paid trial projects. Webflow uses 3 to 5 day paid contract follow-ups. These aren't charity; a paid engagement costs the company less than a bad hire. If a company balks at paying for 8 hours of your time, they're telling you the ROI math doesn't work, which means they're sending this to dozens of candidates and hoping one sticks.&lt;/p&gt;

&lt;p&gt;And here's the contrarian take that took me years to internalize: if a company demands 20 hours for an initial screen and you walk away, you didn't lose an offer. You dodged a scope-creep culture. The real risk isn't declining; it's spending 20 hours and still losing to someone who had better chemistry in the 30-minute walkthrough. That's not career management. That's gambling with your time as the chips.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Fair Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;Fair take-homes exist. They're just rare.&lt;/p&gt;

&lt;p&gt;The format works when companies respect it: 90 minutes of focused work, a clear rubric, a 30-minute walkthrough to verify you wrote it and can explain it. Well-designed take-homes show 35% higher correlation with job performance compared to whiteboard interviews. The signal is real. The abuse is the problem, not the concept.&lt;/p&gt;

&lt;p&gt;Companies that compensate candidates and set clear requirements see 85%+ completion rates. The ones that don't? 60 to 70%. The data is screaming the answer. Pay people for their time, scope the work honestly, and the format is better than every alternative.&lt;/p&gt;

&lt;p&gt;900+ companies are listed on the Hiring Without Whiteboards repository, explicitly committing to fair evaluation. Some use pair programming on real codebases. Some use portfolio reviews. Some use short, focused take-homes with hard time caps. The alternatives exist. The industry just hasn't decided to care yet.&lt;/p&gt;

&lt;p&gt;The uncomfortable truth is that when hiring fundamentals are correct (clear rubrics, structured evaluation, diverse interviewers), the format almost doesn't matter. Take-home, live coding, pair programming; they all work when the process respects the candidate. They all fail when it doesn't.&lt;/p&gt;

&lt;p&gt;I've been on both sides of the &lt;strong&gt;interview&lt;/strong&gt; table. I've been the candidate grinding through a 14-hour take-home for a form rejection. I've been on hiring panels where we evaluated submissions in under 10 minutes that took candidates an entire weekend. Both of those experiences made me angry for the same reason: the asymmetry is the point, not the bug.&lt;/p&gt;

&lt;p&gt;So here's my question for anyone who's been through this recently: what's the most egregious take-home you've been asked to do, and did you finish it or walk away?&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>interview</category>
      <category>career</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Tech Lost 150K Jobs in 2026. Data Engineering Gained 414%.</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Tue, 19 May 2026 10:05:18 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/datadriven/tech-lost-150k-jobs-in-2026-data-engineering-gained-414-4p5n</link>
      <guid>https://dev.clauneck.workers.dev/datadriven/tech-lost-150k-jobs-in-2026-data-engineering-gained-414-4p5n</guid>
      <description>&lt;p&gt;I got laid off once. Not from a data engineering role; from an analytics-adjacent contracting gig that evaporated when budgets got cut. I spent exactly one week feeling sorry for myself, then I started grinding. That was years ago. Since then I've watched three separate waves of "tech is over" panic, sat through two recessions worth of hiring freezes, and somehow ended up at staff level building pipelines at companies you've definitely used. The pattern is always the same: broad panic, selective survival, and a very small number of people who read the room correctly and came out the other side making more money than before.&lt;/p&gt;

&lt;p&gt;2026 is that pattern again, except the signal is louder than it's ever been.&lt;/p&gt;

&lt;h2&gt;
  
  
  150,000 Jobs Gone. Data Engineering Didn't Flinch.
&lt;/h2&gt;

&lt;p&gt;The numbers are ugly. Over 150,000 tech jobs cut across 500+ companies in 2026. Q1 alone saw 52,050 &lt;strong&gt;layoffs&lt;/strong&gt;, a 40% jump over Q1 2025 and the worst first quarter since 2023. That's roughly 973 people per day losing their jobs. If you're in tech and you don't know someone who got hit, you're not paying attention.&lt;/p&gt;

&lt;p&gt;But here's the part nobody's talking about at happy hour: &lt;strong&gt;data engineering&lt;/strong&gt; is projected to grow 414% through 2030. Over 150,000 data engineers are currently employed, with 20,000+ new jobs created in the past year alone. The global data engineering services market hit $105 billion in 2026 and is growing at 15% annually.&lt;/p&gt;

&lt;p&gt;These two facts exist simultaneously. Massive contraction and massive expansion, in the same industry, at the same time.&lt;/p&gt;

&lt;p&gt;The displacement isn't random. Data analytics postings dropped 15.2% year over year. Broader tech postings fell 36%. But data engineering grew. Not "held steady." Grew. This isn't the whole boat rising; it's one lifeboat pulling away while the ship lists.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;40% of data teams expanded headcount in 2025 (up from 14% the year before), even as 41% reported negative budget impacts from economic pressures. They're not adding headcount for fun. They're replacing other roles with engineers who build infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the substitution nobody wants to name. Companies aren't growing data teams out of optimism. They're swapping analysts and BI developers for engineers who can build the plumbing that AI systems need to function. It's not growth; it's triage.&lt;/p&gt;

&lt;h2&gt;
  
  
  The GM Playbook: Fire IT, Hire Data Engineers
&lt;/h2&gt;

&lt;p&gt;If you want to see the pattern in action, look at GM. In May 2026, they laid off 600 salaried IT workers, roughly 10% of their IT department. Identity access management, platform security, software engineering teams. Gone.&lt;/p&gt;

&lt;p&gt;Then they immediately opened positions for data engineering, analytics, AI-native development, and cloud-based engineering.&lt;/p&gt;

&lt;p&gt;This isn't a contradiction. It's a skills swap. GM didn't cut costs and call it a day. They cut roles they decided AI could handle or that weren't generating direct value, then reinvested in the roles they believe are load-bearing for the next five years. Data engineers made that list. Traditional IT didn't.&lt;/p&gt;

&lt;p&gt;And GM isn't unique. The same pattern is playing out across the industry. Companies are discovering that 88% of their agentic AI pilots fail to reach production, not because the models are bad, but because the data infrastructure underneath them is a mess. Disconnected metadata catalogs, fragmented pipelines, schemas that nobody documented, cost optimization that nobody owns. Every failed AI pilot is a job posting for a data engineer.&lt;/p&gt;

&lt;p&gt;The quote I keep seeing in industry reports: "Most teams are &lt;strong&gt;hiring&lt;/strong&gt; data engineers to rebuild the plumbing: cleaner pipelines, faster ingestion, better monitoring, and datasets that can be trusted in production." That's the job. It's always been the job. Now there's a $105 billion market saying it out loud.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI Actually Automates (and What It Can't Touch)
&lt;/h2&gt;

&lt;p&gt;Here's where most people get the &lt;strong&gt;career&lt;/strong&gt; math wrong. They hear "AI is automating data engineering" and assume it's a uniform threat. It's not. The automation is extremely specific, and knowing which side of the line your skills sit on is the difference between a 414% growth curve and a pink slip.&lt;/p&gt;

&lt;p&gt;The numbers on automation rates tell the story clearly. Data quality checks: 70% automatable. ETL pipeline generation: 65%. Database optimization: 58%. Data warehouse and lake architecture: 38%.&lt;/p&gt;

&lt;p&gt;See the pattern? The further you move from "write this query" toward "design this system," the less AI can do. Boilerplate SQL generation? Gone. Figuring out why your pipeline silently dropped 2 million rows last Tuesday because an upstream team changed a schema without telling anyone? That's a human problem. It requires business context, institutional knowledge, and the ability to yell at the right Slack channel at 2am.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt; appears in 70% of 2026 data engineer postings. SQL dropped to 69%, down from 79% in 2025. That's not a typo. SQL, the language that defined data work for decades, is now less common in job postings than Python. The shift tells you exactly what companies are buying: less query-writing, more infrastructure architecture, more orchestration, more systems thinking.&lt;/p&gt;

&lt;p&gt;Gartner projects AI will reduce manual intervention in data engineering by 60%. Which sounds terrifying until you realize the 40% that's left is all the hard stuff. Capacity planning across regions. Schema migrations that touch compliance rules. Cost optimization decisions where the CFO doesn't accept "AI said so" as justification. The comfortable middle of data engineering is getting automated. What's left is the stuff that actually requires judgment.&lt;/p&gt;

&lt;p&gt;I've been through three waves of "data engineering is getting automated away." Still here. Still employed. Still debugging the same categories of problems. The tools change every 18 months. Schema drift, late-arriving data, upstream teams breaking contracts without telling you; these are eternal.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Two-Tier Market Is Already Here
&lt;/h2&gt;

&lt;p&gt;The split isn't coming. It's here. And it's creating two very different career trajectories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier one:&lt;/strong&gt; Entry-level ETL work, boilerplate transformations, basic pipeline assembly. This is automating at 65-70%. If your daily work is writing the same dbt models and Airflow DAGs without understanding why the pipeline exists or what business decision it feeds, you're on the wrong side of this line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier two:&lt;/strong&gt; Architecture, cost optimization, governance, production debugging, ML infrastructure. This is growing. Fast. Data engineers now spend 37% of their work hours on AI-related projects, up from 19% in 2023, projected to hit 61% by 2027. The role isn't shrinking; it's shifting upward.&lt;/p&gt;

&lt;p&gt;And the hiring market reflects this perfectly. 45% of data engineering postings now contain AI-related terms. CI/CD and DevOps appear in one out of every six postings. 26% of postings skip education requirements entirely; they don't care about your degree, they care about your production code samples.&lt;/p&gt;

&lt;p&gt;Here's what that means for your &lt;strong&gt;hiring&lt;/strong&gt; prospects. The companies with unfilled data engineering reqs sitting for 12-18 months aren't struggling because there aren't enough data engineers. They're struggling because there aren't enough data engineers with the right skills. It's a mismatch, not a shortage.&lt;/p&gt;

&lt;p&gt;The most underrated part of this: analytics engineers earn a median of $189,000 versus data engineers at $131,000, yet analytics engineering isn't projected for the same growth. Companies are overpaying for the title they think they need while undertesting for the skills they actually need. I've been on hiring panels where we tested pipeline architecture for an analytics engineer role and business context for a data engineering role. Backwards. Every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Survivors Are Doing Differently
&lt;/h2&gt;

&lt;p&gt;The people who come out of this cycle making $148,000 to $186,000 (the San Francisco range for data engineers right now) aren't the ones who learned one more tool. They're the ones who understood which problems compound.&lt;/p&gt;

&lt;p&gt;Concepts transfer across tools; tool knowledge doesn't transfer across concepts. I've been saying this for years and it's never been more true. The engineer who understands data modeling, query optimization, and why things break will learn whatever orchestrator the company uses in a week. The engineer who memorized Airflow's API but can't explain why a star schema might not be the right choice anymore (hint: the economics killed it) is going to have a harder time.&lt;/p&gt;

&lt;p&gt;The skill stack that's actually getting people hired: Python and SQL as baseline (still non-negotiable, even as SQL's dominance fades). Spark at 38.7% of postings. Cloud fluency, with AWS at 32% market share. And increasingly, AI literacy; not "build a transformer from scratch" but "understand how your pipelines feed ML systems and how to optimize that relationship."&lt;/p&gt;

&lt;p&gt;The real career insight hiding in all of this data: production infrastructure beats research. Every time. Data engineers earning $130K-$180K while data scientists struggle for roles reflects a truth the industry doesn't like admitting. The CFO cares about the pipeline that feeds the board deck, not the model that got 2% better accuracy on a benchmark nobody uses.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Junior engineers worry about which tool to learn. Senior engineers worry about which problems to solve. Staff engineers worry about which problems to prevent.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That hierarchy maps directly onto the automation curve. Tools get automated. Problems don't. Prevention definitely doesn't.&lt;/p&gt;

&lt;p&gt;I've watched people with 10 years of experience get downleveled because they couldn't articulate system design decisions under pressure. I've also watched people with non-traditional backgrounds land staff roles because they could explain exactly why their pipeline was designed the way it was and what it would cost to change it. The interview is a different skill than the job, but both skills reward the same thing: understanding the "why" behind the architecture, not just the "how" of the implementation.&lt;/p&gt;

&lt;p&gt;The 150,000 jobs that vanished in 2026 aren't coming back. The 414% growth curve in data engineering isn't slowing down. The gap between those two numbers is the entire story of tech employment right now. The question is just which side of that gap you're standing on.&lt;/p&gt;

&lt;p&gt;So: what's the one skill in your current stack that you're most worried AI is about to eat? And what are you replacing it with?&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>career</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>DSA Is Dying in DE Interviews. Nobody Agrees on What's Next.</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Thu, 14 May 2026 10:05:33 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/datadriven/dsa-is-dying-in-de-interviews-nobody-agrees-on-whats-next-56kh</link>
      <guid>https://dev.clauneck.workers.dev/datadriven/dsa-is-dying-in-de-interviews-nobody-agrees-on-whats-next-56kh</guid>
      <description>&lt;p&gt;I did somewhere around 20 &lt;strong&gt;interview&lt;/strong&gt; loops in a single job search. Some went well. Some went so poorly I still think about them in the shower. But here's the thing: at least I knew what I was prepping for. LeetCode mediums, maybe a SQL round, maybe a system design conversation. The format was predictable, even if it was stupid. That era is over, and what replaced it is somehow worse.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;data engineering&lt;/strong&gt; community has been screaming for years that &lt;strong&gt;DSA&lt;/strong&gt; doesn't belong in DE interviews. Binary tree traversals, dynamic programming, graph algorithms; none of this maps to the actual job. The actual job is debugging why a pipeline silently dropped 2M rows last Tuesday, not implementing Dijkstra's algorithm on a whiteboard. Reddit finally agreed. r/dataengineering blew up over it. The "NoMoreBigONotations" thread went viral. Companies listened. They dropped the algorithmic rounds.&lt;/p&gt;

&lt;p&gt;And then they replaced them with absolute chaos.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DSA Never Fit Data Engineering in the First Place
&lt;/h2&gt;

&lt;p&gt;Let's be clear about something: &lt;strong&gt;LeetCode&lt;/strong&gt; was never a valid proxy for data engineering skill. It was a borrowed ritual from software engineering interviews that nobody bothered to adapt. Data engineers are rarely expected to write complex algorithms from scratch. We use pre-built libraries and frameworks. The daily work is SQL, pipeline architecture, data modeling, debugging, cost optimization, and dealing with upstream teams who break contracts without telling you.&lt;/p&gt;

&lt;p&gt;The best data engineers I've worked with would struggle on a LeetCode hard. And the engineers who ace competitive programming challenges? They frequently struggle with data modeling, pipeline design, and the kind of real-world optimization that actually matters. It's an inverse correlation, and it's been staring us in the face for years.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;DSA is a mechanism to rank candidates; not an indicator of data engineering experience. Accept it for the arbitrary IQ measuring stick that it is.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;26% of data engineering job ads in 2026 don't even mention education requirements anymore. The industry is finally pivoting toward practical skill assessment. Hiring timelines now exceed 60 to 90 days for complex enterprise roles. Interview loops run 5 to 7 rounds. And yet, the most important question remains unanswered: what are we actually testing for?&lt;/p&gt;

&lt;p&gt;Most candidates don't fail data engineering interviews because of SQL or Python. They fail because they can't connect everything together under pressure and communicate it clearly. That's a completely different skill than reversing a linked list.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Replacement: Three Interviews, Zero Consensus
&lt;/h2&gt;

&lt;p&gt;Here's where it gets ugly. Companies dropped DSA and replaced it with whatever their hiring manager felt like that quarter. There is no standard. There is no consensus. There is barely even a pattern.&lt;/p&gt;

&lt;p&gt;Company A wants you to do a 60-minute Cursor-based live build where you implement a feature in a real codebase. Company B wants pure system design: vague, open-ended, no single correct answer, and every interviewer weights trade-offs differently. Company C sends you the interview rules 24 hours before the onsite, and those rules contradict what the recruiter told you two weeks ago. Company D gives you an 8-hour take-home that's definitely 15 hours of work and pays you nothing for it.&lt;/p&gt;

&lt;p&gt;If you're running parallel loops (and you should be; it's the only sane strategy), you are now simultaneously prepping for three completely different skill sets with zero overlap. One company allows Cursor, one bans it, one grades on "cleverness," one grades on "correctness." This isn't a hiring process. It's a lottery where you don't know which ticket you bought.&lt;/p&gt;

&lt;p&gt;Startups compress everything into 2 to 3 rounds focused on "can you ship on day one." Big Tech runs 4 to 6 standardized rounds emphasizing system design and scale. Mid-market companies? They interview data engineers like they're software engineers, because nobody told them not to. Candidates get blindsided. You prep like it's a data role and walk into SWE-level production-grade coding requirements with full test suites.&lt;/p&gt;

&lt;p&gt;For the architecture-style rounds, &lt;a href="https://www.datadriven.io" rel="noopener noreferrer"&gt;datadriven.io&lt;/a&gt; lets you work through the pipeline-design and data-modeling drills end-to-end instead of just reading about them. That matters, because system design is actually harder to prepare for than LeetCode. At least with DSA, there's consensus on what a good answer looks like. System design? No rubric. No "correct" answer. And every interviewer has a different opinion on whether you should optimize for cost, latency, or data freshness. You're training for a ghost target.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Made It Worse, Not Better
&lt;/h2&gt;

&lt;p&gt;Here's the part nobody wants to say out loud: AI didn't lower the interview bar. It raised it invisibly.&lt;/p&gt;

&lt;p&gt;Canva replaced its "Computer Science Fundamentals" round with "AI-Assisted Coding" in mid-2025. Candidates now face vague, open-ended challenges like "design an aircraft takeoff and landing control system." 64% of companies still ban AI in interviews, but 80% of candidates use LLMs anyway on take-homes. Meanwhile, 67% of startups explicitly allow AI. Meta, Rippling, Google, Canva, and Shopify all permit AI use in live technical sessions. The policy landscape is a mess.&lt;/p&gt;

&lt;p&gt;One CTO told a candidate mid-interview to leave Cursor on. "We want to see how you solve this with AI." The problems got harder. When AI handles the boilerplate, the interviewer's expectations shift from "can you code?" to "can you architect while AI codes for you?" That's a completely different evaluation, and most candidates aren't ready for it.&lt;/p&gt;

&lt;p&gt;The goal has evolved: interviewers want to understand how you evaluate, modify, and trust AI-generated answers. Seniors use AI to compress tedious work while maintaining design control. Staff engineers direct AI through complex tasks while monitoring quality. But here's the problem; nobody tells you which version of this test you're walking into. One company wants to see you pair-program with Cursor like it's a junior engineer on your team. The next company will disqualify you for opening ChatGPT.&lt;/p&gt;

&lt;p&gt;Companies publicly mandate AI usage daily in production, then secretly ban it in interviews. That's not a hiring process. That's a credibility gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Hiring Managers Say They Want (When They Bother to Say Anything)
&lt;/h2&gt;

&lt;p&gt;I've been on &lt;strong&gt;hiring&lt;/strong&gt; panels where we passed on strong candidates for the dumbest reasons. So let me tell you what actually separates the hires from the passes, at least at companies that have thought about it for more than five minutes.&lt;/p&gt;

&lt;p&gt;They want problem-solving mindset over tool knowledge. If you walk into an architecture round and start listing tools instead of describing the problem you're solving, that's a concern. Concepts transfer across tools; tool knowledge doesn't transfer across concepts. This has always been true, and it's finally becoming the interview thesis at companies that are paying attention.&lt;/p&gt;

&lt;p&gt;They want business literacy. A query that runs in 3 seconds instead of 30 might save a downstream BI team hours of waiting. Does the candidate connect technical decisions to business outcomes? If your pipeline is technically perfect but ignores downstream consumers or compliance, you're not a hire. You're a liability.&lt;/p&gt;

&lt;p&gt;They want you to reason about boundaries. Don't propose a single-pattern solution. Describe the boundary between patterns and the contracts that flow across it. That's the senior signal. At staff level, they want to see you prevent problems, not just solve them.&lt;/p&gt;

&lt;p&gt;The irony is thick: these are all reasonable things to test for. But about a third of interview loops include a dedicated data modeling round. A third. The single most important skill in data engineering, and two-thirds of companies don't even have a round for it. They'll spend 45 minutes on a LeetCode medium (or its chaotic replacement) and zero minutes on whether you understand grain, slowly normalized schemas, or why wide denormalized tables with complex types are eating star schema alive.&lt;/p&gt;

&lt;p&gt;Cloud cost efficiency is now one of the highest-scored interview categories. Companies are tying bonus incentives to cloud cost optimizations. This makes sense. Storage is 2 cents per GB per month. Engineer time is $100 an hour. The economics killed star schema, and now they're killing the interview formats that don't test for economic reasoning.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Problem Is Nobody Wants to Admit
&lt;/h2&gt;

&lt;p&gt;The inconsistency isn't accidental. It's evidence that the role itself transformed faster than hiring practices could keep up.&lt;/p&gt;

&lt;p&gt;Between 2023 and 2026, data engineering moved from "batch ETL plumber" to a role that combines real-time architecture, cloud cost optimization, metadata governance, platform engineering, and AI integration. Companies testing SQL plus system design plus Cursor builds aren't being random. They're testing for three different versions of the job simultaneously because they don't yet know which version matters most.&lt;/p&gt;

&lt;p&gt;That's not an excuse. It's a diagnosis.&lt;/p&gt;

&lt;p&gt;The community is furious not because DSA is gone, but because at least DSA was consistent. You could grind 50 mediums and be solid. Now? 97% of data engineers report burnout. 70% are likely to leave their jobs within 12 months. Hiring timelines stretch past 90 days. And at the end of that timeline, you might get an offer, be told it was sent, never receive it, do four more rounds, pass again, and have the headcount closed. I'm not making that up. That happened to me.&lt;/p&gt;

&lt;p&gt;The interview process isn't designed for candidates. It's designed for companies to feel thorough. The data engineering community won the argument against DSA, and the prize was chaos.&lt;/p&gt;

&lt;p&gt;I've been through three waves of "data engineering is getting automated away." Still here. Still employed. Still debugging the same categories of problems. The tools change every 18 months. The problems don't change. Schema drift, late-arriving data, upstream teams breaking contracts without telling you. These are eternal. The interview formats will eventually stabilize around testing for these eternal problems.&lt;/p&gt;

&lt;p&gt;Until then? Treat prep like a job. Accept that every loop will be different. Ask recruiters what types of questions to expect; and if you don't get good answers, look online and at the job description. Prep for system design, SQL fluency, data modeling, and yes, basic Python. Cover the surface area because nobody else is going to narrow it down for you.&lt;/p&gt;

&lt;p&gt;What's the worst interview format you've encountered since companies started dropping DSA rounds? I genuinely want to know, because I thought my eight-round saga was bad, and I keep hearing stories that make it look quaint.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>interview</category>
      <category>career</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Junior Data Engineers Are Getting Wiped Out. Seniors Are Thriving.</title>
      <dc:creator>DataDriven</dc:creator>
      <pubDate>Tue, 12 May 2026 10:05:09 +0000</pubDate>
      <link>https://dev.clauneck.workers.dev/datadriven/junior-data-engineers-are-getting-wiped-out-seniors-are-thriving-4j7d</link>
      <guid>https://dev.clauneck.workers.dev/datadriven/junior-data-engineers-are-getting-wiped-out-seniors-are-thriving-4j7d</guid>
      <description>&lt;p&gt;Three years ago, a company I was at hired eight junior data engineers in a single quarter. Boilerplate ETL, basic SQL transforms, test scaffolding, docs. The standard apprenticeship pipeline. Last month, that same company posted two senior DE roles and zero junior ones. The eight seats are gone. Not frozen; gone. The work those engineers did still gets done. An LLM and two staff engineers handle it now.&lt;/p&gt;

&lt;p&gt;This isn't a hot take. It's Q1 2026 by the numbers: 52,050 tech &lt;strong&gt;layoffs&lt;/strong&gt; announced in the first three months of the year, a 40% jump over Q1 2025. Nearly half of those cuts were attributed to AI-driven automation. And the people getting cut aren't the ones designing pipeline architectures or negotiating data contracts with upstream teams. They're the ones writing the boilerplate that AI now generates on demand.&lt;/p&gt;

&lt;p&gt;The seniority bifurcation in &lt;strong&gt;data engineering&lt;/strong&gt; is real, it's accelerating, and if you're early in your &lt;strong&gt;career&lt;/strong&gt;, you need to understand the mechanics of it before you can do anything about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Junior Toolkit Got Automated First
&lt;/h2&gt;

&lt;p&gt;Here's what a typical &lt;strong&gt;junior&lt;/strong&gt; data engineer did two years ago: wrote basic ETL scripts, generated dbt models from specs, built simple Airflow DAGs, ran data quality checks, documented schemas. Useful work. Necessary work. Also, as it turns out, exactly the kind of work that LLMs are terrifyingly good at.&lt;/p&gt;

&lt;p&gt;The numbers are brutal. 70% of data quality checks are now automated. 65% of ETL/ELT pipeline design can be generated by AI code assistants. SQL generation tools hit 90% accuracy on first pass. Developers report 88% productivity increases with AI, spending 60% less time on boilerplate code, database schemas, and API creation.&lt;/p&gt;

&lt;p&gt;That's not "AI is coming for your job" fear-mongering. That's the specific, measurable erosion of the tasks that justified hiring someone at $72K to sit in a seat and learn.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The work isn't gone. The justification for hiring someone cheap to do it is.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Companies that used to bring on cohorts of 5 to 10 junior engineers now handle the same workload with 2 to 3 seniors plus AI tooling. Entry-level data engineer positions dropped 20 to 35% globally over the past 12 months. Recently hired workers (42%) and entry-level employees (41%) face disproportionate layoff risk compared to senior cohorts. The apprenticeship ladder that built every &lt;strong&gt;senior&lt;/strong&gt; engineer reading this article is being pulled up behind us.&lt;/p&gt;

&lt;p&gt;And here's the part that should make you uncomfortable if you're a senior who benefited from that ladder: this isn't a technology readiness problem. There's a fascinating gap in the data. Data engineers show 75% theoretical AI exposure but only 37% observed exposure. Companies &lt;em&gt;know&lt;/em&gt; AI can automate junior work. Many just haven't pulled the trigger yet because complex data systems break in unexpected ways and they'd rather keep a human in the loop than risk a silent pipeline failure from auto-generated code.&lt;/p&gt;

&lt;p&gt;That gap is closing. Fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Seniors Aren't Just Surviving; They're Getting Promoted
&lt;/h2&gt;

&lt;p&gt;While junior roles contract, the senior market is doing something counterintuitive: growing. &lt;strong&gt;Senior&lt;/strong&gt; data engineer compensation is up 12 to 18% year over year. Base salaries hold at $147K to $179K nationally, with top talent in SF commanding $233K. Engineers with Databricks or Snowflake certifications see a 10 to 15% premium on top of that. Roles with demonstrated AI skills command another 15 to 30% salary premium.&lt;/p&gt;

&lt;p&gt;40% of data teams actually grew in 2025, up from 14% the year before, and budgets increased 30%. Read that again. Layoffs and growth are happening simultaneously. That's not contradictory; it's compositional. Companies are cutting junior headcount and reinvesting in senior hires who can own broader scope with AI leverage.&lt;/p&gt;

&lt;p&gt;The global data engineering market hit $105 billion in 2026 and is projected to reach $213 billion by 2031. The Bureau of Labor Statistics projects 36% job growth through 2034. Data engineering is not dying. It's not shrinking. It's getting more expensive and more senior.&lt;/p&gt;

&lt;p&gt;I've been through three waves of "data engineering is getting automated away." Still here. Still employed. Still debugging the same categories of problems. Schema drift, late-arriving data, upstream teams breaking contracts without telling you. These are eternal. AI doesn't fix them because they're not code problems; they're judgment problems, communication problems, business context problems. The kind of problems you can only solve after years of getting burned by them.&lt;/p&gt;

&lt;p&gt;The role is shifting from pipeline plumber to system architect. Senior DEs are moving up the stack while entry-level boilerplate gets consumed by tools. The engineers who thrive won't write the most SQL; they'll design the frameworks that let AI write SQL safely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Skills That Actually Matter Now
&lt;/h2&gt;

&lt;p&gt;The bar for what counts as "data engineering skills" moved. A few years ago, you could be a strong DE focused mainly on batch ETL and warehousing. Now teams expect you to support ML workflows, real-time data needs, governance, and cost optimization, all under the same job title.&lt;/p&gt;

&lt;p&gt;Streaming infrastructure went from "nice to have" to competitive moat. Uber launched IngestionNext in March 2026, cutting data latency from hours to minutes and reducing compute costs 25% with Kafka, Flink, and Hudi. I still maintain that most companies don't need streaming (most of y'all don't), but the companies that &lt;em&gt;do&lt;/em&gt; need it are the ones paying $250K+ for the engineers who can build it.&lt;/p&gt;

&lt;p&gt;Cloud proficiency is non-negotiable; over 94% of enterprises have adopted cloud. AI skill requirements appear in 71% of U.S. tech job postings, up 181% year over year. And the real shortage isn't data engineers; it's governance experts wearing data engineer hats. Companies that used to treat governance as a separate function now embed it in every DE hire. If you can articulate data lineage, PII handling, and audit trails, you command a premium. If you can only write Spark jobs, you're becoming a commodity.&lt;/p&gt;

&lt;p&gt;The concept still holds: learn data modeling, query optimization, understanding why things break. Those transfer across every tool. But the floor has risen. The minimum viable senior DE in 2026 needs architecture thinking, AI fluency, governance awareness, and cloud-native platform skills. For the architecture and data modeling side of interview prep, &lt;a href="https://www.datadriven.io" rel="noopener noreferrer"&gt;datadriven.io&lt;/a&gt; lets you work through pipeline-design and modeling drills end-to-end instead of just reading about them; that kind of hands-on practice is what actually builds the muscle.&lt;/p&gt;

&lt;p&gt;Hiring timelines for senior roles have stretched to 60 to 90 days in enterprise settings. That's not bureaucracy; that's scarcity. Companies can't find enough people who combine architecture, AI integration, governance, and platform engineering in a single candidate. The 250,000-person shortage in AI/ML skillsets compounds everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can Juniors Still Break In?
&lt;/h2&gt;

&lt;p&gt;Yes. But not the way it used to work.&lt;/p&gt;

&lt;p&gt;The direct path into data engineering is mostly gone. "Data engineer" is not an entry-level position. It combines business context, analytics insight, infrastructure, software engineering, and SRE. The industry consensus now expects 2 to 6 years of prior experience, not a first career jump.&lt;/p&gt;

&lt;p&gt;The realistic path looks like this: start as a SQL-heavy data analyst, analytics engineer, DBA, or backend engineer. Spend 18 to 24 months building production experience and domain knowledge. Then transition to DE internally or through a targeted job search. This detour is becoming standard, not exceptional.&lt;/p&gt;

&lt;p&gt;If you're 3 years into an adjacent role running pipelines in production, that's not "close to being ready." You're doing the job. Stop discounting what you've already built.&lt;/p&gt;

&lt;p&gt;Portfolio projects help demonstrate skills but rarely replace production experience. That's the catch-22. You can't get production experience without the role, and you can't get the role without production experience. The way through is the adjacent role. Analyst to analytics engineer to data engineer. It's longer. It works.&lt;/p&gt;

&lt;p&gt;IBM tripled entry-level hiring in 2026, explicitly stating that AI still needs a human touch. That's an outlier, but it proves the path isn't completely closed. Some enterprises still see juniors as necessary friction-catchers. The BLS projects data engineering as one of the fastest-growing roles through 2030. The demand is there; it's just shifted upward in seniority.&lt;/p&gt;

&lt;p&gt;Here's what I'd tell anyone trying to break in right now: stop learning tools. Learn concepts. Data modeling is the core skill. Getting the model wrong upstream means everything downstream is pain. Pick one orchestration tool, build something small that forces you to deal with failures, retries, and alerting. Then pick the next one. Treat the job search like a job. I did somewhere around 20 interview loops in a single search. Some went well. Some went laughably poorly. The grind is the strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ladder Problem
&lt;/h2&gt;

&lt;p&gt;The uncomfortable truth behind all of this is structural. AI creates more high-leverage work for seniors while erasing the stepping stones juniors traditionally used to become seniors. The boilerplate ETL, the basic SQL, the test generation; that was the apprenticeship. That was how you learned why pipelines break, how schemas drift, what happens when upstream teams push breaking changes at 2am. If AI handles all of that, where do future senior engineers come from?&lt;/p&gt;

&lt;p&gt;Nobody's talking about this enough. The industry is celebrating productivity gains without asking what the pipeline (the human one) looks like in five years. Junior engineers who never debug a failed DAG because AI handles it won't develop the foundational understanding necessary to debug complex systems when the AI fails. And AI will fail. It always does, usually at 2am, usually on the pipeline that finance depends on for board decks.&lt;/p&gt;

&lt;p&gt;The data engineering career isn't dying. It's bifurcating. Senior roles are growing, compensation is climbing, and the problems are getting harder and more strategic. Junior roles are contracting, the bar for entry is rising, and the old apprenticeship model is breaking down. Both of these things are true simultaneously.&lt;/p&gt;

&lt;p&gt;I'm not a doomer about this. The field is healthy, expanding, and full of hard problems worth solving. But the path in looks nothing like it did three years ago, and pretending otherwise is a disservice to every bootcamp grad refreshing LinkedIn right now.&lt;/p&gt;

&lt;p&gt;If you're senior: you're in a strong position. Use the leverage. Learn the AI tooling. Move up the stack.&lt;/p&gt;

&lt;p&gt;If you're junior: the path is longer and harder than it was. That's not your fault. It's the industry being the industry. Start adjacent, build real production experience, focus on concepts over tools, and grind.&lt;/p&gt;

&lt;p&gt;What's your read on the junior pipeline problem? Are we building a generation of seniors who never went through the apprenticeship, or will the path just look different? Genuinely curious what people on both sides are seeing.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>career</category>
      <category>beginners</category>
      <category>python</category>
    </item>
  </channel>
</rss>
