
Another look into PostgreSQL CTE materialization and non-idempotent subqueries

May 4, 2025
~7 mins

A few days ago, I wrote about a surprising planner behavior with CTEs, DELETE, and LIMIT in PostgreSQL, a piece I hastily put together on a bus ride. That post only scratched the surface of a deeper issue, one I’ve since spent way too many hours exploring. So here are some more fully formed thoughts and findings.

The core issue revisited

Let’s quickly recap: the planner may execute a query like the following in ways you don’t expect:

WITH deleted_tasks AS (
  DELETE FROM task_queue
  WHERE id IN (
    SELECT id FROM task_queue
    WHERE queue_group_id = 15
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING item_id
)
SELECT * FROM deleted_tasks;

…can sometimes delete multiple rows despite the LIMIT 1.

This is what took me by surprise at first: the planner may execute the subquery containing LIMIT N multiple times within the execution of a single DELETE command.

Why does this happen?

PostgreSQL’s query planner tries to optimize the overall query execution. In some cases, its cost-based analysis might lead it to choose a plan where the subquery (the part with LIMIT) gets executed multiple times within the context of the single DELETE command. If the subquery doesn’t guarantee returning the same rows each time (e.g., due to LIMIT without ORDER BY, or because of FOR UPDATE SKIP LOCKED), each execution can target different rows, leading to more rows being deleted than the LIMIT intended.

However, there is more nuance to this, so let’s see how different ways of writing the query affect the execution plan and the outcomes.
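All the plans below come from a small test table. Here is a minimal setup sketch that would match the queries and plans in this post; the table, column, and index names come from the queries and EXPLAIN output, while the column types, group count, and row counts are my assumptions:

-- Hypothetical test setup; types and row counts are guesses
CREATE TABLE task_queue (
  id             bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  item_id        bigint NOT NULL,
  queue_group_id integer NOT NULL
);

CREATE INDEX idx_task_queue_group_id ON task_queue (queue_group_id);

-- Seed 20 queue groups with 1,000 tasks each
INSERT INTO task_queue (item_id, queue_group_id)
SELECT g, g % 20
FROM generate_series(1, 20000) AS g;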

Problematic base case: IN with LIMIT N

This is the query we started with, demonstrating the issue.

-- Original problematic query
EXPLAIN ANALYZE
WITH deleted_tasks AS (
  DELETE FROM task_queue
  WHERE id IN (
    SELECT id FROM task_queue t2
    WHERE t2.queue_group_id = 5 -- Using group 5 from test data
    LIMIT 10                   -- Limit without ORDER BY
    FOR UPDATE SKIP LOCKED
  )
  RETURNING item_id
)
SELECT * FROM deleted_tasks;

-- Plan
Delete on task_queue ... (loops=1)
  ->  Nested Loop Semi Join ... (loops=1)
        ->  Seq Scan on task_queue ... (loops=1)
        ->  Subquery Scan on "ANY_subquery" ... (loops=14) -- <<< SUBQUERY INVOLVED IN MANY LOOPS
              ->  Limit ... (actual time=... rows=1 loops=14)  -- <<< LIMIT EXECUTED MANY TIMES
                    ->  Index Scan on task_queue t2 ...

Behavior: Incorrect. The non-deterministic LIMIT 10 subquery is executed multiple times (14 loops in the example plan). Each execution can also select different rows, causing the DELETE to potentially affect far more than 10 rows.
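One way to observe this without reading plans is to simply count what RETURNING hands back. A quick check, reusing the problematic query (on an unlucky plan, deleted_count exceeds 10):

-- Count the rows the DELETE actually removed
WITH deleted_tasks AS (
  DELETE FROM task_queue
  WHERE id IN (
    SELECT id FROM task_queue t2
    WHERE t2.queue_group_id = 5
    LIMIT 10
    FOR UPDATE SKIP LOCKED
  )
  RETURNING item_id
)
SELECT count(*) AS deleted_count FROM deleted_tasks;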

Adding determinism with ORDER BY

Let’s make the subquery deterministic regarding row choice by adding ORDER BY and see what happens:

EXPLAIN ANALYZE
WITH deleted_tasks AS (
  DELETE FROM task_queue
  WHERE id IN (
    SELECT id FROM task_queue t2
    WHERE t2.queue_group_id = 5
    ORDER BY id -- <<< Added ORDER BY
    LIMIT 10
    FOR UPDATE SKIP LOCKED
  )
  RETURNING item_id
)
SELECT * FROM deleted_tasks;

-- Plan
Delete on task_queue ...
  ->  Nested Loop Semi Join ... (loops=1)
        ->  Seq Scan on task_queue ... (loops=1)
        ->  Subquery Scan on "ANY_subquery" ... (loops=14) -- <<< STILL MIGHT BE EXECUTED MULTIPLE TIMES
              ->  Limit ... (actual time=... rows=10 loops=14)

Behavior: Still potentially incorrect. Adding ORDER BY id makes the row selection deterministic in isolation, but combined with FOR UPDATE SKIP LOCKED the subquery’s result can still change on each execution. If the planner executes the subquery multiple times (as shown by loops=14), each pass may lock and skip a different set of rows. Because of that locking side effect, the set of ids returned by the subquery isn’t stable across executions, so the DELETE can still affect more rows than LIMIT specifies. ORDER BY only guarantees which rows are considered first, not that only LIMIT rows will be deleted in total across all executions.
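To make that locking side effect concrete, here is a contrived two-session illustration (the specific ids are made up): the same statement returns different rows depending on what happens to be locked at the moment it runs.

-- Session A: lock the first 10 ids of group 5, keep the transaction open
BEGIN;
SELECT id FROM task_queue
WHERE queue_group_id = 5
ORDER BY id
LIMIT 10
FOR UPDATE SKIP LOCKED;  -- say this returns ids 1..10

-- Session B: the identical statement, while A is still open
SELECT id FROM task_queue
WHERE queue_group_id = 5
ORDER BY id
LIMIT 10
FOR UPDATE SKIP LOCKED;  -- skips A's locked rows, returns ids 11..20 instead

-- Same query text, different result sets: any plan that evaluates this
-- subquery more than once is chasing a moving target.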

More determinism with ORDER BY (without locks)

What if we remove FOR UPDATE SKIP LOCKED but keep the ORDER BY? That should make the subquery truly deterministic:

EXPLAIN ANALYZE
WITH deleted_tasks AS (
  DELETE FROM task_queue
  WHERE id IN (
    SELECT id FROM task_queue t2
    WHERE t2.queue_group_id = 5
    ORDER BY id -- <<< Keep ORDER BY
    LIMIT 10
    -- <<< No FOR UPDATE SKIP LOCKED
  )
  RETURNING item_id
)
SELECT * FROM deleted_tasks;

-- Plan
Delete on task_queue ... (loops=1)
  ->  Nested Loop Semi Join ... (loops=1)
        ->  Seq Scan on task_queue ... (loops=1)
        ->  Subquery Scan on "ANY_subquery" ... (loops=12) -- <<< STILL EXECUTED MULTIPLE TIMES
              ->  Limit ... (actual time=... rows=10 loops=12)
                    ->  Sort ... (actual time=... rows=... loops=12)
                          ->  Index Scan on task_queue t2 ...

Behavior: Correct, but potentially inefficient. Removing SKIP LOCKED while keeping ORDER BY id makes the subquery fully idempotent – it will always return the exact same set of 10 ids. Because that set is stable, the DELETE correctly affects only those specific 10 rows, respecting the LIMIT. However, as the example plan shows (loops=12), the planner is still not forced to evaluate the subquery only once; it may choose a plan that executes the idempotent subquery repeatedly. This doesn’t affect correctness, but it does mean redundant work compared to forcing a single execution.

Forcing single evaluation: LIMIT 1 with the equals operator

When dealing specifically with LIMIT 1, using = provides a strong hint to the planner.

-- Using = for LIMIT 1
EXPLAIN ANALYZE
DELETE FROM task_queue
WHERE id = ( -- <<< Using =
  SELECT id FROM task_queue
  WHERE queue_group_id = 5
  ORDER BY id -- Keep ORDER BY for determinism
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING item_id;

-- Plan
Delete on task_queue (cost=0.57..8.59 ...)
  ->  Nested Loop (cost=0.57..8.59 ...)
        ->  Result (cost=0.00..0.01 ...)
              InitPlan 1 (returns $0) -- <<< SUBQUERY EVALUATED ONCE HERE
                ->  Limit (cost=0.29..0.43 ... loops=1) -- <<< loops=1
                      ->  Index Scan ... (queue_group_id = 5)
        ->  Index Scan using task_queue_pkey on task_queue (cost=0.28..8.15 ...)
              Index Cond: (id = $0)

Behavior: Correct. The = operator expects a single scalar value, forcing the planner to evaluate the subquery once (often via an InitPlan) to get that value before proceeding; loops=1 confirms the single execution. This sidesteps both the non-determinism and the SKIP LOCKED side effects. However, it only works when your LIMIT is always exactly 1.
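Since this shape drops naturally into a worker loop, here is a hypothetical wrapper for it; the function name is mine, and it assumes item_id is a bigint:

-- Hypothetical helper: delete and return at most one task from a group
CREATE FUNCTION dequeue_one_task(p_group integer)
RETURNS bigint
LANGUAGE sql
AS $$
  DELETE FROM task_queue
  WHERE id = (
    SELECT id FROM task_queue
    WHERE queue_group_id = p_group
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING item_id;
$$;

-- Usage: returns an item_id, or NULL when no unlocked task is available
SELECT dequeue_one_task(5);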

Explicit CTE materialization (PostgreSQL 12+)

For cases involving LIMIT (especially LIMIT > 1), or whenever you need absolute certainty while using a CTE, explicitly materializing the CTE is the most robust approach:

EXPLAIN (ANALYZE, BUFFERS)
WITH rows_to_delete AS MATERIALIZED (
  SELECT id FROM task_queue
  WHERE queue_group_id = 5 -- Using group 5
  LIMIT 10
  FOR UPDATE SKIP LOCKED
)
DELETE FROM task_queue
WHERE id IN (SELECT id FROM rows_to_delete)
RETURNING item_id;

-- Plan
Delete on task_queue (cost=0.43..8.64 rows=10 width=6)
  CTE rows_to_delete
    ->  Limit (cost=0.29..0.59 rows=10 width=4) (actual time=0.025..0.027 rows=10 loops=1) -- <<< loops=1
          ->  Index Scan using idx_task_queue_group_id on task_queue task_queue_1 (cost=0.29..32.54 rows=1000 width=4)
                Index Cond: (queue_group_id = 5)
  ->  Nested Loop (cost=0.29..8.00 rows=10 width=12)
        ->  CTE Scan on rows_to_delete (cost=0.00..0.20 rows=10 width=4)
        ->  Index Scan using task_queue_pkey on task_queue (cost=0.28..0.78 rows=1 width=14)
              Index Cond: (id = rows_to_delete.id)

Behavior: Correct and Guaranteed Single Execution. This is the recommended approach for CTEs involving LIMIT N (especially when N > 1) where correctness and predictable behavior are paramount. The MATERIALIZED keyword explicitly instructs PostgreSQL to execute the CTE query (rows_to_delete) exactly once, storing its results temporarily. The main DELETE query then operates on this fixed set of results (via CTE Scan). This completely prevents the subquery re-evaluation issue seen earlier, ensuring correctness (exactly LIMIT N rows identified, respecting ORDER BY and SKIP LOCKED within that single execution) and predictable performance.
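To tie it together, the materialized pattern could be wrapped the same way for batch dequeueing; again a hypothetical sketch, with the function name and parameters being mine:

-- Hypothetical helper: delete and return up to p_batch tasks from a group
CREATE FUNCTION dequeue_tasks(p_group integer, p_batch integer)
RETURNS SETOF bigint
LANGUAGE sql
AS $$
  WITH rows_to_delete AS MATERIALIZED (
    SELECT id FROM task_queue
    WHERE queue_group_id = p_group
    LIMIT p_batch
    FOR UPDATE SKIP LOCKED
  )
  DELETE FROM task_queue
  WHERE id IN (SELECT id FROM rows_to_delete)
  RETURNING item_id;
$$;

-- Usage: SELECT * FROM dequeue_tasks(5, 10);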

Key takeaways

Understanding CTE materialization and subquery evaluation has been a bit of an eye-opening look into PostgreSQL query planning for me. So, here are some takeaways that come to mind:

- A subquery with LIMIT inside DELETE ... WHERE id IN (...) can be executed multiple times within a single statement; the LIMIT bounds each execution, not the statement as a whole.
- ORDER BY alone doesn’t make such a subquery safe when FOR UPDATE SKIP LOCKED is involved, because the locking side effect makes each execution’s result depend on the current lock state.
- For LIMIT 1, writing WHERE id = (SELECT ...) forces a single evaluation of the subquery, typically via an InitPlan.
- For LIMIT N, or whenever you need absolute certainty, AS MATERIALIZED guarantees the CTE is executed exactly once and is the most robust fix.

Lastly, all of this has been a journey of much trial and error and spelunking through the PostgreSQL code. If you find that I missed or overlooked something, I’d love to learn.
