
Mutable atomic deletes with Parquet backed columnar tables on S3

October 12, 2025
~10 mins


In the previous post, I explored a Parquet-on-S3 design with tombstones for constant-time deletes and a CAS-updated manifest for snapshot isolation. This post extends that design. The focus is on in-file delete operations, where we replace a Parquet row group and publish a new footer using S3 Multipart Upload (MPU) and UploadPartCopy, without downloading and rebuilding unchanged bytes. We preserve the same system model, with immutable data files and a manifest pointer that serializes visibility.

We’re not replacing tombstones—we’re defining when a targeted row‑group rewrite is warranted, under what constraints, and how it fits the tiny‑catalog (manifest) model.

Goals and constraints

What I am optimizing for

What the system gives us and expects from us

Recap: Parquet file + table setup

A Parquet file has a data section split into row groups and a footer that records schema, per‑group offsets/sizes/stats, optional indexes, then the 4‑byte footer length and the PAR1 tail. Note: We use those absolute offsets later (rg_start, rg_end, footer_start).

Parquet file (compressed, ~128–256 MiB)

┌──────────────────────────────────────────────────────────────────────┐
│ DATA SECTION                                                         │
│  ┌─────────────┬─────────────┬─────────────┬───────────────┐         │
│  │ Row Group 0 │ Row Group 1 │ Row Group 2 │      ...      │         │
│  │ (columns → pages → stats per group)                     │         │
│  └─────────────┴─────────────┴─────────────┴───────────────┘         │
│                                                                      │
│ FOOTER                                                               │
│  - schema                                                            │
│  - row-group directory (offsets, sizes, stats)                      │
│  - optional column/offset indexes                                    │
│  - 4‑byte footer length + PAR1 tail                                  │
└──────────────────────────────────────────────────────────────────────┘
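To make the tail concrete, here is a minimal sketch of pulling those offsets down with two small range reads, assuming boto3 and a hypothetical bucket/key: the last 8 bytes give the footer length, and a second ranged GET fetches the footer itself.

```python
import struct

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"                                        # hypothetical
KEY = "mytable/data/2025/10/12/09/f81d4fae.parquet"

# Tail of every Parquet file: [footer][4-byte little-endian footer length]["PAR1"]
tail = s3.get_object(Bucket=BUCKET, Key=KEY, Range="bytes=-8")["Body"].read()
footer_len = struct.unpack("<I", tail[:4])[0]
assert tail[4:] == b"PAR1"

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
footer_start = size - 8 - footer_len

# Second range read brings down just the footer, no data pages
footer_bytes = s3.get_object(
    Bucket=BUCKET, Key=KEY, Range=f"bytes={footer_start}-{size - 9}"
)["Body"].read()
```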

Table shape and read pattern at a glance

A tiny mutable head _latest_manifest (updated with CAS, compare-and-swap) points to an immutable manifest/vN.json. That manifest lists the current set of write‑once Parquet data files. Readers first fetch the head to learn the version, then fetch that manifest and range‑read only the files it lists. This post (unlike the previous post) does not use tombstones; the manifest alone defines visibility.

Head (CAS) → Snapshot → Data files

┌ _latest_manifest ┐
│ {"version": N}   │
└──────┬───────────┘
       │ points to
       ▼
manifest/vN.json
  data_files:
    - data/YYYY/MM/DD/HH/<uuid>.parquet
    - ...
  tombstones: []   # not used here

mytable/
├── _latest_manifest
├── manifest/v00000124.json
└── data/YYYY/MM/DD/HH/<uuid>.parquet
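A reader's path through this layout is short. The sketch below is illustrative (boto3, hypothetical bucket name, manifests shaped like the listing above): fetch the head, resolve the manifest for that version, then touch only the data files it lists, pinning each range read to the object's ETag.

```python
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # hypothetical

# 1) Fetch the tiny mutable head to learn the current version
head = json.loads(
    s3.get_object(Bucket=BUCKET, Key="mytable/_latest_manifest")["Body"].read()
)

# 2) Fetch the immutable manifest for that version
manifest_key = f"mytable/manifest/v{head['version']:08d}.json"
manifest = json.loads(s3.get_object(Bucket=BUCKET, Key=manifest_key)["Body"].read())

# 3) Range-read only the files the manifest lists, pinned to each object's ETag
for rel_key in manifest["data_files"]:
    key = f"mytable/{rel_key}"
    etag = s3.head_object(Bucket=BUCKET, Key=key)["ETag"]
    tail = s3.get_object(
        Bucket=BUCKET,
        Key=key,
        Range="bytes=-8",   # start with the footer tail, as above
        IfMatch=etag,       # the scan fails fast if the object is rewritten mid-read
    )["Body"].read()
```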

Let's begin: What we are building

Say we need to delete only a subset of the data in a Parquet file, inside one row group, for a customer request. The objective is to make that change visible atomically, without rebuilding the entire object and without downloading unchanged bytes. We keep the system model from the previous post: immutable data files and a CAS-updated manifest that serializes visibility. We assemble a new object with MPU (multipart upload) and UploadPartCopy: we download only the target row group and the footer, remove the rows, re-encode that group, and upload the edited bytes plus a new Parquet footer. Unchanged prefix and suffix bytes are copied entirely within S3.

Layout and sizes

We focus on 128–256 MiB Parquet files with many row groups of roughly 8–16 MiB each. This size keeps scan efficiency high and still lets a single edit localize to one row group.

mytable/
  _latest_manifest                     # small, mutable CAS pointer (version + ETag)
  data/
    2025/10/12/09/f81d4fae.parquet
    2025/10/12/10/a1b2c3d4.parquet
    ...

Head points to the current list of data files. Mutations rewrite only the file that contains affected rows and then update _latest_manifest last.

Mutable head + data files

┌ _latest_manifest (CAS) ┐
│ {"version": 123}       │
└──────┬─────────────────┘
       │ points to
       ▼
manifest/v00000123.json → {
  data_files: [
    "data/2025/10/12/09/f81d4fae.parquet",
    "data/2025/10/12/10/a1b2c3d4.parquet",
    ...
  ],
  tombstones: [...]
}

Rewrite only the object that contains the row group, then CAS head to a list with the new object key.

We pack data into 128–256 MiB Parquet objects with many row groups (≈8–16 MiB each). On deletion, we splice a new object via MPU by copying the prefix/suffix and uploading the edited row group and a new footer.

Implementation: Step-by-step replacement of row group

Readers fetch _latest_manifest once, then read only what it references. They never combine old and new.

Byte range view

bytes: 0                     rg_start               rg_end               footer_start                end
       │── PREFIX (copy) ────│█ edited RG (upload) █│── SUFFIX (copy) ───│█ new footer + len + PAR1 █│

How the object is assembled

We use offsets discovered from the Parquet footer:

The prefix (bytes=0…rg_start-1) and suffix (bytes=rg_end…footer_start-1) are copied with UploadPartCopy via CopySourceRange. The final part contains the regenerated Parquet footer bytes followed by the 4‑byte footer length and the trailing PAR1, and is uploaded last.
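One way to derive those offsets is from pyarrow's footer metadata; this is a sketch, and it assumes column chunks are laid out contiguously in column order (which is how common writers arrange them):

```python
import pyarrow.parquet as pq

# Footer metadata only; the handle can be a local copy or a range-backed filesystem
md = pq.ParquetFile("f81d4fae.parquet").metadata
k = 2                      # index of the row group we intend to rewrite
rg = md.row_group(k)

def chunk_start(col):
    # A column chunk begins at its dictionary page if present, else at its first data page
    return col.dictionary_page_offset if col.has_dictionary_page else col.data_page_offset

first_col = rg.column(0)
last_col = rg.column(rg.num_columns - 1)

rg_start = chunk_start(first_col)
rg_end = chunk_start(last_col) + last_col.total_compressed_size
# footer_start comes from the file size and the 8-byte tail, as in the footer read earlier
```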

S3 MPU Assembly (new key)

┌───────────────────────────────────────────────────────────────────────┐
│ Existing Object                                                       │
│ [           PREFIX            ][   ROW GROUP   ][      SUFFIX       ] │
│  bytes 0…rg_start-1           rg_start…rg_end   rg_end…footer_start   │
└───────────────────────────────────────────────────────────────────────┘

Parts:
  A: UploadPartCopy(PREFIX)       ← server-side, no egress
  B: UploadPart(edited RG)        ← ≥ 5 MiB (non-last part rule)
  C: UploadPartCopy(SUFFIX)       ← server-side, no egress
  D: UploadPart(new FOOTER+len+PAR1)  ← last part can be tiny

CompleteMultipartUpload → new object assembled; visibility flips on manifest CAS.

Client and S3 sequence

Client                    S3
  │   GET footer            │
  │────────────────────────▶│ discover ranges & stats
  │                         │
  │   Create MPU            │
  │────────────────────────▶│
  │   Part A: Copy PREFIX   │
  │────────────────────────▶│ (UploadPartCopy)
  │   Part B: Upload RG     │
  │────────────────────────▶│ (UploadPart ≥ 5 MiB)
  │   Part C: Copy SUFFIX   │
  │────────────────────────▶│ (UploadPartCopy)
  │   Part D: Upload FOOTER │
  │────────────────────────▶│ (UploadPart last)
  │   Complete MPU          │
  │────────────────────────▶│ assemble new object (not visible yet)
  │                         │
  │   Reader uses If-Match  │
  │◀────────────────────────│ range `GET`s pinned to `ETag`

This sequence captures the exact order of operations. You discover ranges from the footer, start a multipart upload (MPU), copy the prefix, upload the edited row group, copy the suffix, upload the footer as the last part, and then complete the upload. Readers pin to a single ETag so range requests stay consistent during a scan.
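Put together, the assembly is a handful of boto3 calls. The sketch below uses hypothetical keys and assumes rg_start, rg_end, footer_start, the re-encoded row-group bytes, and the rebuilt footer bytes are already in hand, and that the prefix and suffix are each at least 5 MiB so the non-last-part rule holds.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"                                        # hypothetical
SRC = "mytable/data/2025/10/12/09/f81d4fae.parquet"         # object being edited
DST = "mytable/data/2025/10/12/09/0c6f1e2a.parquet"         # new immutable key

mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=DST)
upload_id = mpu["UploadId"]
parts = []

def record(part_number, resp):
    etag = resp["CopyPartResult"]["ETag"] if "CopyPartResult" in resp else resp["ETag"]
    parts.append({"PartNumber": part_number, "ETag": etag})

# Part 1: server-side copy of the unchanged prefix [0, rg_start)
record(1, s3.upload_part_copy(
    Bucket=BUCKET, Key=DST, UploadId=upload_id, PartNumber=1,
    CopySource={"Bucket": BUCKET, "Key": SRC},
    CopySourceRange=f"bytes=0-{rg_start - 1}"))

# Part 2: the re-encoded row group (>= 5 MiB because it is not the last part)
record(2, s3.upload_part(
    Bucket=BUCKET, Key=DST, UploadId=upload_id, PartNumber=2, Body=edited_rg_bytes))

# Part 3: server-side copy of the unchanged suffix [rg_end, footer_start)
record(3, s3.upload_part_copy(
    Bucket=BUCKET, Key=DST, UploadId=upload_id, PartNumber=3,
    CopySource={"Bucket": BUCKET, "Key": SRC},
    CopySourceRange=f"bytes={rg_end}-{footer_start - 1}"))

# Part 4: rebuilt footer + 4-byte length + PAR1, uploaded last (the last part may be small)
record(4, s3.upload_part(
    Bucket=BUCKET, Key=DST, UploadId=upload_id, PartNumber=4, Body=new_footer_bytes))

# Assemble the new object; nothing is visible to readers until the manifest CAS flips
s3.complete_multipart_upload(
    Bucket=BUCKET, Key=DST, UploadId=upload_id, MultipartUpload={"Parts": parts})
```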

Re‑encoding a row group means decoding the pages for that group, removing the targeted rows, and writing new pages for each column using the same Parquet encodings. Page boundaries can change when rows are removed. That is expected and supported by the format. A new footer is then written that references the new absolute offsets, the new sizes, the updated statistics, and optional index sections for the edited group. Building a footer is not an unconventional trick. Parquet libraries in Java, C++, and Rust expose programmatic writers that let you construct a file from buffers and metadata. The careful part is computing the correct absolute offsets and keeping index offsets in sync if you use column and offset indexes.
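As a rough sketch of the decode-filter-re-encode step using pyarrow (the customer_id predicate is hypothetical; splicing the resulting data pages and per-column metadata into a hand-built footer still means working at the level of the Parquet thrift structures, which the high-level writer does not expose):

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

pf = pq.ParquetFile("f81d4fae.parquet")
k = 2  # row group to edit

# Decode only the target row group
table = pf.read_row_group(k)

# Drop the targeted rows (hypothetical predicate)
keep = pc.invert(pc.is_in(table["customer_id"], value_set=pa.array([42], pa.int64())))
filtered = table.filter(keep)

# Re-encode with settings that match the rest of the file so the edited group blends in
buf = pa.BufferOutputStream()
pq.write_table(
    filtered,
    buf,
    compression="snappy",                  # match the original codec
    use_dictionary=True,
    row_group_size=max(len(filtered), 1),  # keep it as a single row group
    write_statistics=True,
)
mini_file = buf.getvalue().to_pybytes()
# mini_file is itself a tiny one-group Parquet file; its data pages and per-column
# metadata are what get spliced into the new object and the rebuilt footer.
```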

Conceptual implementation

The conceptual implementation would then look something like this:

  1. Read Parquet footer and locate the target row_group_idx entry.
  2. Range GET the row group byte window [rg_start, rg_end).
  3. Decode pages for each column in the group; remove targeted rows; re‑encode pages using the same encodings/compression.
  4. Compute new_rg_size for the edited group; delta = new_rg_size - (rg_end - rg_start).
  5. Rebuild footer:
    • Update total row count.
    • Update per‑column stats for the edited group.
    • Keep prior groups unchanged.
    • Shift absolute offsets for all following groups by delta.
    • Optionally rebuild column/offset indexes for the edited group; others are unchanged.
  6. Assemble the object with MPU: copy prefix [0, rg_start), upload edited RG, copy suffix [rg_end, footer_start), upload new footer+len+PAR1 last.

Offset math

Let RGi.offset be the absolute byte offset of row group i in the file, and let k be the edited group. With delta = new_rg_size - (rg_end - rg_start), groups i ≤ k keep their offsets, and every group i > k moves to RGi.offset + delta.

Stats and optional index offsets for i > k shift by delta if they are stored as absolute offsets; if stored relative to group start they remain the same.
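Expressed over the Parquet thrift metadata, the shift is a small loop. A sketch, assuming the field names from the format's thrift definitions and that your writer populated them:

```python
def shift_following_groups(file_metadata, k, delta):
    """Shift absolute byte offsets for every row group after the edited one by delta."""
    for i, rg in enumerate(file_metadata.row_groups):
        if i <= k:
            continue                      # earlier groups and the edited group stay put
        if getattr(rg, "file_offset", None) is not None:
            rg.file_offset += delta
        for chunk in rg.columns:
            meta = chunk.meta_data
            meta.data_page_offset += delta
            if meta.dictionary_page_offset is not None:
                meta.dictionary_page_offset += delta
            if meta.index_page_offset is not None:
                meta.index_page_offset += delta
            # Column/offset indexes, when written, are also addressed absolutely
            if chunk.column_index_offset is not None:
                chunk.column_index_offset += delta
            if chunk.offset_index_offset is not None:
                chunk.offset_index_offset += delta
```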

API specifics

Use CopySourceRange with the form bytes=first-last for prefix and suffix copies; for example, bytes=0-1048575 copies the first 1 MiB. The source object must be larger than 5 MiB for a range copy, and every non-last part, including copied parts, must meet the 5 MiB minimum. Here MiB means mebibyte, i.e. 1,048,576 bytes.
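One detail worth encoding once: the copy range is inclusive on both ends, so a half-open byte window [start, end) maps to bytes=start-(end-1). A tiny illustrative helper avoids the off-by-one:

```python
def copy_source_range(start: int, end: int) -> str:
    """Format a half-open [start, end) byte window as an inclusive S3 CopySourceRange."""
    return f"bytes={start}-{end - 1}"

copy_source_range(0, 1_048_576)   # 'bytes=0-1048575' -> the first 1 MiB
```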

Why Parquet still works here

Parquet encodes columns inside row groups and stores a footer with offsets and statistics. When you rebuild one row group you change only the bytes for that group across all columns, you shift the absolute offsets for the groups that follow by the size delta, and you update the total row count and the per column statistics for the edited group. Everything else is copied as is with UploadPartCopy and range reads stay large and coalesced.

Caveats and gotchas

S3 and MPU

Parquet specifics

Operational

Atomicity

The mutable _latest_manifest gives snapshot isolation. Readers see a coherent table version as of the head they fetched; new writes are invisible mid-scan. For immediate erasure in versioned buckets, delete prior versions or use per‑object encryption and drop old keys.

A mid‑file row‑group edit behaves the same as any rewrite: the new object is assembled atomically; readers pin scans with If-Match on the observed ETag.
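A sketch of the commit itself, assuming S3 conditional writes (IfNoneMatch/IfMatch on PutObject, supported in recent boto3 releases) and the head/manifest shapes used above:

```python
import json

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # hypothetical

def commit(new_manifest: dict, expected_head_etag: str) -> bool:
    """Publish a new snapshot, then CAS the head; False means a concurrent commit won."""
    version = new_manifest["version"]

    # 1) Write the immutable snapshot first; it stays invisible until the head moves
    s3.put_object(
        Bucket=BUCKET,
        Key=f"mytable/manifest/v{version:08d}.json",
        Body=json.dumps(new_manifest).encode(),
        IfNoneMatch="*",                 # the snapshot key must not already exist
    )

    # 2) Flip the head only if nobody moved it since we read it
    try:
        s3.put_object(
            Bucket=BUCKET,
            Key="mytable/_latest_manifest",
            Body=json.dumps({"version": version}).encode(),
            IfMatch=expected_head_etag,  # compare-and-swap on the head's ETag
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "PreconditionFailed":
            return False                 # lost the race; re-read the head and retry
        raise
```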

View of the head swap

Before commit
_latest_manifest → v123
parquet object @ ETag A

After commit
_latest_manifest → v124
parquet object @ ETag B

Readers pin to a single ETag while scanning

When to choose what

When this fits and when it doesn’t

This pattern fits append‑heavy analytics on S3 where scans project a few columns and you sometimes need immediate erasure of specific rows. It works well if you're comfortable owning a small manifest and a simple GC job. It's not a great fit for OLTP‑style point lookups without a PK locator, high‑churn per‑row updates, or heavy schema evolution and catalog features; in those cases, a full metadata layer like Iceberg or Delta is probably the better choice.

Closing

This keeps the parts small and sharp: immutable data objects, a tiny CAS head, and localized rewrites using MPU + UploadPartCopy. You get Parquet’s scan speed, S3’s range‑read economics, and immediate hard deletes—without a coordinating service. Taken together with the previous post, I’d say these concepts get you close to a database built on modern cloud‑storage APIs for the right patterns while keeping operations simple (mostly) :D.

last modified October 12, 2025