
Mutable atomic deletes with Parquet backed columnar tables on S3

October 12, 2025
~10 mins


In the previous post, I explored a Parquet-on-S3 design with tombstones for constant-time deletes and a CAS-updated manifest for snapshot isolation. This post extends that design. The focus is on in-file delete operations, where we replace a Parquet row group and publish a new footer using S3 Multipart Upload (MPU) and UploadPartCopy, without downloading and rebuilding unchanged bytes. We preserve the same system model, with immutable data files and a manifest pointer that serializes visibility.

We’re not replacing tombstones—we’re defining when a targeted row‑group rewrite is warranted, under what constraints, and how it fits the tiny‑catalog (manifest) model.

Goals and constraints

What I am optimizing for

What the system gives us and expects from us

Recap: Parquet file + table setup

A Parquet file has a data section split into row groups and a footer that records schema, per‑group offsets/sizes/stats, optional indexes, then the 4‑byte footer length and the PAR1 tail. Note: We use those absolute offsets later (rg_start, rg_end, footer_start).

Parquet file (compressed, ~128–256 MiB)

┌──────────────────────────────────────────────────────────────────────┐
│ DATA SECTION                                                         │
│  ┌─────────────┬─────────────┬─────────────┬───────────────┐         │
│  │ Row Group 0 │ Row Group 1 │ Row Group 2 │      ...      │         │
│  │ (columns → pages → stats per group)                     │         │
│  └─────────────┴─────────────┴─────────────┴───────────────┘         │
│                                                                      │
│ FOOTER                                                               │
│  - schema                                                            │
│  - row-group directory (offsets, sizes, stats)                      │
│  - optional column/offset indexes                                    │
│  - 4‑byte footer length + PAR1 tail                                  │
└──────────────────────────────────────────────────────────────────────┘
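To make the tail concrete, here is a minimal sketch of pulling those offsets down with two small range reads, assuming boto3 and a hypothetical bucket/key: the last 8 bytes give the footer length, and a second ranged GET fetches the footer itself.

```python
import struct

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"                                        # hypothetical
KEY = "mytable/data/2025/10/12/09/f81d4fae.parquet"

# Tail of every Parquet file: [footer][4-byte little-endian footer length]["PAR1"]
tail = s3.get_object(Bucket=BUCKET, Key=KEY, Range="bytes=-8")["Body"].read()
footer_len = struct.unpack("<I", tail[:4])[0]
assert tail[4:] == b"PAR1"

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
footer_start = size - 8 - footer_len

# Second range read brings down just the footer, no data pages
footer_bytes = s3.get_object(
    Bucket=BUCKET, Key=KEY, Range=f"bytes={footer_start}-{size - 9}"
)["Body"].read()
```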

Table shape and read pattern at a glance

A tiny mutable head _latest_manifest (updated with CAS, compare-and-swap) points to an immutable manifest/vN.json. That manifest lists the current set of write‑once Parquet data files. Readers first fetch the head to learn the version, then fetch that manifest and range‑read only the files it lists. This post (unlike the previous post) does not use tombstones; the manifest alone defines visibility.

Head (CAS) → Snapshot → Data files

┌ _latest_manifest ┐
│ {"version": N}   │
└──────┬───────────┘
       │ points to
       ▼
manifest/vN.json
  data_files:
    - data/YYYY/MM/DD/HH/<uuid>.parquet
    - ...
  tombstones: []   # not used here

mytable/
├── _latest_manifest
├── manifest/v00000124.json
└── data/YYYY/MM/DD/HH/<uuid>.parquet
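A reader's path through this layout is short. The sketch below is illustrative (boto3, hypothetical bucket name, manifests shaped like the listing above): fetch the head, resolve the manifest for that version, then touch only the data files it lists, pinning each range read to the object's ETag.

```python
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # hypothetical

# 1) Fetch the tiny mutable head to learn the current version
head = json.loads(
    s3.get_object(Bucket=BUCKET, Key="mytable/_latest_manifest")["Body"].read()
)

# 2) Fetch the immutable manifest for that version
manifest_key = f"mytable/manifest/v{head['version']:08d}.json"
manifest = json.loads(s3.get_object(Bucket=BUCKET, Key=manifest_key)["Body"].read())

# 3) Range-read only the files the manifest lists, pinned to each object's ETag
for rel_key in manifest["data_files"]:
    key = f"mytable/{rel_key}"
    etag = s3.head_object(Bucket=BUCKET, Key=key)["ETag"]
    tail = s3.get_object(
        Bucket=BUCKET,
        Key=key,
        Range="bytes=-8",   # start with the footer tail, as above
        IfMatch=etag,       # the scan fails fast if the object is rewritten mid-read
    )["Body"].read()
```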

Let's begin: What we are building

Say we need to delete only a subset of the data in a Parquet file, inside one row group, for a customer request. The objective is to make that change visible atomically, without rebuilding the entire object and without downloading unchanged bytes. We keep the system model from the previous post: immutable data files and a CAS-updated manifest that serializes visibility. We assemble a new object with MPU (multipart upload) and UploadPartCopy: we download only the target row group and the footer, remove the rows, re-encode that group, and upload the edited bytes plus a new Parquet footer. Unchanged prefix and suffix bytes are copied entirely within S3.

Layout and sizes

We focus on 128–256 MiB Parquet files with many row groups of roughly 8–16 MiB each. This size keeps scan efficiency high and still lets a single edit localize to one row group.

mytable/
  _latest_manifest                     # small, mutable CAS pointer (version + ETag)
  data/
    2025/10/12/09/f81d4fae.parquet
    2025/10/12/10/a1b2c3d4.parquet
    ...

Head points to the current list of data files. Mutations rewrite only the file that contains affected rows and then update _latest_manifest last.

Mutable head + data files

┌ _latest_manifest (CAS) ┐
│ {"version": 123}       │
└──────┬─────────────────┘
       │ points to
       ▼
manifest/v00000123.json → {
  data_files: [
    "data/2025/10/12/09/f81d4fae.parquet",
    "data/2025/10/12/10/a1b2c3d4.parquet",
    ...
  ],
  tombstones: [...]
}

Rewrite only the object that contains the row group, then CAS head to a list with the new object key.

We pack data into 128–256 MiB Parquet objects with many row groups (≈8–16 MiB each). On deletion, we splice a new object via MPU by copying the prefix/suffix and uploading the edited row group and a new footer.

Implementation: Step-by-step replacement of row group

Readers fetch _latest_manifest once, then read only what it references. They never combine old and new.

Byte range view

bytes: 0                     rg_start               rg_end               footer_start                end
       │── PREFIX (copy) ────│█ edited RG (upload) █│── SUFFIX (copy) ───│█ new footer + len + PAR1 █│

How the object is assembled

We use offsets discovered from the Parquet footer:

The prefix (bytes=0…rg_start-1) and suffix (bytes=rg_end…footer_start-1) are copied with UploadPartCopy via CopySourceRange. The final part contains the regenerated Parquet footer bytes followed by the 4‑byte footer length and the trailing PAR1, and is uploaded last.
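One way to derive those offsets is from pyarrow's footer metadata; this is a sketch, and it assumes column chunks are laid out contiguously in column order (which is how common writers arrange them):

```python
import pyarrow.parquet as pq

# Footer metadata only; the handle can be a local copy or a range-backed filesystem
md = pq.ParquetFile("f81d4fae.parquet").metadata
k = 2                      # index of the row group we intend to rewrite
rg = md.row_group(k)

def chunk_start(col):
    # A column chunk begins at its dictionary page if present, else at its first data page
    return col.dictionary_page_offset if col.has_dictionary_page else col.data_page_offset

first_col = rg.column(0)
last_col = rg.column(rg.num_columns - 1)

rg_start = chunk_start(first_col)
rg_end = chunk_start(last_col) + last_col.total_compressed_size
# footer_start comes from the file size and the 8-byte tail, as in the footer read earlier
```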

S3 MPU Assembly (new key)

┌───────────────────────────────────────────────────────────────────────┐
│ Existing Object                                                       │
│ [           PREFIX            ][   ROW GROUP   ][      SUFFIX       ] │
│  bytes 0…rg_start-1           rg_start…rg_end   rg_end…footer_start   │
└───────────────────────────────────────────────────────────────────────┘

Parts:
  A: UploadPartCopy(PREFIX)       ← server-side, no egress
  B: UploadPart(edited RG)        ← ≥ 5 MiB (non-last part rule)
  C: UploadPartCopy(SUFFIX)       ← server-side, no egress
  D: UploadPart(new FOOTER+len+PAR1)  ← last part can be tiny

CompleteMultipartUpload → new object assembled; visibility flips on manifest CAS.

Client and S3 sequence

Client                    S3
  │   GET footer            │
  │────────────────────────▶│ discover ranges & stats
  │                         │
  │   Create MPU            │
  │────────────────────────▶│
  │   Part A: Copy PREFIX   │
  │────────────────────────▶│ (UploadPartCopy)
  │   Part B: Upload RG     │
  │────────────────────────▶│ (UploadPart ≥ 5 MiB)
  │   Part C: Copy SUFFIX   │
  │────────────────────────▶│ (UploadPartCopy)
  │   Part D: Upload FOOTER │
  │────────────────────────▶│ (UploadPart last)
  │   Complete MPU          │
  │────────────────────────▶│ assemble new object (not visible yet)
  │                         │
  │   Reader uses If-Match  │
  │◀────────────────────────│ range `GET`s pinned to `ETag`

This sequence captures the exact order of operations. You discover ranges from the footer, start a multipart upload (MPU), copy the prefix, upload the edited row group, copy the suffix, upload the footer as the last part, and then complete the upload. Readers pin to a single ETag so range requests stay consistent during a scan.
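Put together, the assembly is a handful of boto3 calls. The sketch below uses hypothetical keys and assumes rg_start, rg_end, footer_start, the re-encoded row-group bytes, and the rebuilt footer bytes are already in hand, and that the prefix and suffix are each at least 5 MiB so the non-last-part rule holds.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"                                        # hypothetical
SRC = "mytable/data/2025/10/12/09/f81d4fae.parquet"         # object being edited
DST = "mytable/data/2025/10/12/09/0c6f1e2a.parquet"         # new immutable key

mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=DST)
upload_id = mpu["UploadId"]
parts = []

def record(part_number, resp):
    etag = resp["CopyPartResult"]["ETag"] if "CopyPartResult" in resp else resp["ETag"]
    parts.append({"PartNumber": part_number, "ETag": etag})

# Part 1: server-side copy of the unchanged prefix [0, rg_start)
record(1, s3.upload_part_copy(
    Bucket=BUCKET, Key=DST, UploadId=upload_id, PartNumber=1,
    CopySource={"Bucket": BUCKET, "Key": SRC},
    CopySourceRange=f"bytes=0-{rg_start - 1}"))

# Part 2: the re-encoded row group (>= 5 MiB because it is not the last part)
record(2, s3.upload_part(
    Bucket=BUCKET, Key=DST, UploadId=upload_id, PartNumber=2, Body=edited_rg_bytes))

# Part 3: server-side copy of the unchanged suffix [rg_end, footer_start)
record(3, s3.upload_part_copy(
    Bucket=BUCKET, Key=DST, UploadId=upload_id, PartNumber=3,
    CopySource={"Bucket": BUCKET, "Key": SRC},
    CopySourceRange=f"bytes={rg_end}-{footer_start - 1}"))

# Part 4: rebuilt footer + 4-byte length + PAR1, uploaded last (the last part may be small)
record(4, s3.upload_part(
    Bucket=BUCKET, Key=DST, UploadId=upload_id, PartNumber=4, Body=new_footer_bytes))

# Assemble the new object; nothing is visible to readers until the manifest CAS flips
s3.complete_multipart_upload(
    Bucket=BUCKET, Key=DST, UploadId=upload_id, MultipartUpload={"Parts": parts})
```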

Re‑encoding a row group means decoding the pages for that group, removing the targeted rows, and writing new pages for each column using the same Parquet encodings. Page boundaries can change when rows are removed. That is expected and supported by the format. A new footer is then written that references the new absolute offsets, the new sizes, the updated statistics, and optional index sections for the edited group. Building a footer is not an unconventional trick. Parquet libraries in Java, C++, and Rust expose programmatic writers that let you construct a file from buffers and metadata. The careful part is computing the correct absolute offsets and keeping index offsets in sync if you use column and offset indexes.
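As a rough sketch of the decode-filter-re-encode step using pyarrow (the customer_id predicate is hypothetical; splicing the resulting data pages and per-column metadata into a hand-built footer still means working at the level of the Parquet thrift structures, which the high-level writer does not expose):

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

pf = pq.ParquetFile("f81d4fae.parquet")
k = 2  # row group to edit

# Decode only the target row group
table = pf.read_row_group(k)

# Drop the targeted rows (hypothetical predicate)
keep = pc.invert(pc.is_in(table["customer_id"], value_set=pa.array([42], pa.int64())))
filtered = table.filter(keep)

# Re-encode with settings that match the rest of the file so the edited group blends in
buf = pa.BufferOutputStream()
pq.write_table(
    filtered,
    buf,
    compression="snappy",                  # match the original codec
    use_dictionary=True,
    row_group_size=max(len(filtered), 1),  # keep it as a single row group
    write_statistics=True,
)
mini_file = buf.getvalue().to_pybytes()
# mini_file is itself a tiny one-group Parquet file; its data pages and per-column
# metadata are what get spliced into the new object and the rebuilt footer.
```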

Conceptual implementation

The conceptual implementation would then look something like this:

  1. Read Parquet footer and locate the target row_group_idx entry.
  2. Range GET the row group byte window [rg_start, rg_end).
  3. Decode pages for each column in the group; remove targeted rows; re‑encode pages using the same encodings/compression.
  4. Compute new_rg_size for the edited group; delta = new_rg_size - (rg_end - rg_start).
  5. Rebuild footer:
    • Update total row count.
    • Update per‑column stats for the edited group.
    • Keep prior groups unchanged.
    • Shift absolute offsets for all following groups by delta.
    • Optionally rebuild column/offset indexes for the edited group; others are unchanged.
  6. Assemble the object with MPU: copy prefix [0, rg_start), upload edited RG, copy suffix [rg_end, footer_start), upload new footer+len+PAR1 last.

Offset math

Let RGi.offset be the absolute byte offset of row group i in the file, and let k be the edited group. With delta = new_rg_size - (rg_end - rg_start), groups i ≤ k keep their offsets, and every group i > k moves to RGi.offset + delta.

Stats and optional index offsets for i > k shift by delta if they are stored as absolute offsets; if stored relative to group start they remain the same.
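Expressed over the Parquet thrift metadata, the shift is a small loop. A sketch, assuming the field names from the format's thrift definitions and that your writer populated them:

```python
def shift_following_groups(file_metadata, k, delta):
    """Shift absolute byte offsets for every row group after the edited one by delta."""
    for i, rg in enumerate(file_metadata.row_groups):
        if i <= k:
            continue                      # earlier groups and the edited group stay put
        if getattr(rg, "file_offset", None) is not None:
            rg.file_offset += delta
        for chunk in rg.columns:
            meta = chunk.meta_data
            meta.data_page_offset += delta
            if meta.dictionary_page_offset is not None:
                meta.dictionary_page_offset += delta
            if meta.index_page_offset is not None:
                meta.index_page_offset += delta
            # Column/offset indexes, when written, are also addressed absolutely
            if chunk.column_index_offset is not None:
                chunk.column_index_offset += delta
            if chunk.offset_index_offset is not None:
                chunk.offset_index_offset += delta
```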

API specifics

Use CopySourceRange with the form bytes=first-last for prefix and suffix copies; for example, bytes=0-1048575 copies the first 1 MiB. The source object must be larger than 5 MiB for a range copy, and every non-last part, including copied parts, must meet the 5 MiB minimum. Here MiB means mebibyte, i.e. 1,048,576 bytes.
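One detail worth encoding once: the copy range is inclusive on both ends, so a half-open byte window [start, end) maps to bytes=start-(end-1). A tiny illustrative helper avoids the off-by-one:

```python
def copy_source_range(start: int, end: int) -> str:
    """Format a half-open [start, end) byte window as an inclusive S3 CopySourceRange."""
    return f"bytes={start}-{end - 1}"

copy_source_range(0, 1_048_576)   # 'bytes=0-1048575' -> the first 1 MiB
```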

Why Parquet still works here

Parquet encodes columns inside row groups and stores a footer with offsets and statistics. When you rebuild one row group you change only the bytes for that group across all columns, you shift the absolute offsets for the groups that follow by the size delta, and you update the total row count and the per column statistics for the edited group. Everything else is copied as is with UploadPartCopy and range reads stay large and coalesced.

Caveats and gotchas

S3 and MPU

Parquet specifics

Operational

Atomicity

The mutable _latest_manifest gives snapshot isolation. Readers see a coherent table version as of the head they fetched; new writes are invisible mid-scan. For immediate erasure in versioned buckets, delete prior versions or use per‑object encryption and drop old keys.

A mid‑file row‑group edit behaves the same as any rewrite: the new object is assembled atomically; readers pin scans with If-Match on the observed ETag.
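A sketch of the commit itself, assuming S3 conditional writes (IfNoneMatch/IfMatch on PutObject, supported in recent boto3 releases) and the head/manifest shapes used above:

```python
import json

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # hypothetical

def commit(new_manifest: dict, expected_head_etag: str) -> bool:
    """Publish a new snapshot, then CAS the head; False means a concurrent commit won."""
    version = new_manifest["version"]

    # 1) Write the immutable snapshot first; it stays invisible until the head moves
    s3.put_object(
        Bucket=BUCKET,
        Key=f"mytable/manifest/v{version:08d}.json",
        Body=json.dumps(new_manifest).encode(),
        IfNoneMatch="*",                 # the snapshot key must not already exist
    )

    # 2) Flip the head only if nobody moved it since we read it
    try:
        s3.put_object(
            Bucket=BUCKET,
            Key="mytable/_latest_manifest",
            Body=json.dumps({"version": version}).encode(),
            IfMatch=expected_head_etag,  # compare-and-swap on the head's ETag
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "PreconditionFailed":
            return False                 # lost the race; re-read the head and retry
        raise
```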

View of the head swap

Before commit
_latest_manifest → v123
parquet object @ ETag A

After commit
_latest_manifest → v124
parquet object @ ETag B

Readers pin to a single ETag while scanning

When to choose what

When this fits and when it doesn’t

This pattern fits append‑heavy analytics on S3 where scans project a few columns and you sometimes need immediate erasure of specific rows. It works well if you're comfortable owning a small manifest and a simple GC job. It's not a great fit for OLTP‑style point lookups without a PK locator, high‑churn per‑row updates, or heavy schema evolution and catalog features; in those cases, a full metadata layer like Iceberg or Delta is probably the better choice.

Closing

This keeps the parts small and sharp: immutable data objects, a tiny CAS head, and localized rewrites using MPU + UploadPartCopy. You get Parquet’s scan speed, S3’s range‑read economics, and immediate hard deletes—without a coordinating service. Taken together with the previous post, I’d say these concepts get you close to a database built on modern cloud‑storage APIs for the right patterns while keeping operations simple (mostly) :D.

last modified October 12, 2025