How columnar storage works

Columnar storage writes each column of a table to disk as its own contiguous run of values, split into compressed blocks tagged with min/max metadata. A query reads only the column blocks it needs, decompresses them, and operates on them. That cuts I/O to the columns a query touches and lets same-type values compress 5–10× tighter than the mixed types in a row.

TL;DR #

Columnar storage keeps each column contiguous on disk in compressed blocks; a query reads only the columns, and only the blocks, it needs.
Three mechanisms do the work: per-block compression, per-block min/max "zone maps", and predicate pushdown that skips blocks before decompressing them.
Two physical layouts are in production use: pure columnar (one file per column) and PAX/hybrid (Parquet, ORC).
ClickHouse MergeTree is a storage engine: immutable parts, one .bin file per column in Wide format, a sparse primary index over 8192-row granules, and background merges.
Apache Parquet is a self-describing file format: file → row groups → column chunks → pages, with statistics in a footer. ClickHouse reads and writes it directly, so the two coexist.

What is columnar storage? #

Columnar storage is a physical layout that keeps the values of each column together on disk, rather than keeping each row's fields together. It suits analytical queries that scan a few columns across many rows, and it trades cheap full-row reads for cheap column scans and high compression.

See what a columnar database is for the concept and its trade-offs, and the head-to-head comparison of row- and column-oriented databases for when to choose each. The rest of this page goes under the hood: how those columns are actually written, indexed, compressed, and read back.

How does columnar storage work? #

Columnar storage works by writing each column as a contiguous sequence of values, chunked into compressed blocks that each carry min/max metadata. To answer a query, the engine reads only the blocks of the columns referenced, uses the min/max values to skip blocks that cannot match the filter, decompresses the survivors, and operates on them directly.

The difference is visible at the byte level. A row store interleaves every field of a row; a column store separates them; a PAX/hybrid format does both at once:

Row store (slotted page): rows kept whole, fields interleaved
┌───────────────────────────────────────────────────────────┐
│ (1,"alice",NYC,42)(2,"bob",LA,37)(3,"cara",SF,29)(4,"dan"…  │
└───────────────────────────────────────────────────────────┘
  a SUM(age) scan still pulls in the name and city bytes it never uses

Pure columnar: one contiguous run per column
  id   │ 1 2 3 4 5 …
  name │ alice bob cara dan …
  city │ NYC LA SF NYC …
  age  │ 42 37 29 51 …        ← SUM(age) reads only this run

PAX / hybrid (Parquet, ORC): a row group is a horizontal slice,
columns stored contiguously *within* each group
┌──── row group 1 ────────┐  ┌──── row group 2 ────────┐
│ id  │ 1 2 3 …           │  │ id  │ 1001 1002 …       │
│ name│ alice bob …       │  │ name│ …                 │
│ city│ NYC LA …          │  │ city│ …                 │
│ age │ 42 37 …           │  │ age │ …                 │
└─────────────────────────┘  └─────────────────────────┘

The read path: skip, decompress, materialise #

The read path is built to touch as little data as possible. The engine first uses each block's min/max metadata (a "zone map") to discard blocks that cannot contain a matching value, before any decompression happens. This is predicate pushdown: a filter like WHERE age > 60 skips every block whose stored maximum is 60 or below. Only the surviving blocks are decompressed. Many engines then apply late materialisation: read the filter columns first, find which rows survive, and read the remaining columns only for those rows. The combined effect is that a query reading 2 of 50 columns, with a selective filter, may decode a single-digit percentage of the table's bytes.

Why columns compress: encoding before the codec #

Compression piggybacks on the layout. A run of values from one column shares a type and usually a narrow range, so it compresses far better than a row of mixed types. Columnar engines apply lightweight, type-aware encodings first (dictionary encoding for low-cardinality strings, run-length encoding (RLE) for repeated values, delta encoding for sorted or monotonic numbers, and bit-packing for small integers) and only then run a general-purpose codec such as LZ4 or ZSTD over the result. Encoding before the codec is why columnar compression ratios routinely beat row-store ratios on the same data.

Two physical layouts: pure columnar vs PAX #

Pure columnar stores each column as its own object, decoupled from the others. PAX (Partition Attributes Across), introduced by Ailamaki et al. in their VLDB 2001 paper Weaving Relations for Cache Performance, keeps a horizontal slice of rows together as a unit but stores the columns contiguously within that slice. Apache Parquet and ORC are PAX formats: the row group is the horizontal slice, the column chunk is the contiguous column inside it. Both layouts share the same per-column read and compression mechanics; they differ in how a "unit of data" maps to a file.

How does ClickHouse MergeTree store columns on disk? #

ClickHouse MergeTree stores a table as a set of immutable parts. In Wide-format parts each column is a separate .bin file, with rows sorted by the table's ORDER BY key and grouped into granules of 8192 rows. A sparse primary index holds one mark per granule, so reads jump straight to the granules a query needs instead of scanning row by row.

table
└─ part  (immutable, rows sorted by ORDER BY)        many parts per table
   ├─ id.bin      ┐
   ├─ name.bin    │  one .bin per column (Wide format)
   ├─ city.bin    │  data split into compressed blocks, up to ~1 MB each
   ├─ age.bin     ┘
   ├─ primary.idx    sparse index: one mark per granule (every 8192 rows)
   └─ *.mrk          marks → byte offsets into each .bin file

Parts, granules, and the sparse index #

A part is the unit MergeTree reads, sorts, and merges. Because rows inside a part are sorted by the ORDER BY key, the primary index can be sparse: it records the key value at the start of each 8192-row granule (the default index_granularity) rather than one entry per row, so the whole index fits in memory even for billion-row tables. To answer a query, ClickHouse uses the sparse primary index plus any data-skipping indexes to select which granules can contain matching rows, then reads only the relevant columns' compressed blocks for those granules. The on-disk details (parts, granules, marks, and the sparse index) are documented in the MergeTree engine reference.

Writes: inserts, merges, and mutations #

Writes never edit data in place. Each INSERT creates a new part; a background process continually combines small parts into larger sorted ones (the merge in MergeTree). Because parts are immutable, changing existing data is handled above this layer: standard SQL UPDATE and DELETE FROM are well suited to individual or scattered rows, while a traditional ALTER ... UPDATE/DELETE mutation rewrites the affected parts for bulk operations such as backfills or schema-level corrections. (For high-volume upserts and CDC, ReplacingMergeTree ingests changes as new rows and reconciles them during merges, with the FINAL keyword giving deduplicated results at query time.) This append-then-merge design is what lets ClickHouse sustain high insert rates while keeping data sorted for fast reads.

Type-aware codecs #

MergeTree column codecs are composable and type-aware. Specialised codecs such as Delta, DoubleDelta, Gorilla, T64, and GCD can be chained ahead of a general-purpose codec, LZ4 by default or ZSTD, as in CODEC(Delta, ZSTD) for a monotonically increasing timestamp column. Matching the codec to the column's data shape, then compressing, is the same encode-before-codec pattern that all columnar engines use, exposed as a per-column setting. For codec selection and the ratios each achieves, see the deep dive on database compression encodings and codecs.

How does Parquet store columns on disk? #

Apache Parquet is a self-describing columnar file format. A file divides into row groups; each row group holds one column chunk per column; each chunk splits into pages. A footer written at the end of the file stores the schema plus min/max statistics, so a reader opens the footer first and skips row groups and pages that cannot match a filter.

file.parquet
├─ row group 1   (recommended 512 MB – 1 GB)
│  ├─ column chunk: id    ┐  one chunk per column per row group
│  │   ├─ page  ┐  default ~1 MB cap (spec recommends 8 KB; often smaller)
│  │   └─ page  ┘  dictionary / RLE / delta encoded
│  ├─ column chunk: name
│  └─ column chunk: age
├─ row group 2 …
└─ footer   schema + per-row-group and per-page min·max statistics

The Parquet file-format spec recommends large row groups of 512 MB to 1 GB (some writers default to 128 MB to align with HDFS block sizes). Within a chunk, pages are encoded with dictionary, RLE, and delta encodings before a general-purpose codec. Predicate pushdown works off the metadata: an engine reads the footer, then uses row-group and page-level min/max statistics, plus dictionaries where present, to skip whole chunks and pages before decoding any values.

Parquet and MergeTree are complementary, not rivals. ClickHouse reads and writes Parquet directly, for example with the s3() table function over object storage. A ClickHouse data lake setup keeps cold data as Parquet in a lake and hot, query-serving data in MergeTree. For a format-by-format comparison of Parquet, ORC, and Arrow, see columnar storage formats explained.

FAQ #

Is MergeTree the same as Parquet? No. MergeTree is a storage engine inside a running ClickHouse server; Parquet is a static file format. Both store data by column, but MergeTree adds a sort order, a sparse primary index, and mutability through merges, while Parquet is immutable and engine-agnostic.

What is a row group? A row group is a horizontal slice of a Parquet file, a batch of rows whose columns are each stored contiguously inside that slice. The spec recommends 512 MB–1 GB per row group. It is the unit an engine reads or skips using footer statistics.

What is a granule in ClickHouse? A granule is a block of rows inside a MergeTree part, 8192 rows by default, set by index_granularity. The sparse primary index stores one mark per granule, so ClickHouse locates and reads whole granules rather than individual rows.

How does columnar storage compress data? Storing same-type values together lets lightweight encodings (dictionary, run-length, delta, bit-packing) run before a general-purpose codec such as LZ4 or ZSTD. Adjacent similar values compress far better than the mixed types in a row, often 5–10× tighter.

Is Parquet a database? No. Parquet is a file format with no query engine, server, or indexes of its own. Engines such as ClickHouse, DuckDB, and Spark read and write Parquet, but the format itself only defines how bytes are laid out on disk.

Frequently asked questions

Is MergeTree the same as Parquet?

No. MergeTree is a storage engine inside a running ClickHouse server; Parquet is a static file format. Both store data by column, but MergeTree adds a sort order, a sparse primary index, and mutability through merges, while Parquet is immutable and engine-agnostic.

What is a row group?

A row group is a horizontal slice of a Parquet file, a batch of rows whose columns are each stored contiguously inside that slice. The spec recommends 512 MB–1 GB per row group. It is the unit an engine reads or skips using footer statistics.

What is a granule in ClickHouse?

A granule is a block of rows inside a MergeTree part, 8192 rows by default, set by index_granularity. The sparse primary index stores one mark per granule, so ClickHouse locates and reads whole granules rather than individual rows.

How does columnar storage compress data?

Storing same-type values together lets lightweight encodings (dictionary, run-length, delta, bit-packing) run before a general-purpose codec such as LZ4 or ZSTD. Adjacent similar values compress far better than the mixed types in a row, often 5–10× tighter.

Is Parquet a database?

No. Parquet is a file format with no query engine, server, or indexes of its own. Engines such as ClickHouse, DuckDB, and Spark read and write Parquet, but the format itself only defines how bytes are laid out on disk.