
GPUParquetWriter

GPUParquetWriter is a high-performance Rust tool designed to rewrite and optimize Parquet files specifically for GPU-accelerated analytics. By transforming "CPU-optimized" layouts (such as DuckDB's default output) into "GPU-optimized" ones, it unlocks the full potential of GPUDirect Storage and massive parallelism.

Optimization Logic

  • More Pages $\rightarrow$ Higher SM Utilization.
  • Larger Row Groups $\rightarrow$ Saturated I/O Bandwidth.
  • Better Encoding $\rightarrow$ Free Bandwidth Boost.
  • Skip Unnecessary Compression $\rightarrow$ GPU Freed for Query Work.
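
To check whether an existing file already follows these rules, you can inspect its footer metadata. Below is a minimal sketch assuming the arrow-rs parquet crate; it is a standalone checker, not part of the rewriter, and data.parquet is a placeholder path. It prints row-group sizes plus the per-column compression and encodings.

use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "data.parquet" is a placeholder; point it at any Parquet file.
    let reader = SerializedFileReader::new(File::open("data.parquet")?)?;
    let meta = reader.metadata();

    println!("row groups: {}", meta.num_row_groups());
    for i in 0..meta.num_row_groups() {
        let rg = meta.row_group(i);
        // Large row groups (chunks >8 MB) keep GPUDirect I/O requests big.
        println!("RG {i}: {} rows, {} bytes", rg.num_rows(), rg.total_byte_size());
        for col in rg.columns() {
            // Compression and encodings per column chunk, as chosen by the writer.
            println!(
                "  {}: compression={:?}, encodings={:?}",
                col.column_path().string(),
                col.compression(),
                col.encodings()
            );
        }
    }
    Ok(())
}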

GPU Parquet Reader

To achieve the 129 GB/s throughput shown below, the rewritten files must be consumed by a parallelized, GPU-native reader. For example:

PystachIO: Efficient Distributed GPU Query Processing with PyTorch over Fast Networks & Fast Storage

This tool was used to generate all of the optimized datasets for that paper.

Benchmarks

Tested on NVIDIA A100 (40GB) with 4x NVMe SSDs (26 GB/s aggregate) using GPUDirect Storage.

The following plot demonstrates the cumulative performance gains of each optimization step, achieving a 129x speedup over the baseline.

[Plot: cumulative throughput gains per optimization stage]
| Optimization Stage | Throughput | Command Flag | Strategy/Value | Mechanism of Gain |
| --- | --- | --- | --- | --- |
| CPU-Optimized Parquet | ~1.0 GB/s | generated from DuckDB | None | Limited by small I/O & idle SMs. |
| + Rewrite # Pages | ~50 GB/s | --page-count | 200 | SM saturation (more active threads). |
| + Rewrite RG Size | ~70 GB/s | --row-group-precise-row-count | 10M | I/O saturation (chunks >8 MB). |
| + Better Encoding | ~115 GB/s | --encodings | parquet_v2 | Free bandwidth (data density). |
| + No Unnecessary Comp. | ~129 GB/s | --compression | uncompressed | Compute freedom (zero overhead). |

Detailed Configuration Strategies

How to optimize Row Groups?

Larger chunks reduce per-chunk metadata overhead and keep individual I/O requests large. The following options are mutually exclusive:

  • --row-group-precise-row-count <N> (Recommended): Forces exactly N rows per group.
  • --row-group-size <SIZE>: Sets approximate target size (e.g., 1G, 256M).
  • --row-group-count <N>: Splits file into N equal parts.
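
For intuition on how a byte-size target like --row-group-size 1G relates to a row count, here is a hypothetical back-of-the-envelope conversion; rows_per_group illustrates the arithmetic only and is not the tool's actual sizing heuristic.

// Hypothetical sketch: convert a byte-size target (e.g. --row-group-size 1G)
// into an approximate row count, given the table's total size and row count.
fn rows_per_group(target_bytes: u64, total_bytes: u64, total_rows: u64) -> u64 {
    let avg_row_bytes = (total_bytes as f64 / total_rows as f64).max(1.0);
    ((target_bytes as f64 / avg_row_bytes) as u64).max(1)
}

fn main() {
    // Example: a ~6 GiB table with 60M rows is ~107 bytes/row, so a 1 GiB
    // target works out to roughly 10M rows per group, in line with the
    // --row-group-precise-row-count 10M used in the benchmarks above.
    let n = rows_per_group(1 << 30, 6 * (1 << 30), 60_000_000);
    println!("rows per group ≈ {n}");
}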

How to select Encodings?

  • --encodings parquet_v2: (Default) Uses modern, efficient encodings (DELTA_BINARY_PACKED, BYTE_STREAM_SPLIT).
  • --encodings parquet_v1: Restricts to legacy encodings (PLAIN, RLE_DICTIONARY).
  • --encodings plain: Forces raw PLAIN encoding.
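
These strategies map plausibly onto the writer-version and encoding settings of the arrow-rs parquet crate, which a Rust Parquet rewriter would typically build on (an assumption, not confirmed here). The sketch below shows that assumed mapping; the TPC-H column names (l_orderkey, l_extendedprice) are just examples, and the tool's real per-column logic may differ.

use parquet::basic::Encoding;
use parquet::file::properties::{WriterProperties, WriterVersion};
use parquet::schema::types::ColumnPath;

fn main() {
    // Roughly what --encodings parquet_v1 implies: writer version 1 with
    // dictionary (RLE_DICTIONARY) encoding and PLAIN as the fallback.
    let _v1 = WriterProperties::builder()
        .set_writer_version(WriterVersion::PARQUET_1_0)
        .set_dictionary_enabled(true)
        .build();

    // Roughly what --encodings parquet_v2 implies: writer version 2 with
    // the denser V2 encodings, e.g. DELTA_BINARY_PACKED for integers and
    // BYTE_STREAM_SPLIT for floating-point columns.
    let _v2 = WriterProperties::builder()
        .set_writer_version(WriterVersion::PARQUET_2_0)
        .set_dictionary_enabled(false)
        .set_column_encoding(ColumnPath::from("l_orderkey"), Encoding::DELTA_BINARY_PACKED)
        .set_column_encoding(ColumnPath::from("l_extendedprice"), Encoding::BYTE_STREAM_SPLIT)
        .build();
}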

How to choose the best Compression?

The rewriter benchmarks multiple codecs per column and picks the best result for each.

  • --compression lightweight: (Default) Tests uncompressed and snappy.
  • --compression optimal: Brute-force tests all supported algorithms (zstd, lz4, gzip, brotli).
  • Custom List: e.g., --compression "uncompressed,zstd(3)" allows you to target specific algorithms.
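
The per-column benchmarking idea is easy to reproduce independently: write the same data once per codec and compare the resulting file sizes. A minimal sketch, assuming the arrow and parquet crates with the zstd feature enabled; probe.parquet is a scratch path and the column is synthetic.

use std::{fs::File, sync::Arc};
use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::{Compression, ZstdLevel};
use parquet::file::properties::WriterProperties;

// Write the same batch with one codec and return the resulting file size.
fn size_with(codec: Compression, batch: &RecordBatch, path: &str) -> u64 {
    let props = WriterProperties::builder().set_compression(codec).build();
    let mut writer =
        ArrowWriter::try_new(File::create(path).unwrap(), batch.schema(), Some(props)).unwrap();
    writer.write(batch).unwrap();
    writer.close().unwrap();
    std::fs::metadata(path).unwrap().len()
}

fn main() {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let col = Int64Array::from_iter_values(0..1_000_000);
    let batch = RecordBatch::try_new(schema, vec![Arc::new(col)]).unwrap();

    for (name, codec) in [
        ("uncompressed", Compression::UNCOMPRESSED),
        ("snappy", Compression::SNAPPY),
        ("zstd(3)", Compression::ZSTD(ZstdLevel::try_new(3).unwrap())),
    ] {
        println!("{name}: {} bytes", size_with(codec, &batch, "probe.parquet"));
    }
}

On already-dense encodings, general-purpose compression often shrinks the data little while still costing decode work, which is exactly why the benchmark table above ends at --compression uncompressed.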

CLI Reference

# Basic Example
cargo run --release -- --input <FILE> --output <FILE> [OPTIONS]

Essential Options:

  • --input, --output: Paths to source and destination files.
  • --page-count <N>: Target pages per row group (Default: 0 [auto]).
  • --row-group-precise-row-count <N>: Target exact rows per group.
  • --compression <STRATEGY>: lightweight (default), optimal, or specific codecs.
  • --encodings <STRATEGY>: parquet_v2 (default), parquet_v1, or plain.
  • --optimizer <STRATEGY>: brute-force (default) checks every combination; threshold uses heuristics for speed.
  • --full-report-path <PATH>: Exports a CSV detailing the decision process for every column.

For full options: cargo run --release -- --help

Appendix: Reproduction Commands

# Line 1: CPU-Optimized Baseline from DuckDB
# Generate the TPC-H tables with DuckDB (https://duckdb.org/docs/stable/core_extensions/tpch),
# then export them to Parquet with the default configuration.

# Line 2: Rewrite # Pages (Target 200 pages per RG)
./target/release/parquet-column-grained-rewriter \
  --input raw.parquet --output step2_pages.parquet \
  --page-count 200 \
  --compression snappy --encodings parquet_v1 --optimizer threshold

# Line 3: Rewrite RG Size (Target 10M rows per RG)
./target/release/parquet-column-grained-rewriter \
  --input raw.parquet --output step3_rg_size.parquet \
  --page-count 200 --row-group-precise-row-count 10M \
  --compression snappy --encodings parquet_v1 --optimizer threshold

# Line 4: Better Encoding (Parquet V2 with 0.0 threshold to force efficient encodings)
./target/release/parquet-column-grained-rewriter \
  --input raw.parquet --output step4_encoding.parquet \
  --page-count 200 --row-group-precise-row-count 10M \
  --compression snappy --encodings parquet_v2 \
  --optimizer threshold --encoding-threshold 0.0

# Line 5: No Unnecessary Compression (Uncompressed + Parquet V2)
./target/release/parquet-column-grained-rewriter \
  --input raw.parquet --output step5_optimal.parquet \
  --page-count 200 --row-group-precise-row-count 10M \
  --compression uncompressed --encodings parquet_v2 \
  --optimizer threshold --compression-threshold 16.0