
GPUParquetWriter

GPUParquetWriter is a high-performance Rust tool designed to rewrite and optimize Parquet files specifically for GPU-accelerated analytics. By transforming "CPU-optimized" layouts (such as DuckDB's default output) into "GPU-optimized" ones, it unlocks the full potential of GPUDirect Storage and massive parallelism.

Optimization Logic

  • More Pages $\rightarrow$ Higher SM Utilization.
  • Larger Row Groups $\rightarrow$ Saturated I/O Bandwidth.
  • Better Encoding $\rightarrow$ Free Bandwidth Boost.
  • Skip Unnecessary Compression $\rightarrow$ GPU Freed for Query Work.
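
To check whether an existing file already follows these rules, you can inspect its footer metadata. Below is a minimal sketch assuming the arrow-rs parquet crate; it is a standalone checker, not part of the rewriter, and data.parquet is a placeholder path. It prints row-group sizes plus the per-column compression and encodings.

use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "data.parquet" is a placeholder; point it at any Parquet file.
    let reader = SerializedFileReader::new(File::open("data.parquet")?)?;
    let meta = reader.metadata();

    println!("row groups: {}", meta.num_row_groups());
    for i in 0..meta.num_row_groups() {
        let rg = meta.row_group(i);
        // Large row groups (chunks >8 MB) keep GPUDirect I/O requests big.
        println!("RG {i}: {} rows, {} bytes", rg.num_rows(), rg.total_byte_size());
        for col in rg.columns() {
            // Compression and encodings per column chunk, as chosen by the writer.
            println!(
                "  {}: compression={:?}, encodings={:?}",
                col.column_path().string(),
                col.compression(),
                col.encodings()
            );
        }
    }
    Ok(())
}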

GPU Parquet Reader

To achieve the 129 GB/s throughput shown below, the rewritten files must be consumed by a parallelized, GPU-native reader. For example:

PystachIO: Efficient Distributed GPU Query Processing with PyTorch over Fast Networks & Fast Storage

This tool was used to generate all of the optimized datasets for that paper.

Benchmarks

Tested on NVIDIA A100 (40GB) with 4x NVMe SSDs (26 GB/s aggregate) using GPUDirect Storage.

The following plot demonstrates the cumulative performance gains of each optimization step, achieving a 129x speedup over the baseline.

[Plot: cumulative throughput gains per optimization stage]
| Optimization Stage | Throughput | Command Flag | Strategy/Value | Mechanism of Gain |
| --- | --- | --- | --- | --- |
| CPU-Optimized Parquet | ~1.0 GB/s | generated from DuckDB | None | Limited by small I/O & idle SMs. |
| + Rewrite # Pages | ~50 GB/s | --page-count | 200 | SM saturation (more active threads). |
| + Rewrite RG Size | ~70 GB/s | --row-group-precise-row-count | 10M | I/O saturation (chunks >8 MB). |
| + Better Encoding | ~115 GB/s | --encodings | parquet_v2 | Free bandwidth (data density). |
| + No Unnecessary Comp. | ~129 GB/s | --compression | uncompressed | Compute freedom (zero overhead). |

Detailed Configuration Strategies

How to optimize Row Groups?

Larger chunks reduce per-chunk metadata overhead and keep individual I/O requests large. The following options are mutually exclusive:

  • --row-group-precise-row-count <N> (Recommended): Forces exactly N rows per group.
  • --row-group-size <SIZE>: Sets approximate target size (e.g., 1G, 256M).
  • --row-group-count <N>: Splits file into N equal parts.
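
For intuition on how a byte-size target like --row-group-size 1G relates to a row count, here is a hypothetical back-of-the-envelope conversion; rows_per_group illustrates the arithmetic only and is not the tool's actual sizing heuristic.

// Hypothetical sketch: convert a byte-size target (e.g. --row-group-size 1G)
// into an approximate row count, given the table's total size and row count.
fn rows_per_group(target_bytes: u64, total_bytes: u64, total_rows: u64) -> u64 {
    let avg_row_bytes = (total_bytes as f64 / total_rows as f64).max(1.0);
    ((target_bytes as f64 / avg_row_bytes) as u64).max(1)
}

fn main() {
    // Example: a ~6 GiB table with 60M rows is ~107 bytes/row, so a 1 GiB
    // target works out to roughly 10M rows per group, in line with the
    // --row-group-precise-row-count 10M used in the benchmarks above.
    let n = rows_per_group(1 << 30, 6 * (1 << 30), 60_000_000);
    println!("rows per group ≈ {n}");
}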

How to select Encodings?

  • --encodings parquet_v2: (Default) Uses modern, efficient encodings (DELTA_BINARY_PACKED, BYTE_STREAM_SPLIT).
  • --encodings parquet_v1: Restricts to legacy encodings (PLAIN, RLE_DICTIONARY).
  • --encodings plain: Forces raw PLAIN encoding.
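
These strategies map plausibly onto the writer-version and encoding settings of the arrow-rs parquet crate, which a Rust Parquet rewriter would typically build on (an assumption, not confirmed here). The sketch below shows that assumed mapping; the TPC-H column names (l_orderkey, l_extendedprice) are just examples, and the tool's real per-column logic may differ.

use parquet::basic::Encoding;
use parquet::file::properties::{WriterProperties, WriterVersion};
use parquet::schema::types::ColumnPath;

fn main() {
    // Roughly what --encodings parquet_v1 implies: writer version 1 with
    // dictionary (RLE_DICTIONARY) encoding and PLAIN as the fallback.
    let _v1 = WriterProperties::builder()
        .set_writer_version(WriterVersion::PARQUET_1_0)
        .set_dictionary_enabled(true)
        .build();

    // Roughly what --encodings parquet_v2 implies: writer version 2 with
    // the denser V2 encodings, e.g. DELTA_BINARY_PACKED for integers and
    // BYTE_STREAM_SPLIT for floating-point columns.
    let _v2 = WriterProperties::builder()
        .set_writer_version(WriterVersion::PARQUET_2_0)
        .set_dictionary_enabled(false)
        .set_column_encoding(ColumnPath::from("l_orderkey"), Encoding::DELTA_BINARY_PACKED)
        .set_column_encoding(ColumnPath::from("l_extendedprice"), Encoding::BYTE_STREAM_SPLIT)
        .build();
}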

How to choose the best Compression?

The rewriter benchmarks multiple codecs per column and picks the best result for each.

  • --compression lightweight: (Default) Tests uncompressed and snappy.
  • --compression optimal: Brute-force tests all supported algorithms (zstd, lz4, gzip, brotli).
  • Custom List: e.g., --compression "uncompressed,zstd(3)" allows you to target specific algorithms.
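
The per-column benchmarking idea is easy to reproduce independently: write the same data once per codec and compare the resulting file sizes. A minimal sketch, assuming the arrow and parquet crates with the zstd feature enabled; probe.parquet is a scratch path and the column is synthetic.

use std::{fs::File, sync::Arc};
use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::{Compression, ZstdLevel};
use parquet::file::properties::WriterProperties;

// Write the same batch with one codec and return the resulting file size.
fn size_with(codec: Compression, batch: &RecordBatch, path: &str) -> u64 {
    let props = WriterProperties::builder().set_compression(codec).build();
    let mut writer =
        ArrowWriter::try_new(File::create(path).unwrap(), batch.schema(), Some(props)).unwrap();
    writer.write(batch).unwrap();
    writer.close().unwrap();
    std::fs::metadata(path).unwrap().len()
}

fn main() {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let col = Int64Array::from_iter_values(0..1_000_000);
    let batch = RecordBatch::try_new(schema, vec![Arc::new(col)]).unwrap();

    for (name, codec) in [
        ("uncompressed", Compression::UNCOMPRESSED),
        ("snappy", Compression::SNAPPY),
        ("zstd(3)", Compression::ZSTD(ZstdLevel::try_new(3).unwrap())),
    ] {
        println!("{name}: {} bytes", size_with(codec, &batch, "probe.parquet"));
    }
}

On already-dense encodings, general-purpose compression often shrinks the data little while still costing decode work, which is exactly why the benchmark table above ends at --compression uncompressed.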

CLI Reference

# Basic Example
cargo run --release -- --input <FILE> --output <FILE> [OPTIONS]

Essential Options:

  • --input, --output: Paths to source and destination files.
  • --page-count <N>: Target pages per row group (Default: 0 [auto]).
  • --row-group-precise-row-count <N>: Target exact rows per group.
  • --compression <STRATEGY>: lightweight (default), optimal, or specific codecs.
  • --encodings <STRATEGY>: parquet_v2 (default), parquet_v1, or plain.
  • --optimizer <STRATEGY>: brute-force (default) checks every combination; threshold uses heuristics for speed.
  • --full-report-path <PATH>: Exports a CSV detailing the decision process for every column.

For full options: cargo run --release -- --help

Appendix: Reproduction Commands

# Line 1: CPU-Optimized Baseline from DuckDB
# Generate the TPC-H tables with DuckDB (https://duckdb.org/docs/stable/core_extensions/tpch),
# then export them to Parquet with the default configuration.

# Line 2: Rewrite # Pages (Target 200 pages per RG)
./target/release/parquet-column-grained-rewriter \
  --input raw.parquet --output step2_pages.parquet \
  --page-count 200 \
  --compression snappy --encodings parquet_v1 --optimizer threshold

# Line 3: Rewrite RG Size (Target 10M rows per RG)
./target/release/parquet-column-grained-rewriter \
  --input raw.parquet --output step3_rg_size.parquet \
  --page-count 200 --row-group-precise-row-count 10M \
  --compression snappy --encodings parquet_v1 --optimizer threshold

# Line 4: Better Encoding (Parquet V2 with 0.0 threshold to force efficient encodings)
./target/release/parquet-column-grained-rewriter \
  --input raw.parquet --output step4_encoding.parquet \
  --page-count 200 --row-group-precise-row-count 10M \
  --compression snappy --encodings parquet_v2 \
  --optimizer threshold --encoding-threshold 0.0

# Line 5: No Unnecessary Compression (Uncompressed + Parquet V2)
./target/release/parquet-column-grained-rewriter \
  --input raw.parquet --output step5_optimal.parquet \
  --page-count 200 --row-group-precise-row-count 10M \
  --compression uncompressed --encodings parquet_v2 \
  --optimizer threshold --compression-threshold 16.0