Thanks for sharing, Art!
What's the total file size? Is the 2.85 ms for reading the whole parquet file?
The total file size is 470.3 Kbytes. That's correct, 2.85 ms is the average runtime across 100 runs to read the parquet file and accumulate the counter from the read data.
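A minimal sketch of what such a timed read might look like with the Rust parquet crate's Arrow reader; the file path and the row-count accumulator are placeholders for illustration, not the actual benchmark code:

```rust
use std::fs::File;
use std::time::{Duration, Instant};

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    const RUNS: u32 = 100;
    let mut total = Duration::ZERO;
    for _ in 0..RUNS {
        let start = Instant::now();
        // Re-open and fully re-read the file on each run.
        let file = File::open("data.parquet")?; // hypothetical path
        let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
        let mut rows: usize = 0;
        for batch in reader {
            // Stand-in for "accumulate the counter": count the rows read.
            rows += batch?.num_rows();
        }
        std::hint::black_box(rows); // keep the loop from being optimized away
        total += start.elapsed();
    }
    println!("average over {} runs: {:?}", RUNS, total / RUNS);
    Ok(())
}
```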
Correct me if I'm wrong, but that works out to approximately 166 MB/s (470.3 KB / 2.85 ms ≈ 165 MB/s). If so, it seems there is a bottleneck somewhere; I'd expect reading a binary format to be faster.
For example, I measured nlohmann's JSON parsing benchmark (and that library is not meant to be performant): its parsing throughput is around 104-241 MB/s depending on the file.
@vadimmi your calculations are correct; however, Parquet is quite a sophisticated format that uses dictionary and RLE encodings as well as compression.
For example, if I disable dictionary encoding and compression and use PLAIN encoding:
```diff
Index: src/bin/js2pq/main.rs
===================================================================
diff --git a/src/bin/js2pq/main.rs b/src/bin/js2pq/main.rs
--- a/src/bin/js2pq/main.rs (revision 67dba32ff2b84851e975ea3d1a277d5957928e83)
+++ b/src/bin/js2pq/main.rs (date 1738675499916)
@@ -88,7 +88,6 @@
StatisticsMode::Page => EnabledStatistics::Page,
};
let file = std::fs::File::create(args.output_parquet_file_path)?;
- let compression = Compression::ZSTD(ZstdLevel::try_new(3)?);
let sums_double_col = get_list_column_path("sums_double");
let sums_long_col = get_list_column_path("sums_long");
let count_col = get_list_column_path("count");
@@ -96,8 +95,8 @@
let builder = WriterProperties::builder()
.set_statistics_enabled(enabled_statistics)
.set_writer_version(WriterVersion::PARQUET_2_0)
- .set_dictionary_enabled(true)
- .set_compression(compression);
+ .set_dictionary_enabled(false)
+ .set_encoding(Encoding::PLAIN);
let builder = if !args.use_flatbuffers {
// Not much benefit on having status on sums and count, disable it
builder
```
it produces Parquet output for the same input with a size of 1990.04 Kbytes, which makes the read throughput around ~700 MB/s. The uncompressed source JSON file has a size of 3163.67 Kbytes.
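For context, here is a sketch of the resulting writer configuration after the diff above, assuming the parquet crate's WriterProperties API (EnabledStatistics::Chunk stands in for the enabled_statistics value computed earlier in the original code):

```rust
use parquet::basic::Encoding;
use parquet::file::properties::{EnabledStatistics, WriterProperties, WriterVersion};

fn main() {
    // Dictionary encoding off, PLAIN encoding, and no compression
    // (the parquet crate's default once set_compression is dropped).
    let _props = WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::Chunk)
        .set_writer_version(WriterVersion::PARQUET_2_0)
        .set_dictionary_enabled(false)
        .set_encoding(Encoding::PLAIN)
        .build();
}
```

This trades a roughly 4x larger file (470.3 → 1990.04 Kbytes) for much cheaper decoding, which is where the jump from ~166 MB/s to ~700 MB/s comes from.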