Thanks for sharing, Art!
What's the total file size? Is the 2.85 ms for reading the whole parquet file?
The total file size is 470.3 Kbytes. That's correct, 2.85 ms is the average runtime across 100 runs to read the parquet file and accumulate the counter from the read data.
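A minimal sketch of what such a timed read might look like with the Rust parquet crate's Arrow reader; the file path and the row-count accumulator are placeholders for illustration, not the actual benchmark code:

```rust
use std::fs::File;
use std::time::{Duration, Instant};

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    const RUNS: u32 = 100;
    let mut total = Duration::ZERO;
    for _ in 0..RUNS {
        let start = Instant::now();
        // Re-open and fully re-read the file on each run.
        let file = File::open("data.parquet")?; // hypothetical path
        let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
        let mut rows: usize = 0;
        for batch in reader {
            // Stand-in for "accumulate the counter": count the rows read.
            rows += batch?.num_rows();
        }
        std::hint::black_box(rows); // keep the loop from being optimized away
        total += start.elapsed();
    }
    println!("average over {} runs: {:?}", RUNS, total / RUNS);
    Ok(())
}
```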
Correct me if I'm wrong, but that works out to approximately 166 MB/s (470.3 KB / 2.85 ms ≈ 165 MB/s). If so, it seems there is a bottleneck somewhere; I'd expect reading a binary format to be faster.
For example, I measured nlohmann's JSON parsing benchmark (and that library is not meant to be performant): its parsing throughput is around 104-241 MB/s depending on the file.
@vadimmi your calculations are correct; however, Parquet is quite a sophisticated format that uses dictionary and RLE encodings as well as compression.
For example, if I disable dictionary encoding and compression and use PLAIN encoding:
```diff
Index: src/bin/js2pq/main.rs
===================================================================
diff --git a/src/bin/js2pq/main.rs b/src/bin/js2pq/main.rs
--- a/src/bin/js2pq/main.rs (revision 67dba32ff2b84851e975ea3d1a277d5957928e83)
+++ b/src/bin/js2pq/main.rs (date 1738675499916)
@@ -88,7 +88,6 @@
StatisticsMode::Page => EnabledStatistics::Page,
};
let file = std::fs::File::create(args.output_parquet_file_path)?;
- let compression = Compression::ZSTD(ZstdLevel::try_new(3)?);
let sums_double_col = get_list_column_path("sums_double");
let sums_long_col = get_list_column_path("sums_long");
let count_col = get_list_column_path("count");
@@ -96,8 +95,8 @@
let builder = WriterProperties::builder()
.set_statistics_enabled(enabled_statistics)
.set_writer_version(WriterVersion::PARQUET_2_0)
- .set_dictionary_enabled(true)
- .set_compression(compression);
+ .set_dictionary_enabled(false)
+ .set_encoding(Encoding::PLAIN);
let builder = if !args.use_flatbuffers {
// Not much benefit on having status on sums and count, disable it
builder
```
it produces Parquet output for the same input with a size of 1990.04 Kbytes, which makes the read throughput around ~700 MB/s. The uncompressed source JSON file has a size of 3163.67 Kbytes.
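For context, here is a sketch of the resulting writer configuration after the diff above, assuming the parquet crate's WriterProperties API (EnabledStatistics::Chunk stands in for the enabled_statistics value computed earlier in the original code):

```rust
use parquet::basic::Encoding;
use parquet::file::properties::{EnabledStatistics, WriterProperties, WriterVersion};

fn main() {
    // Dictionary encoding off, PLAIN encoding, and no compression
    // (the parquet crate's default once set_compression is dropped).
    let _props = WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::Chunk)
        .set_writer_version(WriterVersion::PARQUET_2_0)
        .set_dictionary_enabled(false)
        .set_encoding(Encoding::PLAIN)
        .build();
}
```

This trades a roughly 4x larger file (470.3 → 1990.04 Kbytes) for much cheaper decoding, which is where the jump from ~166 MB/s to ~700 MB/s comes from.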