How to write DRY code in Rust
We often find ourselves writing repetitive Rust code that differs only slightly from one place to the next, which goes against the Don't Repeat Yourself (DRY) principle. In this article I'm going to show a specific example of such repetition when reading a Parquet file in Rust, and how to generalize it using traits, generics and closures. For simplicity we will read only non-nested, non-struct columns from the Parquet file using the columnar reader (also known as the Vectorized Columnar Reader).
To read a string column, here is the method read_string_column
To read a column of type Int64, here is the method read_i64_column
Here is the source code of these two methods. Clearly, the two methods are almost identical, and the diff already hints at what we can do:
Even though StringBuilder and PrimitiveBuilder implement ArrayBuilder, the trait ArrayBuilder does not have the methods append_null and append_value. How can we overcome this?
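To make the limitation concrete, here is a minimal, std-only sketch. The ArrayBuilder trait, StringBuilder and Int64Builder below are simplified hypothetical stand-ins, not the real Arrow types; they only mimic the shape of the problem: both builders have append_value and append_null, but the shared trait does not declare them, so generic code cannot call them.

```rust
// Hypothetical stand-ins for the Arrow builders; only the shape matters here.
trait ArrayBuilder {
    fn len(&self) -> usize;
}

#[derive(Default)]
struct StringBuilder {
    values: Vec<Option<String>>,
}

#[derive(Default)]
struct Int64Builder {
    values: Vec<Option<i64>>,
}

// The append methods are inherent methods of each builder, not trait methods.
impl StringBuilder {
    fn append_value(&mut self, v: &str) {
        self.values.push(Some(v.to_string()));
    }
    fn append_null(&mut self) {
        self.values.push(None);
    }
}

impl Int64Builder {
    fn append_value(&mut self, v: i64) {
        self.values.push(Some(v));
    }
    fn append_null(&mut self) {
        self.values.push(None);
    }
}

impl ArrayBuilder for StringBuilder {
    fn len(&self) -> usize {
        self.values.len()
    }
}

impl ArrayBuilder for Int64Builder {
    fn len(&self) -> usize {
        self.values.len()
    }
}

// Generic code can only call what the trait itself declares.
fn builder_len<B: ArrayBuilder>(b: &B) -> usize {
    // b.append_null(); // would not compile: the trait has no such method
    b.len()
}

fn main() {
    let mut names = StringBuilder::default();
    names.append_value("ann");
    names.append_null();

    let mut ids = Int64Builder::default();
    ids.append_value(7);
    ids.append_null();

    println!("{} {}", builder_len(&names), builder_len(&ids));
}
```

Because the two append methods share a name but not a trait, a function generic over the builder needs some other way to reach them.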
Using closure
The idea is to give the caller control over:
Appending a non-NULL value (the caller can apply an extra transformation here, for example converting ByteArray to str)
Appending NULL
Creating the builder
On top of that, we need to make the typed_rdr argument generic. Below is the signature of the method
Aren't there too many generic parameters, you may ask? Unfortunately, with the closure approach it is inevitable (or if there is a way, please let me know). Full method
And the refactored methods to read string and i64 columns
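The shape of the closure approach can be sketched without the Arrow and Parquet crates. Everything below is an assumption made for illustration: TypedReader and VecReader are hypothetical stand-ins for a typed column reader, a plain Vec<Option<_>> plays the role of the builder, and a definition level of 1 is taken to mean "value present". It is not the article's read_column_v1, only the same idea in miniature.

```rust
// Hypothetical stand-in for a typed Parquet column reader: it yields
// definition levels (1 = value present, 0 = NULL) and the packed non-NULL values.
trait TypedReader {
    type Value;
    fn read_all(&mut self) -> (Vec<i16>, Vec<Self::Value>);
}

struct VecReader<V> {
    def_levels: Vec<i16>,
    values: Vec<V>,
}

impl<V> TypedReader for VecReader<V> {
    type Value = V;
    fn read_all(&mut self) -> (Vec<i16>, Vec<V>) {
        (
            std::mem::take(&mut self.def_levels),
            std::mem::take(&mut self.values),
        )
    }
}

// One generic reading loop; the caller supplies the three builder-specific steps.
fn read_column<R, B, New, Val, Null>(
    reader: &mut R,
    new_builder: New,
    append_value: Val,
    append_null: Null,
) -> B
where
    R: TypedReader,
    New: FnOnce() -> B,
    Val: Fn(&mut B, R::Value),
    Null: Fn(&mut B),
{
    let (def_levels, values) = reader.read_all();
    let mut values = values.into_iter();
    let mut builder = new_builder();
    for level in def_levels {
        if level == 1 {
            // A present value: the next packed value belongs to this row.
            let v = values.next().expect("fewer values than definition levels");
            append_value(&mut builder, v);
        } else {
            append_null(&mut builder);
        }
    }
    builder
}

fn read_string_column(reader: &mut VecReader<Vec<u8>>) -> Vec<Option<String>> {
    read_column(
        reader,
        Vec::new,
        // The extra transformation (bytes to string) lives in the caller's closure.
        |b: &mut Vec<Option<String>>, bytes: Vec<u8>| {
            b.push(Some(String::from_utf8_lossy(&bytes).into_owned()))
        },
        |b: &mut Vec<Option<String>>| b.push(None),
    )
}

fn read_i64_column(reader: &mut VecReader<i64>) -> Vec<Option<i64>> {
    read_column(
        reader,
        Vec::new,
        |b: &mut Vec<Option<i64>>, v: i64| b.push(Some(v)),
        |b: &mut Vec<Option<i64>>| b.push(None),
    )
}

fn main() {
    let mut names = VecReader {
        def_levels: vec![1, 0, 1],
        values: vec![b"ann".to_vec(), b"bob".to_vec()],
    };
    let mut ids = VecReader {
        def_levels: vec![0, 1],
        values: vec![7_i64],
    };
    println!("{:?}", read_string_column(&mut names));
    println!("{:?}", read_i64_column(&mut ids));
}
```

Even in this reduced form the function needs five generic parameters: one for the reader, one for the builder, and one per closure.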
There is still some duplication, but maintaining a single read_column_v1 method is easier than maintaining the two almost identical methods we had before. Source code: read_column_v1.rs
Sealed trait approach
Traits can be used to extend the functionality of types, even if those types are defined in other libraries. Intuitively, we want our custom trait to be parameterized with a generic T that can be either ArrowPrimitiveType or ByteArrayType (to support strings). As code it would look like
You might have noticed this isn't valid Rust code: there is no such OR operator when defining trait bounds. There is the + operator, but it is for cases when T must implement both traits (see Specifying Multiple Trait Bounds with the + Syntax). However, there is a workaround. The sealed trait approach in Rust solves the problem of “I only want certain types to implement my trait” by making the trait depend on a supertrait hidden in a private module, so that only your own crate can implement that seal. This has two main effects:
Prevents unauthorized implementations - Since the sealing trait is private (not visible outside your crate), no external code can implement it
Enumerates exactly which types are allowed - You implement the sealed trait only for the specific types you want to allow
Once a type T must implement the sealed trait (via T: sealed::Sealed), only those types for which you explicitly wrote impl Sealed for X {} can be used. That’s how we achieve the restriction “T can be one of exactly these N types”.
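As a minimal sketch of the pattern described above: the Supported trait, its type_name method and the describe function are hypothetical names chosen for illustration, and i64 and String stand in for the types we want to allow.

```rust
mod sealed {
    // A public trait inside a private module: it can be named inside this
    // crate, but other crates cannot reach it and so cannot implement it.
    pub trait Sealed {}
    impl Sealed for i64 {}
    impl Sealed for String {}
}

// Hypothetical public trait; the supertrait bound is the seal.
pub trait Supported: sealed::Sealed {
    fn type_name() -> &'static str;
}

impl Supported for i64 {
    fn type_name() -> &'static str {
        "Int64"
    }
}

impl Supported for String {
    fn type_name() -> &'static str {
        "Utf8"
    }
}

// Accepts exactly the types listed above, which emulates "T is i64 OR String".
pub fn describe<T: Supported>() -> &'static str {
    T::type_name()
}

fn main() {
    println!("{}", describe::<i64>());
    println!("{}", describe::<String>());
    // describe::<f32>(); // would not compile: f32 does not implement Supported
}
```

Adding a new allowed type takes two impl blocks, one for the seal and one for the public trait, and both have to live in this crate.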
What is left now is to adjust ArrayBuilderEx<T> to use Supported and implement that trait for all the builders we are interested in
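A std-only sketch of what such an extension trait might look like follows. The marker types Int64Type and Utf8Type, the two builders, and the append_all helper are hypothetical stand-ins; only the names ArrayBuilderEx, Supported and Native are taken from the article, and their real definitions may well differ.

```rust
mod sealed {
    pub trait Sealed {}
}

// Hypothetical type-level markers (never instantiated, hence empty enums).
enum Int64Type {}
enum Utf8Type {}

impl sealed::Sealed for Int64Type {}
impl sealed::Sealed for Utf8Type {}

// Each supported marker names the native value type its builder accepts.
trait Supported: sealed::Sealed {
    type Native;
}

impl Supported for Int64Type {
    type Native = i64;
}

impl Supported for Utf8Type {
    type Native = String;
}

// The extension trait gives every builder the same two append methods.
trait ArrayBuilderEx<T: Supported> {
    fn append_value(&mut self, v: T::Native);
    fn append_null(&mut self);
}

#[derive(Default)]
struct Int64Builder {
    values: Vec<Option<i64>>,
}

#[derive(Default)]
struct StringBuilder {
    values: Vec<Option<String>>,
}

impl ArrayBuilderEx<Int64Type> for Int64Builder {
    fn append_value(&mut self, v: i64) {
        self.values.push(Some(v));
    }
    fn append_null(&mut self) {
        self.values.push(None);
    }
}

impl ArrayBuilderEx<Utf8Type> for StringBuilder {
    fn append_value(&mut self, v: String) {
        self.values.push(Some(v));
    }
    fn append_null(&mut self) {
        self.values.push(None);
    }
}

// A single generic loop now works for every builder, with no closures.
fn append_all<T, B>(builder: &mut B, rows: Vec<Option<T::Native>>)
where
    T: Supported,
    B: ArrayBuilderEx<T>,
{
    for row in rows {
        match row {
            Some(v) => builder.append_value(v),
            None => builder.append_null(),
        }
    }
}

fn main() {
    let mut ids = Int64Builder::default();
    append_all::<Int64Type, _>(&mut ids, vec![Some(1), None]);

    let mut names = StringBuilder::default();
    append_all::<Utf8Type, _>(&mut names, vec![None, Some("ann".to_string())]);

    println!("{:?} {:?}", ids.values, names.values);
}
```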
Full version of read_column_v2
This one looks tidier to me; however, it required some Rust type-system acrobatics to convince the compiler that the values read from the Parquet column, <T as DataType>::T, are the same type that the Arrow builder expects, <T as Supported>::Native. Source code: read_column_v2.rs.
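One way such a constraint can be expressed is an equality bound between two associated types. The sketch below is an assumption about the general technique, not a copy of read_column_v2: DataType and Supported are reduced to hypothetical one-line traits, and Int64Type, append_native and copy_values are illustrative names.

```rust
// Hypothetical stand-ins: DataType::T plays the role of the value type a
// Parquet reader yields, Supported::Native the type a builder accepts.
trait DataType {
    type T;
}

trait Supported {
    type Native;
}

// Type-level marker, never instantiated.
enum Int64Type {}

impl DataType for Int64Type {
    type T = i64;
}

impl Supported for Int64Type {
    type Native = i64;
}

// Builder side: only knows about Supported::Native.
fn append_native<P: Supported>(out: &mut Vec<Option<P::Native>>, v: P::Native) {
    out.push(Some(v));
}

// Reader side: receives DataType::T values and hands them to the builder side.
fn copy_values<P>(read: Vec<<P as DataType>::T>) -> Vec<Option<<P as Supported>::Native>>
where
    // The equality bound is what lets a DataType::T be passed as a Native.
    P: DataType + Supported<Native = <P as DataType>::T>,
{
    let mut out = Vec::new();
    for v in read {
        append_native::<P>(&mut out, v);
    }
    out
}

fn main() {
    let out = copy_values::<Int64Type>(vec![1, 2]);
    println!("{:?}", out);
}
```

Without the equality bound, the call to append_native is rejected, because the compiler treats the two associated types as unrelated.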