Blog / Engineering

A Year of Rust in ClickHouse

Alexey Milovidov
Apr 1, 2025 - 16 minutes read

Rust is a great programming language for terminal applications, fantasy consoles, and web3. If you do an experiment and say "C++" anywhere on the Internet, in a minute someone will chime in and educate you about the existence of Rust. Sadly, ClickHouse is written in C++, not in Rust. If only we had a chance to attract a large army of passionate Rust professionals... So, I decided that we must add Rust to ClickHouse.

The goal was not to rewrite ClickHouse in Rust - this would be a waste of time. We write ClickHouse in C++ and rewrite some parts from C++ to a better C++. Rewriting is never an end goal, although I know examples when engineers rewrite code from Rust in Rust if they like to rewrite everything in Rust. We should allow writing new system components in Rust. It should be integrated into the build system, so the Rust code interoperates with C++, and it is built and tested together with no complications.

First Steps

The first step was to find a small component to kick off the Rust integration. This component should not be in the critical path, it should be something that we can rip off at any time, so if our engineers become too nauseous from Rust poisoning, we just roll it back and forget. At the same time, it should be something large enough to test the integration to the build system. To not risk the mental health of our precious senior C++ developers, I outsourced this task to an undergrad student. To be honest, it's much easier to find Rust experts among undergrad students than inside the company. Rust is prominent in crypto and web3, so the first component to test it was - the integration of the BLAKE3 hash function into ClickHouse. At that time, BLAKE3 was only implemented in Rust, so it was at least somewhat reasonable.

BLAKE3

In a few months, thanks to Denis Bolonin, we had the first code in Rust in ClickHouse. Rust was added to CMake build scripts using the "corrosion" integration; a dedicated directory for Rust code was added to the project tree, and we created examples of adding C++ wrappers for Rust libraries. You can read more about this contribution in the blog post.

Skim

We have made it possible, and we opened floodgates to let the brilliant Rust developers into the ClickHouse codebase! So now they can write something other than new versions of old terminal applications. I was eagerly checking the list of pull requests for the first external contribution by Rust lovers... So they did come! The first external pull request with Rust was to improve our terminal application, clickhouse-client, by adding interactive history navigation. Note: before that, it was already added by calling a subprocess, but thanks to the integration with Rust, now we build and link this code into the ClickHouse binary. However, the Rust library has a bug and crashed.

After adding a second Rust library, skim, we found that the BLAKE3 function is already present in the LLVM's C++ codebase, which we also use. After upgrading LLVM, it was found much simpler to use BLAKE3 from there rather than with a Rust library. And to prove that the Rust integration works and is supported in the CI, it is enough to have only one Rust library. So, we replaced BLAKE3 from Rust to C++, and there was no noticeable difference in performance.

PRQL

The third library in Rust that we added to ClickHouse was PRQL. It is a query language, an alternative to SQL, which allows expressing queries in a pipelined, composable form. As a downside, it is more syntax-heavy than SQL, reminding something like Closure. It is popular on the Internet - someone shows PRQL, everyone else thinks "What a wonderful idea, and, also, Rust" and gives this project a star on GitHub. This is, by the way, how most of Rust projects get their stars on GitHub. It doesn't look like someone wants to use this language, but what we want is to ride the hype. Another student selected the integration of PRQL in ClickHouse as a graduate work.

An example of a SQL query in the ClickHouse dialect:

An example of the same query in PRQL:

After a few months, thanks to Alexander Nam, we had PRQL support merged into the ClickHouse codebase. ClickHouse has support for switchable dialects. Any time, you can write "SET dialect = 'prql'" or 'kusto', or set it back to 'sql' to switch between different languages. You can specify the dialect in the user's profile, or in the HTTP API's parameter, like any other setting, and pass your queries in the selected dialect. However, there are some downsides of alternative dialects, like the lack of interactive syntax highlighting for all dialects except SQL, and poorer integration with built-in functions and extensions.

Regardless of questions on its usefulness, PRQL brought us better confidence in Rust. After having two Rust libraries in our codebase, we started to understand that it would be hard to get rid of Rust, and we could tolerate it.

Delta Kernel

This is the first library we have added not just for the sake of Rust, but for practical needs. Delta Lake is an open data lake format from Databricks. It mimics the operation of MergeTree tables in ClickHouse but is based on Parquet files in object storage. Data lakes and "Lakehouse" (a new term, that sounds surprisingly similar to ClickHouse, but is different) - are increasingly popular ways for managing data in large organizations, as there is a benefit when the data format is independent of the storage engine. So, one engine (such as Spark) can write into a data lake in the Iceberg or Delta Lake format, and another engine (such as ClickHouse) can run queries on top of the data lake, so the whole architecture is named "data lakehouse".

The problem is that there are no good C++ implementations for either Iceberg or Delta Lake. Either you use Java (which would be a shame for us) or write it from scratch in C++. At first glance, it is not too hard to write Iceberg and Delta Lake implementation in C++, because each format is represented by these components: - a bunch of Parquet files on object storage; - a bunch of metadata files, and snapshot descriptions in formats like JSON and Avro; - a catalog, which is an API to find the tables. We already had C++ implementations of each of the data lake formats: Iceberg v1, v2, Delta Lake, and Apache Hudi. However, these implementations were sketchy, with many corner cases unsupported, and they are not particularly pleasant to write (no C++ engineer can be excited by the task of "read this JSON file and follow the spec precisely").

Here is the area where the increasing popularity of Rust showed in the best way. Recently, Databricks created the official library for Delta Lake in Rust, and we used it, replacing our old C++ code. See the presentation from Oussama Saoudi, Software Engineer at Databricks at our release call. We also improved the quality of the library for everyone, as we found and fixed a bug and added a feature. It was the most complex and the most real integration of Rust into ClickHouse. So let's review what problems we had to solve...

Problems

Supply chain

The initial integration of Rust was downloading libraries from the Internet during the build, by running cargo. But our build must be hermetic and reproducible. It must work without internet access, and it should not depend on 3rd-party services, as it could be a supply-chain problem. Also, we should build everything from the source code - no binary blobs in the repository.

This problem was solved initially by suggesting a way to disable Rust, then by stashing dependencies in the docker image, and finally, by vendoring all dependencies, although it was not final.

Complex wrappers

Rust is a memory-safe programming language, but to the surprise of our contributors, the first attempt to add Rust code often leads to a segmentation fault. This is because the interop between Rust and C++ requires writing the wrappers, and it has to be done with extra care to not mess up with things like who owns the memory, and in which order it is deallocated. It's not enough to be a Rust engineer to be able to do it properly. Thankfully, we use fuzzing in the CI system, and these types of errors are found before the merge.

Panic

(Airplane!, Paramount Pictures, 1980)

One of the most visible downsides of Rust is the lack of exceptions (however, it is possible to hack around it). It uses error values instead and implements it fairly well (much better than, say, Go or C). However, it requires an overhead in the code for every possible exceptional situation. You have to do it not only for, say, an error in parsing a JSON from a user, or a socket disconnection, but also for low-level errors from the operating system. In C++, exceptions will propagate through the stack and could be reported by the server's query processing thread. In Rust, often people use "panic" when they don't want to take the overhead of handling the error, and "panic" will terminate the program. This is okay for batch applications, which cannot meaningfully handle these errors, but not ok for multi-tenant servers, which have many query-processing threads.

Fuzzers in our CI automatically found a problem, that certain misformatted PRQL queries can lead to panic and crash the server. The authors of PRQL promptly fixed the error. This is one of the advantages of having a C++ codebase - you learn to have so much power in your CI system, that it will easily find bugs inside any Rust code as well, see 1, 2, and 3.

To be honest, this is not different from a problem you find in many C libraries, when they use asserts to catch "impossible" errors, but our CI immediately proves these errors are very much possible. We are fixing Rust libraries in the same way how we fix C libraries. Also, panics in Rust actually work like exceptions by unwinding the stack, and this was also used as a solution.

Sanitizers

When you write in C++ you always use sanitizers. It's important to use four sanitizers: Address Sanitizer (checks invalid memory accesses), Memory Sanitizer (checks for uninitialized memory usage), Thread Sanitizer (checks for data races), and UB Sanitizer (checks for ill-formed code leading to undefined behavior), so we do it in our CI. We do it extremely thoroughly, as all sanitizers run for all types of test suites in every pull request and every commit in the master and release branches, and we do it with extra hardening, randomization, for stress tests, and fuzzing.

You might say, sanitizers are unneeded in Rust because Rust is blazingly safe. But the problem is - now we have a C++ application with a drop of Rust, and we must continue to use sanitizers for the whole application. What happens if C++ allocates a chunk of memory, then we call a Rust function to fill it with the data, and then C++ code will read from this memory? If we build with MSan, all memory writes have to be instrumented, so next time when someone reads from the memory, MSan can know that the memory was previously initialized. It means that every code, including Rust, has to be built with the instrumentation by MSan (another option is to call __msan_unpoison manually for the data from Rust).

Instrumentation with sanitizers is supported by Rust, but MSan is available only in the "nightly" (bleeding-edge) Rust toolchain, so we had to put efforts for specifying this version of the toolchain and pin it for a particular version, so it does not break over time.

For some reason, we also had to disable Thread Sanitizer for Rust.

Cross-compilation

ClickHouse builds with cross-compilation, which means - we can use x86_64 Linux machine (the "host platform" - where the build is done) to build ClickHouse for AArch64 mac (the "target platform" - is what we build for), or vice-versa. In fact, even if the host platform is identical to the target platform, e.g., when we build on x86_64 Linux for x86_64 Linux, we still use the cross-compilation mechanism, to ensure that our build process does not depend on the platform or the environment, but depends entirely on the source code from our repository. This allows hermetic and reproducible builds that anyone can validate.

ClickHouse is very portable. For example, people from IBM want to run ClickHouse on big-endian s390x systems, so we build it in our CI. My old friend, who lives in a garage with five cats wants to run ClickHouse on FreeBSD, so we build it for FreeBSD (the cats are nice).

It turns out it is not so easy in Rust. Well, maybe it is easy in Rust, and even easier than in C++, but when you use Rust and C++ together, it is not easy.

Library linking

ClickHouse build is restricted from using any libraries from the system - we isolate the build by making it independent of the operating system distribution. However, delta-kernel-rs depends on another library, named "reqwest", and it wants to link OpenSSL from the system, while we require static linking of the in-tree build of OpenSSL. We solved this problem.

Symbol sizes

C++ templates are notorious for slowing down compilation, bloating binary sizes, and generating very long symbol names (symbol name is the name of a piece of code or data in the binary file, used by the linker to bind different libraries together). But Rust can be even worse. It also has compile-time code generation, and it can go out of control as well. After adding PRQL, I was surprised by the binary size increase and found that symbols from PRQL approach 85 KB size (which is not the code, just the generated function names!). The author of PRQL promptly fixed this problem. Also, we disabled debug info for all Rust libraries, because as Rust is safe, who needs to debug it?

An example of a symbol name from Rust. It is about 300 times longer than what fits in the picture.

I found the offending code using my binary visualizer, which is available at the /binary handler in ClickHouse.

ClickHouse binary code visualized. Try to find Rust here.

Update: While writing this article I found that the problem is not fixed, and it became worse: now some symbols from Rust approach almost a megabyte in size. I have to fix it urgently.

Composability

There is a problem, when our C++ code uses some conventions to do things like managing connections to external storages, or accounting of allocated memory, and Rust code uses other conventions. We maintain connection pools for S3, we ensure that requests use just the right amount of parallelism, spread across multiple endpoints uniformly, and remain alive just the right time to allow proper rebalancing and avoid slow start, and connections use a proper exponential backoff, proven in practice. When anything in our code allocates memory, we account for it in the query context, to control the memory usage. But it’s hard to expect the level of composability from external libraries to support providing custom memory allocators, custom connection pools, etc. Maybe this is common in Haskell or Julia, but good luck using them. Inevitably, a Rust library will do something slightly differently, and then we will have to patch it. Now it means that we have to sink deeply into the Rust code. The amount of trouble was not evident at the start of the journey.

Build profiling and caching

Our build infrastructure supports distributed cache and profiling of the build process. Adding Rust means that we have to support the caching and profiling of Rust builds as well. In practice, it means that for a long period of time, we work without a build cache for Rust and pay more money for it.

Dependency management

Rust has a highly modular ecosystem, and libraries are composed much better than in C++. As a downside, Rust libraries typically have a large fan-out of dependencies, much like Node.js. This requires taking care to avoid the blow-up of dependencies, and to deal with annoyances of dependabot.

How is Rust going?

Rust is going great! If your friend is eager to write something in Rust, please invite them to ClickHouse, and we guarantee a warm welcome.

Get started with ClickHouse Cloud today and receive $300 in credits. At the end of your 30-day trial, continue with a pay-as-you-go plan, or contact us to learn more about our volume-based discounts. Visit our pricing page for details.

Share this post

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...
Follow us
X imageBluesky imageSlack image
GitHub imageTelegram imageMeetup image
Rss image