To be clear, I'm neither some crypto head trying to FOMO-bait people, nor an evangelist of some cult trying to convert everybody or start a flamewar (I'm not interested in any of that XD). This article is purely based on my two years of personal experience with the Rust programming language and its ecosystem, alongside languages like Golang, Scala, and Python. I want to share my first-hand knowledge, drawing on my five years in data engineering, and tell you the great things and the bad things Rust has to offer in this field.
1. Start with Python 🐍
About 5 years ago, when I was tired of being a Python script kiddie, I was lucky enough to be hired by a startup data company, where I learned everything about this industry -- batching, streaming, lakehouse, warehouse, extraction, transformation, ingestion... The company had clients big and small: some had to deal with billions of records in their DBMS, some needed to run complex analytical computations on their private servers. I was on the team responsible for our ETL toolkit, which we developed and integrated as a user-friendly low-code platform. It was built around Pandas and Apache Spark: Pandas for exploration and previewing, PySpark for batch processing, a very common data handling architecture. So naturally, armed with my Monty Python knowledge, I got up to speed on it very quickly.
Not long after, however, we had a hard time with this architecture. I could spin up a chart or sheet very quickly using Pandas, but whenever the sample dataset grew larger than usual, the OOM killer kicked in right on schedule. A classic Pandas experience, and we had to bump the memory limit temporarily each time. Sometimes the data was too skewed, making one Spark worker node crash repeatedly and dragging the whole Spark cluster into a death loop, so we had to redo the partitioning, a slow and painful process. Once the dataset gets too large, you have to resort to trial and error, and extracting any reasonable insight from it becomes very time- and resource-consuming. The story goes on and on...
2. Meet Rust 🦀
So: Pandas is smart, but its memory usage balloons as datasets grow. Spark can do the heavy lifting, but it's slow and not so flexible. Is there any tool smart and flexible enough, not so resource hungry, that can still handle huge data processing? We quickly set our eyes on the then newly-born Polars project, which is mainly developed in Rust. It offers an experience similar to Pandas and also provides a Python package, so the most common things you do in Pandas can be done in Polars, just faster and with less memory, which is very convenient. However, Polars was not very feature-complete back then: many complex transformations were missing, it was not easy to extend, and it did not offer a complete SQL engine, which we also needed.
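To give a feel for it, here is a minimal sketch of the Polars lazy API on the Rust side, assuming a recent Polars version with the lazy feature enabled (the method was still called groupby in older releases, and the column names here are made up):

```rust
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // A tiny frame built in memory; in practice you'd scan CSV/Parquet lazily.
    let df = df!(
        "user" => ["a", "a", "b"],
        "amount" => [10i64, 25, 7]
    )?;

    // The lazy API builds a query plan and only runs it on collect(),
    // which is what keeps memory usage in check on bigger datasets.
    let out = df
        .lazy()
        .filter(col("amount").gt(lit(5i64)))
        .group_by([col("user")])
        .agg([col("amount").sum().alias("total")])
        .collect()?;

    println!("{out}");
    Ok(())
}
```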
Then came Apache DataFusion. It boasts a truly extensible query engine, also written in Rust, delivers performance equal to or better than Polars, and supports hugely flexible data sources and transformations. We quickly ran some experiments with it and were very satisfied. At that time, not many DBMSs were officially supported by it, so we shamelessly dug through the Polars dependencies and found the connector-x crate, a high-performance Rust connector to all kinds of DBMSs (BTW, I made several PRs to it that were merged afterwards). With several adapter traits and structs, we were able to connect DataFusion to MySQL, Postgres, Oracle, SQL Server... you name it, which was mind-boggling and insanely productive. The Rust ecosystem was that vibrant and abundant, and it's still thriving!
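For the unfamiliar, the core DataFusion workflow looks roughly like this; a minimal sketch assuming a recent DataFusion release, with the file name and SQL invented for illustration:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // A session holds the catalog, config, and runtime for query execution.
    let ctx = SessionContext::new();

    // Register a CSV file as a table; Parquet, JSON, and custom table
    // providers (e.g. one wrapping connector-x results) work the same way.
    ctx.register_csv("events", "events.csv", CsvReadOptions::new())
        .await?;

    // Full SQL, planned and executed natively in Rust.
    let df = ctx
        .sql("SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id")
        .await?;
    df.show().await?;
    Ok(())
}
```

The adapter work mentioned above hinges on DataFusion's TableProvider trait, its standard extension point for custom data sources, which is presumably where the connector-x results got plugged in.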
3. The Good Times
After several test deployments replacing Spark, we were very satisfied with our DataFusion-based query engine. It generally outperformed Spark not only in resource consumption but also in speed. Even on much smaller datasets it still managed to beat Spark, thanks to Spark's node-scheduling overhead, making real-time query execution possible. Clients and on-site engineers were also pleasantly surprised by the improvement. So we decided to publish it as an additional OLAP service in our ETL toolkit platform. Then things went even more smoothly (or rusty? lol). Thanks to the highly extensible nature of DataFusion and the fantastic tooling ecosystem of Rust, we managed to develop several SDK packages for our new query engine in mere weeks, supporting Python and Go, which made our other existing services blow their old versions out of the water as well, and blew our minds once again.
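The Python SDK boiled down to a thin binding layer over the engine. Here is a hedged, minimal sketch of what such a binding can look like with PyO3 0.21+; the module and function names are hypothetical, and the real version would return Arrow data instead of strings:

```rust
use pyo3::prelude::*;

/// Hypothetical SDK entry point: run a SQL statement against the engine.
#[pyfunction]
fn run_query(sql: &str) -> PyResult<Vec<String>> {
    // The real binding would hand this off to the DataFusion-based engine;
    // we just echo the statement to keep the sketch self-contained.
    Ok(vec![format!("executed: {sql}")])
}

/// The Python module itself; a crate like this builds into an
/// installable wheel with maturin.
#[pymodule]
fn query_sdk(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(run_query, m)?)?;
    Ok(())
}
```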
My first year with Rust flew by without me even realizing it. We were busy constantly improving our query engine, building more supporting tools and online services, adding support for more DBs, file formats, and storage systems, doing all kinds of customizations for clients, and so on. Two full-time Golang developers on our team were also converted to Rust (Rust is not a cult, I promise).

Overall, the Rust ecosystem is amazing: lively and rapidly evolving. You can find packages for all kinds of common functionality on the public crates.io repository. There is an official, dedicated docs.rs site for documentation. The Rust book and the Rust async book are extremely helpful even to complete newbie programmers. Also, don't forget rust-by-example, the Rust playground, and a very nice tour of rust if you were impressed by the tour of golang. There are countless Rust tutorials on the Internet, so different types of learners can pick the ones that suit them best.

Moreover, there are several online forums if you need extra help, like the r/rust subreddit or the official Rust users' forum. Many have the impression that Rust communities are toxic, hostile, or hard to work with. I can't speak for everyone, but our team generally finds people in Rust communities very eager to help. There is no "deep" knowledge kept away from newcomers, and hardly any humiliating "RTFM" comments, which are quite common on the Stack Overflow sites and which I think are rude and unproductive. To be fair, I personally find /r/rust much more helpful than SO: on /r/rust people can post all sorts of things Rust, rants, help requests, requests for advice, show-offs of new creations... serious or casual. They are not restricted to the rigid question-then-answer paradigm; people just exchange their knowledge freely, with no sense of superior or inferior. They can't avoid some issues on Reddit's side, like information being hard to archive and search and messy formatting, but I think it's good enough for a free community for the time being.
4. The Bad Times
Another year or so went by, and our team landed more clients than ever. But some little things we had thought irrelevant got messy fast. First, the CI became unbearably slow while test builds on multiple branches waited in the queue. It turned out to be a caching problem in the cargo build architecture; combined with our complex CI environment, there was no sound solution, and we had to resort to dark magic to mitigate it (I also wrote a post about this in Chinese). One day our devops colleague came over and asked what we were doing on the CI machine: our jobs consumed most of the CPU time, and complaints came in from other teams whose jobs kept crashing every day. Yeah, you probably already guessed it: cargo build is quite CPU-intensive, and by default it eats up all the cores. We had to set limits on the builder containers, but then the build times dragged on longer and longer again. To reduce the wait for any change to land, we even tried to gradually break our monolithic project into a modular, microservice design, but quickly found the maintenance cost punishing. You may ask, why not throw more machines at it? Truth is, we were still a startup, and more machines call for more maintenance effort. Besides, the CI cluster we had was by no means weak: around 256 cores, 1TB of RAM in total, and a JBOD array packing god knows how many terabytes of disks. We were desperate to find ways to cut down the compile wait.
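For reference, capping cargo's parallelism is a one-liner, either as cargo build -j N on the command line or in the cargo config; a minimal sketch, with the number purely illustrative:

```toml
# .cargo/config.toml
[build]
jobs = 8   # cargo otherwise defaults to using every logical core
```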
5. How We Improved
Here are some strategies we tried and applied to the dependencies in cargo.toml. The core idea is to be very careful about which crates and which versions to pull in, in order to shrink the dependency tree as much as we can (a sketch follows this list).

a) Remove any unnecessary, "left-pad"-like dependency, and re-implement it in our own code base if possible. And now we know that the cargo build directory is just as heavy as node_modules. 😅

b) Hand-pick the versions of our dependencies. There is usually overlap among dependencies, dependencies of dependencies, and so on, just at different versions. Choose wisely on crates with a huge dependency tree, so that the resulting tree is as low and narrow as possible.

c) If method b) doesn't work well for some particular crate, mirror it on our own git server and re-export and/or modify some of its dependencies. Sometimes this means mirroring several other crates in a row, so we only use it when absolutely necessary.
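Here is a hedged sketch of what b) can look like in cargo.toml, with crate names and features only as examples; cargo tree --duplicates is the stock tool for spotting overlapping versions:

```toml
[dependencies]
# Turn off default features and opt in to exactly what we use;
# this alone can prune large sub-trees.
tokio = { version = "1", default-features = false, features = ["rt-multi-thread", "macros"] }
serde = { version = "1", default-features = false, features = ["derive"] }

# Method c): a crate mirrored on our own git server with its heavy
# dependencies trimmed (the URL is hypothetical).
# some-heavy-crate = { git = "https://git.example.internal/mirrors/some-heavy-crate" }
```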
Generally, these strategies helped us cut the dependency tree down by about 50%, and a test build pipeline took around 30 minutes with a limit of 16 CPU cores at that time. But that was still too wasteful compared to our non-CGO Golang projects, which generally take only a minute or two, so we had to find other ways. One thing we reached for was feature flags. Most of the time, a test build only exercised code paths in a few specific packages within the code base. After we gated several lesser-used packages behind feature flags (see the sketch below), the time for one test build came down to around 20 minutes. A nice, but not that significant, improvement.
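Feature gating is plain cargo machinery; a minimal sketch with hypothetical feature names:

```toml
# Cargo.toml
[features]
default = ["mysql", "postgres"]  # what everyday test builds compile
mysql = []
postgres = []
oracle = []                      # rarely-touched code, opt-in only
```

The gated packages are wrapped in #[cfg(feature = "oracle")] in the source, and a CI job that doesn't touch them builds with cargo test --no-default-features --features mysql.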
Another thing we changed was the cargo build profile (see the Cargo documentation on profiles). We had been using the release profile for test builds, because some test routes were too slow on the non-optimized debug profile. We played with several options and decided to tune down the opt-level and lto settings; together with mold, the modern multithreaded linker, this brought the build time down to about 10 minutes or less.
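Here is a hedged sketch of what such a setup can look like, assuming Rust 1.57+ for custom profiles; the exact numbers are illustrative, not our production values:

```toml
# Cargo.toml -- a dedicated profile for CI test builds,
# selected with: cargo test --profile ci
[profile.ci]
inherits = "release"
opt-level = 1       # far cheaper to compile than the default 3, still usable
lto = "off"         # skip link-time optimization entirely
codegen-units = 16  # more parallel codegen, at some runtime cost

# .cargo/config.toml -- route linking through mold (Linux):
# [target.x86_64-unknown-linux-gnu]
# linker = "clang"
# rustflags = ["-C", "link-arg=-fuse-ld=mold"]
```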
6. To be Continued
At this point, 10 minutes is tolerable for us. But the rabbit hole goes much deeper, and there are many more aspects we could still improve on. There are also more rusty things I want to share with you. More posts coming!