HuggingFace datasets uses Apache Arrow, which can either be loaded into RAM or memory-mapped from disk. When storing to a repository, the library converts the data to compressed Parquet, splitting it into shards as necessary to keep each individual Parquet file below half a gig. When loading from a repository, it's converted back to Arrow and stored in a cache directory. A minimal sketch of that round trip with the `datasets` Python API is below.
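Here's roughly what that looks like in code; the repo id and column names are placeholders, and pushing requires being logged in to the Hub:

```python
# Sketch of the Arrow <-> Parquet round trip described above.
# "your-username/demo-dataset" and the columns are made up for illustration.
from datasets import Dataset, load_dataset

# An in-memory Dataset; once written to disk it's backed by Apache Arrow
# and memory-mapped, so it doesn't all have to fit in RAM.
ds = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})

# Pushing to the Hub converts the Arrow data to compressed Parquet shards.
# max_shard_size caps each Parquet file (500MB is the default).
ds.push_to_hub("your-username/demo-dataset", max_shard_size="500MB")

# Loading downloads the Parquet files, converts them back to Arrow, and
# caches the Arrow files locally (~/.cache/huggingface/datasets by default).
ds2 = load_dataset("your-username/demo-dataset", split="train")
```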
Since HuggingFace is fairly popular, I figure they've probably put more thought into file format choices than I have. So for my last two projects that's what I've used, and it has worked just fine so far, at least on datasets in the 40GB, 5-million-row range.
I just wish the documentation on Arrow were better, and that the Rust libraries were more complete.