If you are using one of the less common formats, which one do you use, and why? Personally I've found webdataset promising, but what other formats are out there? Or, if you train straight from the original files, how do you ensure good throughput and shuffling?
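For context, here's a minimal sketch of the webdataset pattern I'm referring to; the shard pattern and the "jpg"/"cls" keys are hypothetical placeholders for whatever your tar shards actually contain:

```python
import webdataset as wds
from torch.utils.data import DataLoader

# Shards are plain tar files, so they stream sequentially from local disk
# or object storage; shuffling happens in a fixed-size in-memory buffer
# rather than globally across the dataset.
dataset = (
    wds.WebDataset("shards-{000000..000099}.tar")
    .shuffle(1000)            # buffer-based shuffle
    .decode("pil")            # decode images with PIL
    .to_tuple("jpg", "cls")   # yield (image, label) pairs
)

loader = DataLoader(dataset, batch_size=64, num_workers=4)
```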
HuggingFace datasets uses Apache Arrow, which can either be loaded fully into RAM or memory-mapped so it's read off disk on demand. When pushing to a repository they convert to Parquet (compressed), splitting into parts as necessary to keep each individual Parquet file below half a gig. When loading from a repository it's converted back to Arrow and stored in a cache directory.
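A rough sketch of that workflow, assuming a hypothetical "user/my-dataset" repository on the Hub:

```python
from datasets import load_dataset

# Default path: download, convert Parquet -> Arrow in the local cache,
# then memory-map the Arrow file instead of loading it all into RAM.
ds = load_dataset("user/my-dataset", split="train")

# Alternative: stream records straight from the Hub without building a
# local cache; shuffling is then done with an in-memory buffer.
streamed = load_dataset("user/my-dataset", split="train", streaming=True)
shuffled = streamed.shuffle(buffer_size=10_000, seed=42)

for example in shuffled.take(3):
    print(example)
```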

Since HuggingFace is fairly popular, I figure they've put more thought into file format choices than I have. So that's what I've used for my last two projects, and it's worked just fine so far, at least on my datasets in the 40 GB, 5-million-row range.

I just wish the documentation on Arrow was better, and that the Rust libraries were more complete.