(Chinese) https://www.high-flyer.cn/blog/3fs/
They have been developing and using this file system in-house for several years.
Compared to traditional file systems, it is focused on model-training workloads that involve a lot of random reads. Read caching and prefetching are useless in that case, so they designed the file system without those features to improve performance.
I google translated some key parts here:
3FS is a special file system because it is used almost exclusively for batch-reading sample data on compute nodes during AI training, accelerating model training through fast interaction between compute and storage. This is a large-scale random-read workload, and data that has been read will not be reused any time soon, so we cannot use the most important tool, the read cache, to optimize file reads; even readahead is useless. As a result, the implementation of 3FS is quite different from other file systems.
Specifically, as shown in the figure above, 3FS uses the Linux AIO and io_uring interfaces to read samples. In the 3FS scenario, the file cache has no benefit at all; it only consumes system memory in a way that is hard for users to control, affecting subsequent tasks. So we turned off the file cache and read data only in Direct I/O mode. Note that when reading this way, the buffer pointer, offset, and length all need to be aligned. If users had to do this alignment themselves, it would generate extra memory copies, so we do the alignment inside the file system, which both improves performance and is more convenient for users.
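For anyone unfamiliar with the alignment rules the translation mentions, here is a minimal sketch (not DeepSeek's code) of what an aligned Direct I/O read looks like from userspace; the path and the 4096-byte alignment are assumptions, since the real requirement depends on the device:

```python
import mmap
import os

ALIGN = 4096  # assumed logical block size; the actual requirement is device-dependent

# O_DIRECT bypasses the page cache entirely, which is the point in this workload.
fd = os.open("/data/sample.bin", os.O_RDONLY | os.O_DIRECT)  # hypothetical path
try:
    # An anonymous mmap is page-aligned, which satisfies O_DIRECT's buffer-address rule.
    buf = mmap.mmap(-1, 1 << 20)          # 1 MiB buffer; length is a multiple of ALIGN
    offset = 42 * ALIGN                   # file offset must also be block-aligned
    nread = os.preadv(fd, [buf], offset)  # misaligned pointer/offset/length -> EINVAL
finally:
    os.close(fd)
```

The quoted passage is saying that 3FS hides exactly this bookkeeping from the user, so no extra copy is needed to fix up an unaligned user buffer.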
Put another way: in my experience, supporting fast random reads is a challenging problem, while supporting fast sequential reads is fairly straightforward. When is random access to a training set absolutely necessary for training a model?
(Random) You shuffle the deck every time you go through it. You're forced to learn the images and their classifications without relying on any specific sequence, as the data has no signal from sequence order.
(Fixed order) Every time you go through the deck, the images appear in the exact same order. Over time you may start to unconsciously memorize the sequence of flashcards, rather than the actual classification of each image.
When it comes to actually training a model, if the batches are sampled sequentially from a dataset, the model risks learning from correlations caused by the ordering of the data, resulting in poor generalization. In contrast, when you sample the batches randomly, the model is encouraged to learn features from the data itself rather than from signals that arise as artifacts of the ordering.
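As a small illustration of that difference (plain Python, nothing from 3FS), a seeded per-epoch shuffle is all it takes to turn a tidy sequential scan into the pure random-read pattern the filesystem has to serve:

```python
import random

def epoch_indices(num_samples: int, shuffle: bool, seed: int) -> list[int]:
    """Order in which records are fetched during one epoch."""
    idx = list(range(num_samples))
    if shuffle:
        random.Random(seed).shuffle(idx)  # a new seed per epoch gives a new order
    return idx

# Sequential: the storage layer sees one long streaming read per epoch.
# Shuffled: every fetch lands at an unpredictable offset, i.e. pure random reads.
for epoch in range(3):
    order = epoch_indices(num_samples=1_000_000, shuffle=True, seed=epoch)
    first_batch = order[:512]  # 512 scattered record ids to fetch for batch 0
```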
Imagine that you have a long sequence of numbers. You want to randomly select windows of, say, 1024 consecutive numbers as inputs to your model. Now say the sequence has n items and you want to sample n/c windows in total (c is a constant, c << 1024). How do you do a fixed shuffle?
The key point is that the windows we want to read overlap. If we brute-force the fixed shuffle by materializing every window, we have to store roughly 1024/c times more data than the original dataset.
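One way to avoid that blow-up (a sketch under my own assumptions, not something from the thread): store only a seeded permutation of the window start offsets and read each 1024-item window from the flat file on demand, which is exactly the random-read pattern being discussed:

```python
import random

WINDOW = 1024  # items per sampled window
C = 4          # hypothetical value for the constant c (c << 1024)

def fixed_shuffle_starts(n_items: int, seed: int = 0) -> list[int]:
    """Deterministic ("fixed") shuffle of window start offsets.

    Materializing every window would store ~WINDOW/C times the original data;
    keeping only the shuffled start offsets costs one integer per window.
    """
    starts = list(range(0, n_items - WINDOW + 1, C))  # ~n/c candidate windows
    random.Random(seed).shuffle(starts)               # same seed -> same order every run
    return starts

# Each training step then turns one start offset into a single random read of
# WINDOW consecutive items from the untouched original file.
```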
This isn't useful for LLMs, but hey, wonder how it started?
As an ML infra guy I have had to debug a lot of failing jobs over the years, and randomizing data pipelines are among the hardest to debug. Sometimes there will be a "record-of-death" that randomly gets shuffled into a batch, but only causes problems when it is (extremely rarely) coupled with a few other records.
I guess I'll just have to update my priors and accept that inline synchronous randomization with random reads is a useful-enough access pattern in HPC that it should be optimized for. Certainly a lot more work and complexity, hence my question of just how necessary it is.
Building a system for serving read-only data at NVMe SSD speed (in terms of IOPS) took surprisingly little effort, and is mostly enough for training data. Kudos to DeepSeek for deciding to spend the extra effort to build a full PFS and share it.
It's certainly not true on actual hard drives, and never has been. A seek is around 10ms.
I don't like comparing the two, they're completely different workloads and it's better IMO to look at the IOPS for random transfers, which is where newer, faster SSDs truly excel, and where most people "notice" the performance.
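For a rough sense of the gap being described (round numbers, not figures from this thread):

```python
hdd_seek_s = 0.010                  # ~10 ms per seek
hdd_random_iops = 1 / hdd_seek_s    # ~100 random reads/s per spindle

nvme_random_iops = 1_000_000        # order of magnitude for a modern NVMe SSD at 4 KiB
print(nvme_random_iops / hdd_random_iops)  # ~10,000x more random reads per second
```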
If you build a cache that gets hits on the first pass, it won't get hits on the second and later passes.
The infra work is usually technically tedious, so I think it may become a lost art in the West, just like those manufacturing jobs.
What's going on here, why are people forgetting what's around them? Does familiarity breed contempt? Are attention spans so shot that failure to participate in this week's news cycle is enough for "out of sight, out of mind"? Or is HN full of Chinese bots now?
My hypothesis is that there is not such a big difference at all. All three of the companies you mentioned are world-class competitors in this. DeepSeek were the last to have a "hit", but that isn't an indication that they'll be the next of the three (or of other, yet unknown, entities) to have the next one. We try to predict what happens next, but perhaps we should focus instead on who or what we want to succeed. For me it's quite clear: it should be open source, or I'm not that interested long-term.
"If you are pursuing short-term goals, it is right to find people with ready experience. But if you look at the long-term, experience is not that important. Basic skills, creativity, and passion are much more important.”
OpenAI et al. have also gone pretty deep down the systems rabbit hole (e.g. Triton), but I can't think of anyone else (outside of Google/Facebook) who pays this amount of attention to these things.
Great work; hope Deepseek does even more awesome things going forward.
In terms of fast FUSE (also my first question): it appears to be `io_uring` + FUSE :)
https://github.com/deepseek-ai/3FS/blob/main/src/lib/api/Usr...
I believe there's work to minimize this using io_uring so that you can talk to the FUSE driver without the kernel being in the middle, but that work wasn't ready last time I checked.
For what it's worth, at Palm we had a similar problem because our applications were stored compressed but exposed through FUSE uncompressed. Instead of O_DIRECT, I just did an fadvise to dump the cache after a read. Not as high throughput, but it was the least risky change to get the same effect.
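The same trick in miniature (a sketch of the idea, not Palm's actual code): do a normal buffered read, then tell the kernel the pages won't be needed again so they don't pile up in the page cache:

```python
import os

def read_then_drop(fd: int, offset: int, length: int) -> bytes:
    """Buffered read followed by dropping those pages from the page cache."""
    data = os.pread(fd, length, offset)
    # Advise the kernel that this range won't be reused, so the cached pages
    # can be reclaimed instead of squeezing out memory that later tasks need.
    os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
    return data
```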
So has uncached buffered IO: https://www.phoronix.com/news/Uncached-Buffered-IO-Linux-6.1...
6.14 is an exciting kernel!
arXiv:2408.14158v2 [cs.DC] 31 Aug 2024
"Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning"
Abstract:
"The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has exponentially increased demands of computational power and bandwidth. This, combined with the high costs of faster computing chips and interconnects, has significantly inflated High Performance Computing (HPC) construction costs. To address these challenges, we introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework and its best practices. For DL training, we deployed the Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieved performance approximating the DGX-A100 while reducing costs by half and energy consumption by 40%. We specifically engineered HFReduce to accelerate allreduce communication and implemented numerous measures to keep our Computation-Storage Integrated Network congestion-free. Through our software stack, including HaiScale, 3FS, and HAI-Platform, we achieved substantial scalability by overlapping computation and communication. Our system-oriented experience from DL training provides valuable insights to drive future advancements in AI-HPC."
Have we at the Valley companies lost touch?
The High-Flyer team is pretty well resourced... I think they have more than 10 people.
That's with 14 unnamed SSDs per node. I wonder how this would scale with higher-end SSDs, going from PCIe 4 to PCIe 5 or PCIe 6... Particularly whether one could scale down!
What are we going to see tomorrow? DeepSeek OS or something?
I have a theory as to why...
It is really frustrating to see good engineers go off to play trading games. We should study how exactly China managed to unlock this capacity.
Why do you think this would be controversial? This isn't everyday work.
Does it really matter whether it's illegal or not if there is no enforcement? Pinduoduo (also known as Temu) has been doing 70-hour weeks since they started. Yes, they are still doing it right now.
Also, what specifically are the data access patterns for training and inference that differ from traditional use cases?
You can try to use "standard" options like MinIO/Ceph(RADOS)/SeaweedFS, but you will very quickly learn those systems aren't remotely fast enough for these use cases.
AI training is what this is used for, not inference (which has absolutely no need for any filesystem at all). What makes the workload somewhat special is that it's entirely random read and not cacheable at all as most reads are one and done.
Would Lustre be perfectly fine at 6TiB/s? Yes. Is it a huge pain in the ass to operate and to make even remotely highly available? Also yes. If this thing is capable of the throughput but easier to operate and generally more modern and less baroque, it's probably an improvement. TL;DR: Lustre is fast, but that is literally its only redeeming quality. I have lost far too many hours of my life to the Lustre gods.
I'll add that latency also doesn't matter that much. You are doing batched data loading for batch n+1 on the CPU while the GPUs are churning through batch n-1 and batch n is being copied from host memory at the same time.
So as long as your "load next batch" doesn't run for something like >1s, it's fine. But a single "load next batch" on one worker means thousands (if not more) of random reads.
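A bare-bones version of that overlap (hypothetical `load_batch` callable, nothing tied to any particular framework): keep a small prefetch queue so the next batch is being assembled while the current one is consumed:

```python
import queue
import threading

def prefetching_loader(load_batch, num_batches: int, depth: int = 2):
    """Yield batches while a background thread loads the next ones."""
    q: queue.Queue = queue.Queue(maxsize=depth)

    def producer():
        for i in range(num_batches):
            q.put(load_batch(i))  # each call fans out into many random reads
        q.put(None)               # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    while (batch := q.get()) is not None:
        yield batch

# As long as load_batch(i) finishes before the consumer is done with the
# previous batch, storage latency is hidden and only throughput matters.
```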
The Ceph team has been working on Crimson for years to get past performance bottlenecks inherent to the HDD-based design. I'm having trouble finding any Ceph benchmark results that show anything close to 100 GB/s.
Ceph: 68 nodes, 2x100Gbps Mellanox and 10x 14TiB NVMe SSDs per node, 504 clients, 1TiB/s of FIO random read workload
I also assume that the batch size (block size) is different enough that this alone would make a big difference.
Ceph cluster achieves 1 TiB/s / 1.7 TiB/s ≈ 59% of theoretical throughput.
3FS cluster achieves 6.6 TiB/s / 9 TiB/s ≈ 73% of theoretical throughput.
The hardware in the Ceph test is only capable of max 1.7TiB/s traffic (optimally without any overhead whatsoever).
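Roughly where those ceilings come from, assuming the 3FS cluster is the 180-node, 2x200 Gbps InfiniBand setup described in the repo, and treating TB and TiB loosely as the figures above do:

```python
# Network-limited ceilings implied by node count x NIC bandwidth, ignoring overhead.
ceph_ceiling = 68 * 2 * 100 / 8 / 1000   # 68 nodes x 2x100 Gbps  -> ~1.7 TB/s
fs3_ceiling = 180 * 2 * 200 / 8 / 1000   # 180 nodes x 2x200 Gbps -> ~9.0 TB/s

print(1.0 / ceph_ceiling)  # ~0.59 -> Ceph benchmark at roughly 59% of its network ceiling
print(6.6 / fs3_ceiling)   # ~0.73 -> 3FS at roughly 73% of its ceiling
```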
EDIT: Their blog post answered all my questions and more. https://www.high-flyer.cn/blog/3fs/
The only competitors in the parallel FS space that are useful for this are Lustre and Weka.
Otherwise if you don't need a single namespace a bunch of fat AF NFSv4 servers w/NFS over RDMA will also get you to 6TiB/s.
The "surefire" way though is still Lustre, it's the big daddy of distributed parallel filesystems still but it's an absolute beast to setup and operate.
I agree that a lot of the "modern" storage stack is way too slow, though; I tried to find a replication-first object store for crazy-fast random reads across a small number of objects last year and found none.
Why this number? Because it's roughly the time it takes to read 64 bytes from L3 cache. And NICs tend to be able to push data into L3 (or equivalents).
Current state of the art: look up nanoPU, from Stanford. Wire-to-wire under 100ns is not impossible, but this would normally assume a pre-cooked packet, selected from a number of prepared packets (which is not an unusual scenario in HFT).
The thing is, this stuff is so prevalent there that in-house tech has reached the point of being competitive. That goes double for a quant firm like DeepSeek.
Can't wait to see what they release next. DeepSeek should be studied carefully.