In this article you can find the results of the artificial neural network (ANN) R&D we did alongside our commercial data science & machine learning projects. It contains benchmarks of several data storage methods. The main focus was I/O optimization, because nearly every project we do for our clients requires large amounts of data to be processed.
One of the key factors for effective work on neural networks is the time it takes to train them. The faster training is, the more test-evaluate iterations we can do. To address this, in our recent research we teamed up with Servodata, an official NVIDIA partner in Poland, to perform experiments on the DGX Station, one of the most efficient personal machines for deep learning training.
Here are the machine specs:
The biggest advantage of this setup is the set of 4x Tesla V100 cards connected via NVLink, which together provide roughly 500 teraflops of compute to work with.
During this research, we quickly found out that GPU performance was no longer the bottleneck. Instead, we should focus on either moving everything onto the GPU or limiting CPU usage as much as we can. That’s why we took a step back to think about how we could make some things in our code faster, and this is how the so-called DGX Research began at TEONITE.
The first part of the research was finding different ways to store data and seeing how they can shorten our learning time - you can find the results below.
As we are constantly searching for new technologies and ideas, we wanted to know what the best framework for such research would be.
And, as you surely know, NVIDIA is the leading company in the field of deep learning research, so it was an obvious choice to ask them for a recommendation.
We asked which Python framework NVIDIA would recommend as the best choice to effectively utilize DGX Station resources like NVLink.
Shortly after, we received the answer. Generally speaking, all frameworks can utilize NVLink through the NCCL backend; it’s also important for the library to be able to utilize Tensor Cores, since they give an additional performance boost. NVIDIA specified that Caffe/nvCaffe/Caffe2, PyTorch, and TensorFlow with Horovod support both features, while MXNet does not support Tensor Cores yet.
TensorFlow with Keras and PyTorch
The former (Keras) was amazingly simple and somehow magical: every step had a separate function to use, and I didn’t have to dig into anything to get a deep network going. This was great, and I would definitely recommend it to someone who is new and wants to create something that works. But it was not what I wanted, since I knew I would have to optimize how the network learns, and for that I would need to dig into the internal code and implementations to see what could be optimized. Since that involves digging through multiple levels of abstraction, I assumed it wouldn’t be easy. Of course, I could have used a low-level API, but I didn’t want to spend too much time in the docs solving the issues that would come up along the way. I needed something in between.
The latter (PyTorch) was not so simple at first glance, but it was very “pythonic”. It uses native Python structures like loops or the list interface, which I’m very familiar with. That’s why it was so encouraging to change something in the code, even during the tutorial. For example, the Dataset interface in PyTorch is exactly the same as the Python list interface, the Dataloader interface is the same as the iterator interface, there is no such thing as the “train and evaluate” step from TensorFlow, and even the backpropagation and gradient optimization have to be written manually. And yet, it was not so hard, since there are many good examples on the web which you can just copy and they work fine. Since it requires a lot of boilerplate to start, I wouldn’t recommend it if you just want something that works, but for my purpose it’s great.
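To illustrate the point about the list interface, the Dataset protocol PyTorch expects boils down to the same two methods a Python list exposes. A minimal sketch (plain Python, no torch import needed to show the idea; the class and sample names are mine):

```python
# A minimal sketch of the Dataset protocol: just __getitem__ and
# __len__, exactly like a Python list.
class CatsDogsDataset:
    def __init__(self, samples):
        # samples: list of (image, label) pairs, already loaded
        self.samples = samples

    def __getitem__(self, index):
        return self.samples[index]

    def __len__(self):
        return len(self.samples)

dataset = CatsDogsDataset([("img0", 0), ("img1", 1)])
print(len(dataset))  # 2
print(dataset[1])    # ('img1', 1)
```

Any object with these two methods can be handed to a PyTorch DataLoader, which is what makes custom dataset implementations so approachable.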
I was also considering the Caffe2 and MXNet frameworks. They both follow different concepts and also appear in many research papers and tutorials. I will surely come back to them in future research to see which one utilizes the hardware best and which is easiest to use.
In each framework I tried, I wanted to create a neural network that would solve the Dogs vs Cats problem, since it has a moderate data size, is easy to understand, and it’s quite funny too!
Starting the research
As the starting point for the optimizations, I took the code from here and stripped out the arg parsing to make it clearer.
The main metric we were concerned with was the time needed to load one batch of data. So, using the example code and a simple Python class for time measuring, I ran the network, which gave me the first results of its performance.
You can see that pretty much all the time needed to complete a batch is taken up by data loading.
I wondered - why is it so high? Why does it take 66 ms to load a single batch of data? Is the data loader implementation that slow? Or maybe it’s the dataset’s fault?
Surprisingly, when I removed the print (which produced the logs you could see above), the results changed drastically!
Why was the print taking so long? We can clearly see that data loading is not the cause of the 76 ms batch time - it averages only 1.02 ms per batch.
The funny thing is, I only realized the print had to be removed after I had implemented several different datasets and even my own custom loader!
The main problem was resolved, but I still had all the different implementations in place. Why not test them and make some benchmarks? Of course, there is a benchmark describing read/write times and sizes of different methods to store arrays, but I didn’t find any benchmark related to loading batches of data, and, ultimately, that’s what we’re interested in when writing neural networks.
Right now I have the following data storage methods to test:
- ImageFolder (built-in PyTorch implementation from torchvision.datasets)
- HDF5 (accessed via h5py)
- .npy files
- .pt files
Also, there are three different data loaders to test:
- built-in with multiple workers
- built-in with a single worker
- my custom single-threaded loader
Since our time for experiments on the DGX Station came to an end, we had to use our own deep learning machine for this benchmark. Its specs are listed below.
```
Class       Description
==========================
system      MS-7998
bus         Z170A SLI PLUS (MS-7998)
memory      64KiB BIOS
memory      128KiB L1 cache
memory      128KiB L1 cache
memory      1MiB L2 cache
memory      8MiB L3 cache
processor   Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
memory      16GiB DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
memory      16GiB DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
display     GP104 [GeForce GTX 1070]
disk        525GB Crucial_CT525MX3
disk        2TB WDC WD20EZRX-22D
```
In the next sections, you’ll find more info about each dataset and the custom loader, including the reasons these methods were chosen. If you’re only interested in the results, jump to the Results section.
The main goal in writing the different dataset implementations was to perform as many operations as possible (like transformations) before we even start learning. This way we save some time in the loop itself, and feeding the data is faster.
Of course, I’m aware that this also limits the variety of the data. Operations like RandomCrop or RandomHorizontalFlip make the data more varied, which trains the neural network better. However, operations like decoding the image or normalizing the data for a pre-trained network can be done once without any negative impact. This is something that could be optimized further.
The default behavior of the built-in dataset is to load the image and perform transformations every time a sample is fetched. That’s what we’d like to optimize.
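As an illustration of the idea (not the exact code from this research), here's a sketch contrasting the two strategies: the lazy variant pays the transform cost on every fetch, the eager one only once, up front. `decode_and_normalize` is a stand-in I made up for the real decode/resize/normalize pipeline:

```python
import numpy as np

def decode_and_normalize(raw):
    # stand-in for the real pipeline: decode JPEG, resize, normalize
    return (np.asarray(raw, dtype=np.float32) - 128.0) / 128.0

class LazyDataset:
    """Built-in style: transform on every __getitem__."""
    def __init__(self, raw_samples):
        self.raw = raw_samples

    def __getitem__(self, i):
        return decode_and_normalize(self.raw[i])  # work done per fetch

class EagerDataset:
    """Optimized style: transform everything once, before training."""
    def __init__(self, raw_samples):
        self.data = [decode_and_normalize(r) for r in raw_samples]

    def __getitem__(self, i):
        return self.data[i]  # no per-fetch work left

raw = [[0, 128, 255], [64, 64, 64]]
lazy, eager = LazyDataset(raw), EagerDataset(raw)
assert np.allclose(lazy[0], eager[0])  # same values, different cost profile
```

The trade-off is exactly the one described above: random augmentations cannot be precomputed this way, but deterministic steps can.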
As I started to search for information about how to optimize I/O in PyTorch, I found this answer. Since it had source code attached and was described as very effective, I gave it a try.
First, I wrote a simple notebook which created .hdf5 files from my data. When I checked it, it was working fine - values were correctly written and read, and it was pretty fast. Reading one transformed image array had the following timing:
```
CPU times: user 394 µs, sys: 16 µs, total: 410 µs
Wall time: 413 µs
```
There was one problem - when I used this in the neural network, errors occurred:

```
cuda runtime error (59) : device-side assert triggered at /.../THC/generic/THCStorage.c:32
```
What is that? After hours of digging, I found this thread, with one comment that gave me a clue:
> labels should not be -1, they should be within the range of [0, n_classes]
So let’s check the labels.
```
20184 Label: 0
18036 Label: 0
18180 Label: 4557061319748400051
18139 Label: 0
19968 Label: 0
21361 Label: 4574336876509936858
6011  Label: 4603346812213172903
16288 Label: 0
2402  Label: 0
21021 Label: 0
8688  Label: -4641372210688022970
16819 Label: -4639454095471887411
8812  Label: -4658475019568654485
17392 Label: 0
20605 Label: 4592043660878312063
2743  Label: -4638287290297841520
4968  Label: 0
2162  Label: -4653799677086183853
```
Why are those labels so large? Concurrency. I went back to the thread where I found hdf5 and saw this answer:
> It looks like HDF5 has some concurrency issues. My suggestion of using it is probably not appropriate when you use several workers. I often use one worker because my networks are computationally heavy and I’m not limited by the data iterator. Perhaps you should try other approaches like Zarr which have been designed to be thread-safe.
That’s where the next storage came in - Zarr.
But I didn’t stop there with HDF5. I wanted to check if I could make it thread-safe. As I read in the docs, I had to build HDF5 from source with the --enable-parallel --enable-shared flags.
After a couple of hours, I had a ready-to-use Docker image with HDF5 and h5py installed from source.
But the result was still the same - no matter which flags I used, I still got the same concurrency errors. If you happen to know how to make it work - please contact me.
That’s why I strongly advise against using HDF5 with more than one worker.
As I mentioned above, my first contact with Zarr was in the same thread where I found HDF5. Also, while researching the latter, I found a blog called “To HDF5 and beyond”, written by the Zarr creator, Alistair Miles. In it, he describes what he learned from the HDF5 and Bcolz libraries and how he combined these approaches into one library, eliminating their downsides at the same time.
One of the core elements of this library is chunks. The data is kept split, and a chunk is only loaded from disk when you actually need it. What I thought would be nice was to match the chunk size to the batch size, so that one chunk would cover exactly one batch of data.
That’s why the Zarr implementation was split into two approaches:
- The first just fetches random ids of the data, without caring much about chunks,
- The second randomizes at chunk level, selecting at random which chunk to load rather than a specific id. This required writing a whole Datastore implementation, where you have to pass the batch size first, and everything is then oriented around this number.
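The chunk-level approach can be sketched without Zarr itself: the Datastore-style loader draws a random chunk index and reads the whole chunk as one batch, so a single contiguous disk read serves the entire batch. A pure-Python sketch of the index math (the function name and structure are my assumptions, not the actual implementation):

```python
import random

def chunked_batches(n_samples, batch_size, seed=0):
    """Yield batches of sample indices by picking whole chunks at random;
    chunk size == batch size, so one chunk == one batch."""
    n_chunks = n_samples // batch_size  # drop the ragged tail for simplicity
    order = list(range(n_chunks))
    random.Random(seed).shuffle(order)  # randomize at chunk level, not id level
    for chunk_id in order:
        start = chunk_id * batch_size
        # in the real datastore, this slice is a single chunk read from disk
        yield list(range(start, start + batch_size))

batches = list(chunked_batches(n_samples=10, batch_size=2))
```

The cost of this is weaker shuffling: samples within a chunk always travel together, which is the price paid for one-read-per-batch I/O.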
.npy / .pt files
The last approach - just saving the transformed arrays to disk and loading them as needed - was the first thing that came to my mind when I began this research. It’s very easy to set up, easy to understand, with no magic in between.
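A minimal sketch of this approach (the class name and file layout are illustrative, not the benchmark's actual code): each transformed sample is dumped to its own .npy file at preprocessing time, and fetching a sample is nothing more than an np.load:

```python
import os
import tempfile
import numpy as np

class NpyFileDataset:
    """Each sample lives in its own .npy file; loading is just np.load."""
    def __init__(self, directory, labels):
        self.directory = directory
        self.labels = labels

    def __getitem__(self, i):
        array = np.load(os.path.join(self.directory, f"{i}.npy"))
        return array, self.labels[i]

    def __len__(self):
        return len(self.labels)

# preprocessing step: save already-transformed arrays once
tmp = tempfile.mkdtemp()
labels = [0, 1]
for i, label in enumerate(labels):
    np.save(os.path.join(tmp, f"{i}.npy"), np.full((3, 4, 4), float(i)))

dataset = NpyFileDataset(tmp, labels)
array, label = dataset[1]
```

The .pt variant is the same idea with torch.save/torch.load in place of the NumPy calls.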
At first, I didn’t want to do anything about the loaders themselves. But as I read [this article] and looked at the interface and implementation of the Dataloader, I thought I could make something simpler, without any sampler or data iterator that could potentially slow down the data loading.
To give the fairest comparison, as my custom loader operates on a single thread, I used the built-in loader in two ways - one with 4 workers and one with only 1, so that it would be on the same concurrency level as my custom one.
The Zarr Datastore implementation mentioned earlier also needed its own Dataloader to utilize the chunk mechanism.
Combining it all together, there are 6 different dataset implementations and 4 distinct loaders. Given that the Zarr Datastore has its own loader and cannot be used with the built-in one, we have a total of 5 datasets to test against 3 loaders, plus this datastore with its own loader.
Time measurement code is pretty simple:
```python
import time

def measure_time(loader):
    own_data_time = AverageMeter()
    end = time.time()
    for index, (target, label) in enumerate(loader):
        own_data_time.update(time.time() - end)
        end = time.time()
    return own_data_time
```
All it does is measure, in a loop, how long each batch takes to load.
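AverageMeter isn't defined in the snippet; it's the small running-average helper that appears throughout the official PyTorch examples. A sketch of what measure_time assumes it looks like:

```python
class AverageMeter:
    """Tracks the most recent value and the running average."""
    def __init__(self):
        self.val = 0.0    # last value seen
        self.sum = 0.0    # running total
        self.count = 0    # number of updates
        self.avg = 0.0    # running average

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

meter = AverageMeter()
for t in (0.5, 1.5):
    meter.update(t)
print(meter.avg)  # 1.0
```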
The average time of loading a single batch is shown below.
As you can see, both the best and the worst implementation belong to Zarr, and there’s a huge difference in time when using chunks (over 200x faster!).
My custom loader performed pretty much the same as the built-in one with a single worker, and since the latter offers many more configuration options, it is the better choice.
What is interesting is that the chunked Zarr implementation on a single worker is much faster than 4 workers reading simultaneously from plain tensor files or the built-in dataset. However, it’s the most complex implementation I wrote during this research, and it may be hard to maintain or modify.
There is one more thing we need to take into consideration - the actual time to load the data while the neural network is learning. This time is used by the built-in data loader to pre-fetch the data.
That’s why, in the results above, the built-in concurrent loader with the built-in dataset took 17 ms per batch to load, whereas, as mentioned in the “Starting the research” section, during training itself it took only 1.02 ms per batch.
I updated the time-measuring code so that it still measures batch loading time but also trains a new neural network with the data, giving the data loader more time for pre-fetching.
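In outline, the updated measurement interleaves a training step between fetches, so a prefetching loader can fill its queue while the model is busy. A simplified sketch (the function names are mine, and time.sleep stands in for the forward/backward pass):

```python
import time

def measure_time_with_training(loader, train_step):
    """Measure only the time spent blocked on the loader;
    train_step runs between fetches, giving workers time to prefetch."""
    load_times = []
    end = time.time()
    for batch in loader:
        load_times.append(time.time() - end)  # time waiting on the loader
        train_step(batch)                     # loader can prefetch meanwhile
        end = time.time()
    return sum(load_times) / len(load_times)

# toy usage: an in-memory "loader" and a training step that just burns time
avg = measure_time_with_training(
    loader=[1, 2, 3],
    train_step=lambda batch: time.sleep(0.001),
)
```

With a real concurrent loader, the measured wait per batch drops because fetching overlaps with computation, which is exactly the effect described above.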
As you can see, the best result is from HDF5 - 0.84 ms per batch. Compared to the original implementation, which this time took 1.1 ms per batch, it’s 30% faster.
First of all, I found digging into the PyTorch interfaces and trying to understand how everything works under the hood very entertaining. Everything is very “pythonic”, there are no magic functions, and everything is well documented and/or described on the community forum.
I’m also aware that there are other great formats to store data (like LMDB, H2, LevelDB, etc.), so if you’d like to implement your own PyTorch dataset and see how it performs compared to the others, feel free to contribute and share your results. On my end, I’ll update this article as new results come up. The code used for this benchmark, including the dataset implementations, time-measuring functions, and Dockerfiles for setting up the environment, will be shared as open source.
Also, if you have any clues or ideas on how to make this benchmark better, submit an issue on GitHub or post a comment here - maybe it will inspire others to make a better benchmark or implementation!