Two weeks ago I posted an article about improving fwrite's performance. fwrite is normally used to store binary data in some custom pre-defined format. But we often don't need or want to use such low-level functions. Matlab's built-in save function is an easy and very convenient way to store data in both binary and text formats, and this data can later be loaded back into Matlab using the load function. Today's article presents some little-known tricks for improving save's performance.
MAT is Matlab's default data format for the save function. The format is publicly documented, and adaptors are available for other programming languages (C, C#, Java). Matlab 6 and earlier did not employ automatic data compression; Matlab versions 7.0 (R14) through 7.2 (R2006a) use GZIP compression; Matlab 7.3 (R2006b) and newer can use an HDF5-based format, which apparently also uses GZIP (level-3) compression, although MathWorks might have done better to pay the license cost of employing SZIP (thanks to Malcolm Lidierth for the clarification). Note that Matlab's 7.3 format is not a pure HDF5 file, but rather an HDF5 variant that uses an undocumented internal format.
The following table summarizes the available options for saving data using the save function:
| save option | Available since | Data format   | Compression | Major functionality              |
|-------------|-----------------|---------------|-------------|----------------------------------|
| -v7.3       | R2006b (7.3)    | Binary (HDF5) | GZIP        | >2GB files, class objects        |
| -v7         | R14 (7.0)       | Binary (MAT)  | GZIP        | Compression, Unicode             |
| -v6         | R8 (5.0)        | Binary (MAT)  | None        | N-D arrays, cell arrays, structs |
| -v4         | All releases    | Binary (MAT)  | None        | 2D data                          |
| -ascii      | All releases    | Text          | None        | Tab/space delimited              |
HDF5 uses a generic format to store data of any conceivable type, and incurs a non-negligible storage overhead in order to describe the file's contents. Moreover, Matlab's HDF5 implementation does not compress non-numeric data (struct and cell arrays) by default. For these reasons, HDF5 files are typically larger and slower than non-HDF5 MAT files, especially if the data contains cell arrays or structs. This holds true both for pure HDF files (saved via the hdf and hdf5 sets of functions, for the HDF4 and HDF5 formats respectively) and for v7.3-format MAT files.
Perhaps for this reason, the default preference is for save to use -v7, even on new releases that support -v7.3. This preference can be changed in Matlab's Preferences/General window, or we can always specify the -v7/-v7.3 switch directly in the save call:
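For example, assuming variables A and B exist in the workspace (the file name here is illustrative):

```matlab
save('results.mat', 'A', 'B', '-v7');    % compressed binary MAT format
save('results.mat', 'A', 'B', '-v7.3');  % HDF5-based format
```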
Over the years, MathWorks has fixed several inefficiencies when reading HDF5 files (ref1, ref2). Some of these fixes include patches for older releases, and readers are advised to download and install the appropriate patches if you do not use the latest Matlab release (currently R2013a). There are still a couple of open bugs regarding HDF5 performance and compression that may affect save.
One might think that due to the generic descriptive file header and the increased I/O, as well as the open bug, the -v7.3 (HDF5) format would always be slower than the -v7 (MAT) format in save and load. This is indeed often the case, but not always:
```matlab
A = randi(20,1000,1200,40,'int32');  % 48M int32s  => 184 MB
B = randn(500,1000,20);              % 10M doubles =>  78 MB
ops.algo = 'test';                   % non-numeric data

tic, save('test1.mat','-v7','ops','A','B'); toc
% => Elapsed time is 11.940455 seconds.   (file size: 114 MB)

tic, save('test2.mat','-v7.3','ops','A','B'); toc
% => Elapsed time is 6.963135 seconds.    (file size: 116 MB)
```
In this case, the HDF5 (-v7.3) format was almost twice as fast as the MAT (-v7) format, despite the nearly identical file sizes. This example shows that we need to check our specific application's data files on a case-by-case basis: for some files -v7 may be better, while for others -v7.3 is best. The widely-accepted conventional wisdom of only using the new -v7.3 format for enormous (>2GB) files is therefore misleading. In general, if the data contains many non-numeric elements, the resulting -v7.3 HDF5 file is much larger and slower than the -v7 MAT file, whereas if the data is mostly numeric, -v7.3 tends to be faster and comparable in size.
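A minimal benchmarking sketch for such a case-by-case check, using the same variables as above (the file names and format list are illustrative; run it on your own data):

```matlab
% Time each save format on the same variables and report the resulting file sizes
formats = {'-v6', '-v7', '-v7.3'};
for k = 1:numel(formats)
    fname = sprintf('bench_%d.mat', k);        % illustrative file name
    tic
    save(fname, 'ops', 'A', 'B', formats{k});  % same variables as above
    elapsed = toc;
    info = dir(fname);
    fprintf('%6s : %6.2f secs, %4d MB\n', formats{k}, elapsed, round(info.bytes/1e6));
end
```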
Surprisingly, we can often sacrifice compression and (paradoxically) achieve better performance for both save and load, at the expense of a much larger file. This is done by saving the numeric data in uncompressed HDF5 format, using the savefast utility on the Matlab File Exchange, which has the same syntax as save:
```matlab
tic, savefast('test3.mat','ops','A','B'); toc
% => Elapsed time is 3.164903 seconds.    (file size: 259 MB)
```
Even better performance, with a similar or somewhat smaller file size, can be achieved with save's uncompressed -v6 format. The -v6 format is consistently faster than both -v7 and -v7.3, at the expense of a larger file. However, save -v6 cannot store Unicode characters or class objects, and is limited to files under 2GB.
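For instance, on the same test data as above (timing omitted here since it is machine-dependent; see the table below for representative numbers):

```matlab
tic, save('test4.mat','ops','A','B','-v6'); toc  % uncompressed MAT: fastest save, larger file
```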
Another lesson here is that depending on the relative size of the numeric and non-numeric data being saved, a different data format may be advisable. As the application evolves and the saved data's size and mixture change, we might need to revisit the format decision. Here is a summary measured on one specific computer, using the numeric variable A (184 MB) from above, together with a cell array of varying size:
```matlab
B = num2cell(randn(1,dataSize));  % dataSize = 1e3, 1e4, 1e5, 1e6
```
| Numeric data | Non-numeric data | save -v7.3        | save -v7         | save -v6         | savefast          |
|--------------|------------------|-------------------|------------------|------------------|-------------------|
| 184 MB       | 0.114 MB         | 3.8 secs, 43 MB   | 9.3 secs, 40 MB  | 1.6 secs, 183 MB | 2.1 secs, 183 MB  |
| 184 MB       | 1.14 MB          | 4.3 secs, 46 MB   | 9.5 secs, 40 MB  | 1.6 secs, 184 MB | 2.1 secs, 186 MB  |
| 184 MB       | 11.4 MB          | 12.6 secs, 78 MB  | 9.9 secs, 41 MB  | 2.7 secs, 189 MB | 11.1 secs, 219 MB |
| 184 MB       | 114 MB           | 87.5 secs, 402 MB | 13.9 secs, 50 MB | 5.8 secs, 244 MB | 85.2 secs, 544 MB |
As noted above, and as can clearly be seen in the table, compression is not applied to non-numeric data by the save -v7.3 (HDF5) option or by savefast. However, we can implement our own save variant that does compress such data, by using low-level HDF5 primitives in m-code or MEX C-code.
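Here is a minimal sketch of such a variant (my own assumption, not a tested implementation): serialize the data into uint8 bytes using a byte-serializer such as the hlp_serialize utility discussed below, then write them as a chunked, deflated dataset via Matlab's low-level H5* wrapper functions:

```matlab
% Sketch: store arbitrary Matlab data compressed, via low-level HDF5 calls
bytes = hlp_serialize(ops);                  % byte-serializer (see below); returns uint8
fid   = H5F.create('test_custom.h5','H5F_ACC_TRUNC','H5P_DEFAULT','H5P_DEFAULT');
space = H5S.create_simple(1, numel(bytes), numel(bytes));  % 1-D dataspace
dcpl  = H5P.create('H5P_DATASET_CREATE');
H5P.set_chunk(dcpl, min(numel(bytes), 1024^2));  % chunking is required for compression
H5P.set_deflate(dcpl, 3);                        % GZIP level 3, like save -v7.3
dset  = H5D.create(fid, '/ops_ser', 'H5T_NATIVE_UINT8', space, dcpl);
H5D.write(dset, 'H5ML_DEFAULT', 'H5S_ALL', 'H5S_ALL', 'H5P_DEFAULT', bytes);
H5D.close(dset);  H5P.close(dcpl);  H5S.close(space);  H5F.close(fid);
```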
A general conclusion that can be drawn from all this is that in the specific case of save, the additional time spent on compression is often NOT offset by the reduced I/O. So the general rule is to use -v6 whenever possible.
Although save -v6 does not compress its data, it does store data more compactly than in Matlab memory. So, while our test set's cell array occupied 114 MB in Matlab memory, on disk it was stored in only 55 MB (= 244 MB - 189 MB in the table above).
The performance of saving non-numeric data can be dramatically improved (and the file size correspondingly reduced) by manually serializing the data into a stream of uint8 bytes, which can be saved very compactly. When loading the files, we simply deserialize the loaded data. I recommend Christian Kothe's excellent Fast serialize/deserialize utility on the Matlab File Exchange. The gain in performance and file size when saving serialized data is absolutely amazing (especially for data types that -v6 cannot save), for all the save alternatives:
```matlab
B = num2cell(randn(1,1e6));  % 1M-element cell array, 114 MB in Matlab memory
B_ser = hlp_serialize(B);    % serialized uint8 byte vector, 7.6 MB
```
| Saved variable | Matlab memory | save -v7.3        | save -v7          | save -v6         | savefast          |
|----------------|---------------|-------------------|-------------------|------------------|-------------------|
| B              | 114 MB        | 83 secs, 361 MB   | 4.5 secs, 9.2 MB  | 3.5 secs, 61 MB  | 83 secs, 361 MB   |
| B_ser          | 7.6 MB        | 1.21 secs, 7.4 MB | 1.17 secs, 7.4 MB | 0.93 secs, 7.6 MB | 0.94 secs, 7.6 MB |
Serializing data in this manner enables save -v6 to be used even for Unicode and class objects (which would otherwise require -v7), as well as for huge data (which would otherwise require >2GB files and the -v7.3 format). One user has reported that the run-time for saving a 2.5GB cell array of structs dropped from hours to a single minute using serialization (and that was with a non-optimized serializer, not Christian's faster utility). In addition to the performance benefits, saving class objects in this manner avoids a memory leak bug that occurs when saving objects to MAT files on Matlab releases R2011b through R2012b (7.13-8.0).
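The full round-trip looks like this (a sketch using hlp_serialize's counterpart hlp_deserialize from the same File Exchange utility):

```matlab
B_ser = hlp_serialize(B);          % any Matlab value => uint8 byte vector
save('data.mat', 'B_ser', '-v6');  % fast and compact, even for cells/objects

% ... later, possibly in another Matlab session:
S  = load('data.mat');
B2 = hlp_deserialize(S.B_ser);     % reconstruct the original value
assert(isequal(B, B2));
```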
When the data is purely numeric, we can also use hdf5write, or h5create + h5write, in addition to save and savefast (see related). Note that hdf5write will be phased out in a future Matlab release; MathWorks advises using h5create + h5write instead. Here are the corresponding results for 184 MB of numeric data, on a standard 5400 RPM hard disk and on an SSD:
|                  | hdf5write | h5create + h5write (Deflate=0) | h5create + h5write (Deflate=1) | save -v7.3 | save -v7  | save -v6 | savefast |
|------------------|-----------|--------------------------------|--------------------------------|------------|-----------|----------|----------|
| File size        | 183 MB    | 366 MB                         | 55 MB                          | 42 MB      | 40 MB     | 183 MB   | 183 MB   |
| Time (hard disk) | 4.4 secs  | 14.3 secs                      | 7.4 secs                       | 6.1 secs   | 10.8 secs | 4.2 secs | 4.3 secs |
| Time (SSD)       | 2.1 secs  | 0.2 secs                       | 4.5 secs                       | 4.1 secs   | 9.5 secs  | 1.5 secs | 1.6 secs |
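For reference, a sketch of the h5create + h5write calls behind these numbers ('ChunkSize' is mandatory whenever 'Deflate' is specified; the chunk value here is an arbitrary illustrative choice). Incidentally, the 366 MB Deflate=0 file, double the raw 183 MB, is likely due to h5create defaulting to a double datatype; explicitly specifying int32, as below, should avoid the doubling:

```matlab
% Uncompressed HDF5 dataset (equivalent to Deflate=0)
h5create('test_plain.h5', '/A', size(A), 'Datatype', 'int32');
h5write('test_plain.h5', '/A', A);

% GZIP-compressed dataset; chunked storage is required for compression
h5create('test_gzip.h5', '/A', size(A), 'Datatype', 'int32', ...
         'ChunkSize', [1000 1200 1], 'Deflate', 1);  % chunk choice is illustrative
h5write('test_gzip.h5', '/A', A);
```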
As noted, Matlab's HDF5 implementation is generally suboptimal. Better performance can be achieved by using the low-level HDF5 functions, rather than the high-level hdf5write and hdf5read functions.
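For example, the compressed byte-stream written in the low-level sketch above could be read back as follows (again a sketch, assuming the same hypothetical /ops_ser dataset):

```matlab
fid   = H5F.open('test_custom.h5', 'H5F_ACC_RDONLY', 'H5P_DEFAULT');
dset  = H5D.open(fid, '/ops_ser');
bytes = H5D.read(dset);             % GZIP decompression happens transparently
H5D.close(dset);  H5F.close(fid);
ops2  = hlp_deserialize(bytes(:));  % ensure a uint8 column vector for the deserializer
```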
In addition to HDF5, Matlab also supports the HDF4 standard, using a separate set of built-in hdf functions. Despite their common name and origin, HDF4 and HDF5 are incompatible: they use different data formats, and employ different designs, APIs and Matlab access functions. HDF4 is generally much slower than HDF5.
While save's -v7.3 format is significantly slower than the alternatives for storing entire data elements, one specific case in which -v7.3 should indeed be considered is when we need to update or load just a small part of the data, on R2011b or newer. This can save a lot of I/O, especially for large MAT files where only a small part is updated or loaded.
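A sketch of such partial access, using the matfile interface introduced in R2011b (it works only on -v7.3 files, such as test2.mat saved above):

```matlab
m = matfile('test2.mat', 'Writable', true);  % the -v7.3 file saved earlier
firstPage = m.A(:, :, 1);                    % reads only this slice from disk
m.A(1, 1, 1) = int32(-1);                    % updates a single element in place
```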
Great post Yair. I myself had recently discovered that -v7.3 wasn’t being my friend when compared to -v7 in one of my simulations. Large-data save operations consume 100% of a single CPU, and large imagesc() figures take *minutes* to save. I don’t need random access to .mat files, so I have moved back to -v7 to save all my data and figures.
It seems to me that the time devoted to compression could be reduced if MATLAB used multithreaded compression routines, since the compression is CPU-bound rather than I/O-bound. On Ubuntu I have started to use the pigz (parallel gzip) and pbzip2 (parallel bzip2) utilities, which are happy to utilize all of my CPUs when compressing data. Deliberate parallelization into N jobs doesn't always yield an N-fold speedup, but in this case I think you could get pretty close.
Serialization functions are a great find. Thank you for sharing!