Undocumented Matlab
  • SERVICES
    • Consulting
    • Development
    • Training
    • Gallery
    • Testimonials
  • PRODUCTS
    • IQML: IQFeed-Matlab connector
    • IB-Matlab: InteractiveBrokers-Matlab connector
    • EODML: EODHistoricalData-Matlab connector
    • Webinars
  • BOOKS
    • Secrets of MATLAB-Java Programming
    • Accelerating MATLAB Performance
    • MATLAB Succinctly
  • ARTICLES
  • ABOUT
    • Policies
  • CONTACT
  • SERVICES
    • Consulting
    • Development
    • Training
    • Gallery
    • Testimonials
  • PRODUCTS
    • IQML: IQFeed-Matlab connector
    • IB-Matlab: InteractiveBrokers-Matlab connector
    • EODML: EODHistoricalData-Matlab connector
    • Webinars
  • BOOKS
    • Secrets of MATLAB-Java Programming
    • Accelerating MATLAB Performance
    • MATLAB Succinctly
  • ARTICLES
  • ABOUT
    • Policies
  • CONTACT

Improving save performance

May 8, 2013 19 Comments

Two weeks ago I posted an article about improving fwrite‘s performance. fwrite is normally used to store binary data in some custom pre-defined format. But we often don’t need or want to use such low-level functions. Matlab’s built-in save function is an easy and very convenient way to store data in both binary and text formats. This data can later be loaded back into Matlab using the load function. Today’s article will show little-known tricks of improving save‘s performance.
MAT is Matlab’s default data format for the save function. This format is publicly available and adaptors are available for other programming languages (C, C#, Java). Matlab 6 and earlier did not employ automatic data compression; Matlab versions 7.0 (R14) through 7.2 (R2006a) use GZIP compression; Matlab 7.3 (R2006b) and newer can use an HDF5-variant format, which apparently also uses GZIP (level-3) compression, although MathWorks might have done better to pay the license cost of employing SZIP (thanks to Malcolm Lidierth for the clarification). Note that Matlab’s 7.3 format is not a pure HDF5 file, but rather a HDF5 variant that uses an undocumented internal format.
The following table summarizes the available options for saving data using the save function:

save option Available since Data format Compression Major functionality
-v7.3 R2006b (7.3) Binary (HDF5) GZIP 2GB files, class objects
-v7 R14 (7.0) Binary (MAT) GZIP Compression, Unicode
-v6 R8 (5.0) Binary (MAT) None N-D arrays, cell arrays, structs
-v4 All releases Binary (MAT) None 2D data
-ascii All releases Text None Tab/space delimited

HDF5 uses a generic format to store data of any conceivable type, and has a non-significant storage overhead in order to describe the file’s contents. Moreover, Matlab’s HDF5 implementation does not by default compress non-numeric data (struct and cell arrays). For this reason, HDF5 files are typically larger and slower than non-HDF5 MAT files, especially if the data contains cell arrays or structs. This holds true for both pure-HDF files (saved via the hdf and hdf5 set of functions, for HDF4 and HDF5 formats respectively), and v7.3-format MAT files.
Perhaps for this reason the default preference is for save to use –v7, even on new releases that support –v7.3. This preference can be changed in Matlab’s Preferences/General window (or we could always specify the –v7/-v7.3 switch directly when using save):

Matlab's preferences for saving binary data
Matlab's preferences for saving binary data


Over the years, MathWorks has fixed several inefficiencies when reading HDF5 files (ref1, ref2). Some of these fixes include patches for older releases, and readers are advised to download and install the appropriate patches if you do not use the latest Matlab release (currently R2013a). There are still a couple of open bugs regarding HDF5 performance and compression that may affect save.
One might think that due to the generic descriptive file header and the increased I/O, as well as the open bug, the -v7.3 (HDF5) format would always be slower than –v7 (MAT) format in save and load. This is indeed often the case, but not always:

A = randi(20,1000,1200,40,'int32');  % 48M int32s => 184 MB
B = randn(500,1000,20);              % 80M doubles => 78 MB
ops.algo = 'test';                   % non-numeric
tic, save('test1.mat','-v7','ops','A','B'); toc
% => Elapsed time is 11.940455 seconds.   % file size: 114 MB
tic, save('test2.mat','-v7.3','ops','A','B'); toc
% => Elapsed time is 6.963135 seconds.    % file size: 116 MB

A = randi(20,1000,1200,40,'int32'); % 48M int32s => 184 MB B = randn(500,1000,20); % 80M doubles => 78 MB ops.algo = 'test'; % non-numeric tic, save('test1.mat','-v7','ops','A','B'); toc % => Elapsed time is 11.940455 seconds. % file size: 114 MB tic, save('test2.mat','-v7.3','ops','A','B'); toc % => Elapsed time is 6.963135 seconds. % file size: 116 MB

In this case, the HDF5 format was much faster than MAT, offsetting the benefits of the MAT’s reduced I/O. This example shows that we need to check our specific application’s data files on a case-by-case basis. For some files –v7 may be better; for others –v7.3 is best. The widely-accepted conventional wisdom of only using the new –v7.3 format for enormous (>2GB) files is inappropriate. In general, if the data contains many non-numeric elements, the resulting –v7.3 HDF5 file is much larger and slower than the –v7 MAT file, while if the data is mostly numeric, then –v7.3 would be faster and comparable in size.
Surprisingly, we can often sacrifice compression to (paradoxically) achieve better performance, for both save and load, at the expense of much larger file size. This is done by saving the numeric data in uncompressed HDF5 format, using the savefast utility on the Matlab File Exchange, using the same syntax as save:

tic, savefast('test3.mat','ops','A','B'); toc
% => Elapsed time is 3.164903 seconds.   % file size: 259 MB

tic, savefast('test3.mat','ops','A','B'); toc % => Elapsed time is 3.164903 seconds. % file size: 259 MB

Even better performance, and similar or somewhat lower file size, can be achieved by using save’s uncompressed format –v6. The –v6 format is consistently faster than both –v7 and –v7.3, at the expense of a larger file size. However, save –v6 cannot save Unicode and class objects, and is limited to <2GB file sizes.
Another lesson here is that depending on the relative size of the numeric and non-numeric data being saved, different data format may be advisable. As the application evolves and the saved data’s size and mixture change, we might need to revisit the format decision. Here is a summary on one specific computer, using the numeric variable A (184 MB) above, together with a cell-array of varying size:

B = num2cell(randn(1,dataSize));  % dataSize = 1e3, 1e4, 1e5, 1e6

B = num2cell(randn(1,dataSize)); % dataSize = 1e3, 1e4, 1e5, 1e6

Numeric data Non-numeric data save -v7.3 save -v7 save -v6 savefast
184 MB 0.114 MB 3.8 secs, 43 MB 9.3 secs, 40 MB 1.6 secs, 183 MB 2.1 secs, 183 MB
184 MB 1.14 MB 4.3 secs, 46 MB 9.5 secs, 40 MB 1.6 secs, 184 MB 2.1 secs, 186 MB
184 MB 11.4 MB 12.6 secs, 78 MB 9.9 secs, 41 MB 2.7 secs, 189 MB 11.1 secs, 219 MB
184 MB 114 MB 87.5 secs, 402 MB 13.9 secs, 50 MB 5.8 secs, 244 MB 85.2 secs, 544 MB

As noted, and as can be clearly seen in the table, compression is not enabled for non-numeric data in the save –v7.3 (HDF5) option and savefast. However, we can implement our own save variant that does compress, by using low-level HDF5 primitives in m-code or mex c-code.
A general conclusion that can be drawn from all this is that in the specific case of save, the additional time for compression is often NOT offset by the reduced I/O. So the general rule is to use –v6 whenever possible.
Although save –v6 does not compress its data, it does store data in a more compact manner than Matlab memory. So, while our test set’s cell array held 114 MB of Matlab memory, on disk it was only stored within 55 MB (= 244 MB – 189 MB).
The performance of saving non-numeric data can be dramatically improved (and the file size reduced correspondingly) by manually serializing the data into a series of uint8 bytes that can easily be saved very compactly. When loading the files, we would simply deserialize the loaded data. I recommend using Christian Kothe’s excellent Fast serialize/deserialize utility on Matlab’s File Exchange. The huge gain in performance and file size when using serialized data is absolutely amazing (esp. for data types that -v6 cannot save), for all the save alternatives:

B = num2cell(randn(1,1e6));  % 1M cell array, 114 MB in Matlab memory
B_ser = hlp_serialize(B);

B = num2cell(randn(1,1e6)); % 1M cell array, 114 MB in Matlab memory B_ser = hlp_serialize(B);

Saved variable Matlab memory save -v7.3 save -v7 save -v6 savefast
B 114 MB 83 secs, 361 MB 4.5 secs, 9.2 MB 3.5 secs, 61 MB 83 secs, 361 MB
B_ser 7.6 MB 1.21 secs, 7.4 MB 1.17 secs, 7.4 MB 0.93 secs, 7.6 MB 0.94 secs, 7.6 MB

Serializing data in this manner enables save –v6 to be used even for Unicode and class objects (that would otherwise require –v7), as well as huge data (that would otherwise require >2GB files, and usage of –v7.3). One user has reported that the run-time for saving a 2.5GB cell-array of structs was reduced from hours to a single minute using serialization (despite the fact that he was using a non-optimized serialization, not Christian’s faster utility). In addition to the performance benefits, saving class objects in this manner avoids a memory leak bug that occurs when saving objects to MAT files on Matlab releases R2011b-R2012b (7.13-8.0).
When the data is purely numeric, we could use hdf5write or h5create + h5write, in addition to save and savefast (see related). Note that hdf5write will be phased out in a future Matlab release; MathWorks advises to use h5create + h5write instead. Here are the corresponding results for 184 MB of numeric data on a standard 5400 RPM hard disk and an SSD:

hdfwrite h5create + h5write (Deflate=0) h5create + h5write (Deflate=1) save -v7.3 save -v7 save -v6 savefast
File size 183 MB 366 MB 55 MB 42 MB 40 MB 183 MB 183 MB
Time (hard disk) 4.4 secs 14.3 secs 7.4 secs 6.1 secs 10.8 secs 4.2 secs 4.3 secs
Time (SSD) 2.1 secs 0.2 secs 4.5 secs 4.1 secs 9.5 secs 1.5 secs 1.6 secs

As noted, Matlab’s HDF5 implementation is generally suboptimal. Better performance can be achieved by using the low-level HDF5 functions, rather than the high-level hdfwrite, hdf5read functions.
In addition to HDF5, Matlab also supports the HDF4 standard, using a separate set of built-in hdf functions. Despite their common name and origin, HDF4 and HDF5 are incompatible; use different data formats; and employ different designs, APIs and Matlab access functions. HDF4 is generally much slower than HDF5.
While save’s –v7.3 format is significantly slower than the alternatives for storing entire data elements, one specific case in which –v7.3 should indeed be considered is when we need to update or load just a small part of the data, on R2011b or newer. This could potentially save a lot of I/O, especially for large MAT files where only a small part is updated or loaded.

Matlab Computational Finance Conference - 23 May 2013

Related posts:

  1. Improving fwrite performance – Standard file writing performance can be improved in Matlab in surprising ways. ...
  2. Improving Simulink performance – Simulink simulation run-time performance can be improved by orders of magnitude by following some simple steps. ...
  3. MLintFailureFiles or: Why can't I save my m-file?! – Sometimes Matlab gets into a state where it cannot use a valid m-file. This article explains what can be done. ...
  4. Improving graphics interactivity – Matlab R2018b added default axes mouse interactivity at the expense of performance. Luckily, we can speed-up the default axes. ...
  5. rmfield performance – The performance of the builtin rmfield function (as with many other builtin functions) can be improved by simple profiling. ...
  6. Some Matlab performance-tuning tips – Matlab can be made to run much faster using some simple optimization techniques. ...
Performance Pure Matlab
Print Print
« Previous
Next »
19 Responses
  1. Ben May 8, 2013 at 13:30 Reply

    Great post Yair. I myself had recently discovered that -v7.3 wasn’t being my friend when compared to -v7 in one of my simulations. Large-data save operations consume 100% of a single CPU, and large imagesc() figures take *minutes* to save. I don’t need random access to .mat files, so I have moved back to -v7 to save all my data and figures.

    It seems to me that the time devoted to compression could be reduced if MATLAB used multithreaded compression routines, where the compression would be CPU-bound rather than IO-bound. In Ubuntu I have started to use the pigz (parallel gzip) and pbzip2 (parallel bzip2) utilities, which are happy to utilize all of my CPUs when compressing data. Deliberate parallelization into N jobs doesn’t always yield a speedup of a factor of N, but in this case I think you could get pretty close.

  2. oro77 May 8, 2013 at 18:21 Reply

    Serialization functions are a great find. Thank you for sharing !

    • Amro May 9, 2013 at 22:11 Reply

      There are undocumented functions in the MEX-API to serialize/deserialize mxArray’s:

      http://stackoverflow.com/a/6261884/97160
      https://github.com/kyamagu/matlab-serialization/

    • Martin May 10, 2013 at 02:54 Reply

      For those not wishing to deal with MEX files there are also two undocumented MATLAB functions which are apparently nothing more than a thin wrapper around mxSerialize and mxDeserialize. They are:

        uint8_vector = getByteStreamFromArray(any_variable);
        any_variable = getArrayFromByteStream(uint8_vector);

      uint8_vector = getByteStreamFromArray(any_variable); any_variable = getArrayFromByteStream(uint8_vector);

      The binary format seems to be 32 bit, so the largest variable I could successfully serialize and the deserialize was ~2 GiB in size (minus some overhead due to headers). But that’s probably sufficient for a lot of applications. The functions are available since MATLAB 2010b and it seems the binary format hasn’t changed since then, so they are pretty reliable/portable despite being undocumented. (I only tested using 64-bit MATLAB on OS X and Linux).

      • Yair Altman May 10, 2013 at 03:12

        @Martin – Nice find! Thanks for reporting.

        I see getByteStreamFromArray used in a single MATLAB m-file on R2010b (%matlabroot%/toolbox/matlab/scribe/private/getCopyStructureFromObject.m). It is no longer used (although it is still available) in R2013a. Likewise, getArrayFromByteStream is used on R2010b only in %matlabroot%/toolbox/matlab/scribe/private/getObjectFromCopyStructure.m, and is not used (although still available) in R2013a.

        Now that’s what I call a deeply-hidden feature!
        +5 if I could 🙂

  3. Martin May 10, 2013 at 01:28 Reply

    While the HDF5-based v7.3 file format may have some disadvantages, it has one huge advantage for me: It can save everything in my workspace as long as there is enough disk space. You start to appreciate this once you spend hours/days to compute a multi-gigabyte result to later find out that it wasn’t saved because it was too big for the v7 file format and MATLAB stupidly does not consider this to be a fatal condition (I would have expected to get an exception/error) and instead happily continues. That’s why v7.3 is my default and I can happily wait a few more seconds/minutes on save.

    • Yair Altman May 10, 2013 at 01:35 Reply

      @Martin – naturally, but what I tried to show in the article is that you can have the best of both worlds, by using savefast and serialization.

  4. Greg May 13, 2013 at 04:42 Reply

    Saving Matlab O-O objects in the -v7.3 format has a bug in R2012a and R2012b (supposedly fixed in R2013a) which causes a memory leak when the objects are loaded back into Matlab. Not only do the normal issues with a memory leak apply, but there is also an issue that when you “clear” that variable, Matlab knows that all objects of that class have not been cleared, and will not pick up on modifications to the class’s definition, instead issuing the “Warning: Objects of ‘MyClass’ class exist” warning that everyone who does any O-O programming in Matlab is very familiar with.

    This issue is an official bug report on the Mathworks website: http://www.mathworks.com/support/bugreports/857319

    • Yair Altman May 13, 2013 at 04:49 Reply

      @Greg – thanks. Yet another reason to avoid -v7.3 …

  5. How to store large datasets? | Matlabtips.com May 15, 2013 at 05:02 Reply

    […] If you need to optimizing read/write speed when dealing with MAT/HDF5 files, I encourage you to read Yair’s excellent post on that topic. […]

  6. Robin June 17, 2013 at 09:41 Reply

    Are you sure Matlab doesn’t compress structures/cells saved in v7.3?

    I have exactly the opposite problem that it is not able to disable compression, which in v7.3 provides almost no space benefit for my data at the cost of extremely long save and load times.

    When I save structures/cell arrays compression is still used. I think compression kicks in for any numeric object which is greater than 10000 bytes (so if you have large structure or cell contents I think they will still be compressed).

    The serialization trick is really interesting, thanks!

  7. How to store large datasets? | Matlabtips.com December 23, 2013 at 00:04 Reply

    […] you need to optimizing read/write speed when dealing with MAT/HDF5 files, I encourage you to read Yair’s excellent post on that […]

  8. Serializing/deserializing Matlab data | Undocumented Matlab January 22, 2014 at 12:58 Reply

    […] Last year I wrote an article on improving the performance of the save function. The article discussed various ways by which we can store Matlab data on disk. […]

  9. Explicit multi-threading in Matlab – part 1 | Undocumented Matlab February 19, 2014 at 13:20 Reply

    […] to be too slow for our specific needs. We could perhaps improve it a bit with some fancy tricks for save or fwrite. But let’s take a different approach today, using multi-threading:Using Java […]

  10. Roger April 16, 2014 at 08:27 Reply

    What about the load function? Inside a function, I found that using assignin(‘base’, variables) is faster than using evalin(‘base’, load .mat containing the variables). I thought loading the variables from a .mat file would be faster than using the assignin command. Would really like to see the same kind of topic but on the load command.

  11. Hugo June 29, 2014 at 22:47 Reply

    Hello,
    I’ve just tried to save a figure (around 2e4 double data points) with -v7.3 in R2014a, and it resulted in 100Mb file size instead of 1.4Mb with default -v7. A new bug maybe?

  12. jkoether January 22, 2015 at 11:35 Reply

    For me, the biggest advantage of savefast is that by saving uncompressed, a specific portion of an array can be loaded MUCH faster when using the matfile function. There seems to be some decompression overhead that is a function of the array size regardless of how little data you are asking for. For example, with a 250Mb file just using the save function, it take a minimum of 2 seconds to extract data, even if I am only loading 10k elements. If the file is created by savefast it only takes .02 seconds to load 10k elements. This will be a huge improvement for my software.

  13. CaptainObv May 20, 2015 at 08:16 Reply

    Is it possible to save matfiles without creating the initial variable at the beginning?

    I am continually adding output data to my matfile block by block (I.e continually expanding its dimensions). I want to speed this up and I considered preallocating the space within the matfile (as you do with regular arrays before a loop).

    However my initial variable will be quite large (hence why I am block processing), and I don’t want to risk the possibility of a memory error, so ideally I’d like To create the Matfile without Saving the variable to matlab memory as you did with your A and B variables.

    Is there a way?

    • Yair Altman May 20, 2015 at 08:18 Reply

      Not directly, but you could try using the low-level HDF5 functions for this.

Leave a Reply
HTML tags such as <b> or <i> are accepted.
Wrap code fragments inside <pre lang="matlab"> tags, like this:
<pre lang="matlab">
a = magic(3);
disp(sum(a))
</pre>
I reserve the right to edit/delete comments (read the site policies).
Not all comments will be answered. You can always email me (altmany at gmail) for private consulting.

Click here to cancel reply.

Useful links
  •  Email Yair Altman
  •  Subscribe to new posts (feed)
  •  Subscribe to new posts (reader)
  •  Subscribe to comments (feed)
 
Accelerating MATLAB Performance book
Recent Posts

Speeding-up builtin Matlab functions – part 3

Improving graphics interactivity

Interesting Matlab puzzle – analysis

Interesting Matlab puzzle

Undocumented plot marker types

Matlab toolstrip – part 9 (popup figures)

Matlab toolstrip – part 8 (galleries)

Matlab toolstrip – part 7 (selection controls)

Matlab toolstrip – part 6 (complex controls)

Matlab toolstrip – part 5 (icons)

Matlab toolstrip – part 4 (control customization)

Reverting axes controls in figure toolbar

Matlab toolstrip – part 3 (basic customization)

Matlab toolstrip – part 2 (ToolGroup App)

Matlab toolstrip – part 1

Categories
  • Desktop (45)
  • Figure window (59)
  • Guest bloggers (65)
  • GUI (165)
  • Handle graphics (84)
  • Hidden property (42)
  • Icons (15)
  • Java (174)
  • Listeners (22)
  • Memory (16)
  • Mex (13)
  • Presumed future risk (394)
    • High risk of breaking in future versions (100)
    • Low risk of breaking in future versions (160)
    • Medium risk of breaking in future versions (136)
  • Public presentation (6)
  • Semi-documented feature (10)
  • Semi-documented function (35)
  • Stock Matlab function (140)
  • Toolbox (10)
  • UI controls (52)
  • Uncategorized (13)
  • Undocumented feature (217)
  • Undocumented function (37)
Tags
AppDesigner (9) Callbacks (31) Compiler (10) Desktop (38) Donn Shull (10) Editor (8) Figure (19) FindJObj (27) GUI (141) GUIDE (8) Handle graphics (78) HG2 (34) Hidden property (51) HTML (26) Icons (9) Internal component (39) Java (178) JavaFrame (20) JIDE (19) JMI (8) Listener (17) Malcolm Lidierth (8) MCOS (11) Memory (13) Menubar (9) Mex (14) Optical illusion (11) Performance (78) Profiler (9) Pure Matlab (187) schema (7) schema.class (8) schema.prop (18) Semi-documented feature (6) Semi-documented function (33) Toolbar (14) Toolstrip (13) uicontrol (37) uifigure (8) UIInspect (12) uitable (6) uitools (20) Undocumented feature (187) Undocumented function (37) Undocumented property (20)
Recent Comments
Contact us
Captcha image for Custom Contact Forms plugin. You must type the numbers shown in the image
Undocumented Matlab © 2009 - Yair Altman
This website and Octahedron Ltd. are not affiliated with The MathWorks Inc.; MATLAB® is a registered trademark of The MathWorks Inc.
Scroll to top