A few months ago I wrote about Matlab’s undocumented serialization/deserialization functions, getByteStreamFromArray and getArrayFromByteStream. These can be very useful both for sending Matlab data across a network (thus avoiding the need to use a shared file) and for much faster data-save using the -V6 MAT format (save -v6 …).
As a followup to that article, in some cases it might be useful to use ZIP/GZIP compression, rather than Matlab’s proprietary MAT format or an uncompressed byte-stream.
Unfortunately, Matlab’s compression functions zip, gzip and tar do not really help run-time performance, but rather hurt it. The reason is that we would be paying the I/O costs three times: first to write the original (uncompressed) file, then to have zip or its counterparts read it, and finally to save the compressed file. tar is worst in this respect, since it does both a GZIP compression and a simple tar concatenation to get a standard tar.gz file. Using zip/gzip/tar only makes sense if we need to pass the data file to some external program on some remote server, whereby compressing the file could save transfer time. But as far as our Matlab program’s performance is concerned, these functions bring little value.
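To make the triple I/O cost concrete, here is a minimal sketch of the file-system route described above. The variable and file names are placeholders chosen only for illustration; save and gzip are the standard Matlab functions.

```matlab
% File-system compression: the data hits the disk more than once
data    = magic(4);                 % placeholder data
matFile = [tempname '.mat'];        % temporary uncompressed file

save(matFile, 'data', '-v6');       % I/O #1: write the uncompressed file
gzFiles = gzip(matFile);            % I/O #2 and #3: re-read it, then write matFile.gz
```

Every byte of the data is written, read back and written again (in compressed form), which is the overhead that in-memory compression avoids.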
In contrast to file-system compression, which is what zip/gzip/tar do, on-the-fly (memory) compression makes more sense and can indeed help performance. In this case, we are compressing the data in memory, and directly saving to file the resulting (compressed) binary data. The following example compresses byte data, such as the uint8 output of our getByteStreamFromArray serialization:
% Serialize the data into a 1D array of uint8 bytes
dataInBytes = getByteStreamFromArray(data);

% Parse the requested output filename (full path name)
[fpath,fname,fext] = fileparts(filepath);

% Compress in memory and save to the requested file in ZIP format
fos = java.io.FileOutputStream(filepath);
%fos = java.io.BufferedOutputStream(fos, 8*1024);  % not really important (*ZipOutputStream are already buffered), but doesn't hurt
if strcmpi(fext,'.gz')
    % Gzip variant:
    zos = java.util.zip.GZIPOutputStream(fos);  % note the capitalization
else
    % Zip variant:
    zos = java.util.zip.ZipOutputStream(fos);  % or: org.apache.tools.zip.ZipOutputStream as used by Matlab's zip.m
    ze  = java.util.zip.ZipEntry('data.dat');  % or: org.apache.tools.zip.ZipEntry as used by Matlab's zip.m
    ze.setSize(numel(dataInBytes));  % uncompressed size of the entry, in bytes
    zos.setLevel(9);  % set the compression level (0=none, 9=max)
    zos.putNextEntry(ze);
end
zos.write(dataInBytes, 0, numel(dataInBytes));
zos.finish;
zos.close;
This will directly create a zip archive file at the requested path. The archive will contain a single entry (data.dat) that contains our original data. Note that data.dat is entirely virtual: it was never actually created, saving us its associated I/O costs. In fact we could have called it simply data, or any other valid file name.
Saving to a gzip file is even simpler, since GZIP files have single file entries. There is no use for a ZipEntry as in zip archives that may contain multiple file entries.
Note that while the resulting ZIP/GZIP file is often smaller in size than the corresponding MAT file generated by Matlab’s save, it is not necessarily faster. In fact, except on slow disks or network drives, save may well outperform this mechanism. However, in some cases, the reduced file size may save enough I/O to offset the extra processing time. Moreover, GZIP is typically much faster than either ZIP or Matlab’s save.
Loading data from ZIP/GZIP
Similar logic applies to reading compressed data: We could indeed use unzip/gunzip/untar, but these would increase the I/O costs by reading the compressed file, saving the uncompressed version, and then reading that uncompressed file into Matlab.
A better solution would be to read the compressed file directly into Matlab. Unfortunately, the corresponding input-stream classes do not have a read() method that returns a byte array. We therefore use a small hack to copy the input stream into a ByteArrayOutputStream, using Matlab’s own stream-copier class that is used within all of Matlab’s compression and decompression functions:
% Parse the requested output filename (full path name)
[fpath,fname,fext] = fileparts(filepath);

% Get the serialized data
streamCopier = com.mathworks.mlwidgets.io.InterruptibleStreamCopier.getInterruptibleStreamCopier;
baos = java.io.ByteArrayOutputStream;
fis  = java.io.FileInputStream(filepath);
if strcmpi(fext,'.gz')
    % Gzip variant:
    zis = java.util.zip.GZIPInputStream(fis);
else
    % Zip variant:
    zis = java.util.zip.ZipInputStream(fis);

    % Note: although the ze & fileName variables are unused in the Matlab
    % ^^^^  code below, they are essential in order to read the ZIP!
    ze = zis.getNextEntry;
    fileName = char(ze.getName);  %#ok<NASGU> => 'data.dat' (virtual data file)
end
streamCopier.copyStream(zis,baos);
fis.close;
data = baos.toByteArray;  % array of Matlab int8

% Deserialize the data back into the original Matlab data format
% Note: the zipped data is int8 => need to convert into uint8:
% Note2: see discussion with Martin in the comments section below
if numel(data) < 1e5
    data = uint8(mod(int16(data),256))';
else
    data = typecast(data, 'uint8');
end
data = getArrayFromByteStream(data);
Note that when we deserialize, we have to convert the unzipped int8 byte-stream into a uint8 byte-stream that getArrayFromByteStream can process (we don’t need to do this during serialization).
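The following small sketch illustrates why the conversion is needed and shows that both variants used in the code above yield the same bytes. The sample values are arbitrary and chosen only for illustration.

```matlab
% Java bytes arrive in Matlab as signed int8 values (-128..127)
raw = int8([-128 -1 0 1 127]);

viaMod      = uint8(mod(int16(raw), 256));  % arithmetic re-mapping
viaTypecast = typecast(raw, 'uint8');       % bit-level reinterpretation
naive       = uint8(raw);                   % saturates: negative values become 0

disp(viaMod)        % 128 255 0 1 127
disp(viaTypecast)   % 128 255 0 1 127
disp(naive)         %   0   0 0 1 127
```

A plain uint8() cast is therefore not an option, since it would silently corrupt every byte above 127.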
The SAVEZIP utility
I have uploaded a utility called SAVEZIP to the Matlab File Exchange which includes the savezip and loadzip functions. These include the code above, plus some extra sanity checks and data processing. Usage is quite simple:
savezip('myData', magic(4))          %save data to myData.zip in current folder
savezip('myData', 'myVar')           %save myVar to myData.zip in current folder
savezip('myData.gz', 'myVar')        %save data to myData.gz in current folder
savezip('data\myData', magic(4))     %save data to .\data\myData.zip
savezip('data\myData.gz', magic(4))  %save data to .\data\myData.gz

myData = loadzip('myData');
myData = loadzip('myData.zip');
myData = loadzip('data\myData');
myData = loadzip('data\myData.gz');
Jan Berling has written another variant of the idea of using getByteStreamFromArray and getArrayFromByteStream for saving/loading data from disk, in this case in an uncompressed manner. He put it all in his Bytestream Save Toolbox on the File Exchange.
Transmitting compressed data via the network
If instead of saving to a file we wish to transmit the compressed data to a remote process (or to save it ourselves later), we can simply have our ZipOutputStream wrap a ByteArrayOutputStream rather than a FileOutputStream. For example, on the way out:
baos = java.io.ByteArrayOutputStream;
if isGzipVariant
    zos = java.util.zip.GZIPOutputStream(baos);
else  % Zip variant
    zos = java.util.zip.ZipOutputStream(baos);
    ze  = java.util.zip.ZipEntry('data.dat');
    ze.setSize(numel(data));
    zos.setLevel(9);
    zos.putNextEntry(ze);
end
dataInBytes = int8(data);  % or: getByteStreamFromArray(data)
zos.write(dataInBytes,0,numel(dataInBytes));
zos.finish;
zos.close;
compressedDataArray = baos.toByteArray;  % array of Matlab int8
I leave it as an exercise to the reader to make the corresponding changes for the receiving end.
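For readers who want a starting point, here is one possible sketch of the receiving end, limited to the gzip variant. It is not taken from the SAVEZIP utility. The payload is a placeholder, and the sender part is repeated only so that the snippet can run on its own.

```matlab
% --- Sender (gzip variant, as above) ---
payload = uint8('hello hello hello hello');   % placeholder data
baos = java.io.ByteArrayOutputStream;
zos  = java.util.zip.GZIPOutputStream(baos);
zos.write(payload, 0, numel(payload));
zos.finish;  zos.close;
compressedDataArray = baos.toByteArray;        % int8 array to transmit

% --- Receiver: wrap the received bytes in a ByteArrayInputStream ---
bais = java.io.ByteArrayInputStream(compressedDataArray);
zis  = java.util.zip.GZIPInputStream(bais);
out  = java.io.ByteArrayOutputStream;
streamCopier = com.mathworks.mlwidgets.io.InterruptibleStreamCopier.getInterruptibleStreamCopier;
streamCopier.copyStream(zis, out);
zis.close;
received = typecast(out.toByteArray, 'uint8')';  % int8 column => uint8 row
```

For the zip variant, a ZipInputStream would presumably need a getNextEntry call before copying, just as in the file-loading code earlier.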
New introductory Matlab book
Matlab has a plethora of introductory books. But I have a special affection for one that was released only a few days ago: MATLAB Succinctly by Dmitri Nesteruk, for which I was a technical editor/reviewer. It’s a very readable and easy-to-follow book, and it’s totally free, so go ahead and download.
This title adds to the large (and growing) set of free ~100-page introductory titles by Syncfusion, on a wide variety of programming languages and technologies. Go ahead and download these books as well. While you’re at it, take a look at Syncfusion’s set of professional components and spread the word. If Syncfusion gets enough income from such incoming traffic, they may continue to support their commendable project of similar free IT-related titles.
This may be a good place to update that my own [second] book, Accelerating MATLAB Performance, is nearing completion, and is currently in advanced copy-editing stage. It turned out that there was a lot more to Matlab performance than I initially realized. The book size (and writing time) doubled, and it turned out to be a hefty ~750 pages packed full of performance improvement tips. I’m still considering the cover image (ideas anyone?) and working on the index, but the end is now in sight. The book should be in your favorite bookstore this December. If you can’t wait until then, and/or if you’d rather use the real McCoy, consider inviting me for a consulting session…
Addendum December 16, 2014: I am pleased to announce that my book, Accelerating MATLAB Performance, is now published (additional information).
I think you want to replace the relatively expensive uint8(mod(int16(data),256))' with a much faster typecast(data, 'uint8') because what you really want is just to interpret the bits of the data differently. (Assuming MATLAB is clever enough to optimize away duplicating data, this operation is just a matter of changing the type field. And at least in the prerelease of R2014b it is that clever.)
I’m still somewhat bothered by the fact that these functions (getArrayFromByteStream and getByteStreamFromArray) cannot handle arbitrarily large MATLAB variables, i.e. those exceeding two (or was it four?) gigabytes in size. This probably does not matter to most users, but it’s not uncommon to have such sizes in scientific applications.
@Martin – thanks for the feedback. My method (data = uint8(mod(int16(data),256))') has a linear time cost which is faster than typecast‘s constant cost (1-2ms) for data arrays up to ~1e5 elements in size (=100KB). Of course, YMMV on your specific platform. In any case, for smaller arrays my method is faster, for larger arrays typecast is better. Memory is probably not much of an issue, since I believe my method does the operation in-place.

I hear you re the 2-4GB limit. Alas, I’m not familiar with any magic wand here.
I guess the input variable size for getByteStreamFromArray is limited by the memory.
If there is a hard size-limit it would be good to know it, so that assertions can be inserted in using functions.
Is it possible to split arbitrary MATLAB types in a formal way? splitting arbitrary cells and structs in a heuristic way would be possible but not beautiful at all.
@Jan – I am not aware of a way to do this.
@Jan – The limitation is definitely not due to the available memory. (Nowadays it is relatively easy to verify this empirically, since most people have access to machines with 8 GB RAM or more.)
The limit is either 2 or 4 GB depending on the type of data, since the format uses 32-bit signed integers in some places and 32-bit unsigned integers in other places. If we stick to plain arrays, the limit is 2^32-1 bytes or entries in one dimension, i.e. this stays within the limit (and thus works)
while
both fail.
If you use aggregate data types, e.g. cells or structs, then the limit is 4 GB. You can put 3 arrays of size 1 GB into a cell array and then successfully serialize it.
(However, there’s an elegant way to get beyond this limitation. Watch out for one of my future comments.)
@Yair – That’s an interesting (and to me quite surprising) observation that typecast is slower than the computation you use(d). I have to admit that I never benchmarked anything to support my claim from above. I still believe typecast is better because it makes the intent much clearer (to a low-level coder like I am).
If you want to get rid of copyStream paired with the rather awkward com.mathworks.mlwidgets.io.InterruptibleStreamCopier.getInterruptibleStreamCopier and replace it with something more standard (though not part of the core JDK), you can use IOUtils from Apache Commons, which also ships with MATLAB:
You can even get rid of the intermediate ByteArrayOutputStream (possibly resulting in improved speed and lower memory consumption) and simply write:
I haven’t checked when this became available in MATLAB, so this might not be the best solution for ancient MATLAB releases.
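A minimal sketch of the IOUtils approach mentioned in this comment, assuming that org.apache.commons.io.IOUtils is available on Matlab's Java classpath in your release (a later comment reports that it is bundled with R2009b through R2014b). The temporary file name and test data are placeholders; the file is created first only so that the snippet is self-contained.

```matlab
% Create a small gzip file to read back (placeholder data)
filepath = [tempname '.gz'];
original = uint8(1:100);
fos = java.io.FileOutputStream(filepath);
zos = java.util.zip.GZIPOutputStream(fos);
zos.write(original, 0, numel(original));
zos.finish;  zos.close;

% Read it back without an intermediate ByteArrayOutputStream or stream copier
fis  = java.io.FileInputStream(filepath);
zis  = java.util.zip.GZIPInputStream(fis);
data = org.apache.commons.io.IOUtils.toByteArray(zis);  % int8 array
zis.close;
data = typecast(data, 'uint8')';   % reinterpret as uint8, as a row vector
```

If the stream copier is preferred but a ByteArrayOutputStream is kept, IOUtils.copy(zis, baos) is the corresponding call in the same library.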
Thanks for mentioning my toolbox!
The suggestion from Martin does not work on my ancient R2011b.
Because of the calculation time of the getByteStreamFromArray function it could be beneficial to calculate the bytestream directly in the function call for the Java-Stream. Maybe along the lines of
Like this there is a double calculation of the bytestream, there should be another way to calculate the size of the bytestream…
@Jan – you are not avoiding the Matlab call/memory by placing the getByteStreamFromArray twice in the zos.write() call – you are simply multiplying it. In this case it would be better to use a local variable to store the data:

@Jan – I’m surprised org.apache.commons.io.IOUtils.toByteArray() didn’t work for you. I checked R2009b to R2014b and they all bundle IO from Apache Commons (R2009b-R2012a has version 1.3.1, R2012b has 2.1, and R2013a-R2014b has 2.4). I tested on Linux (64-bit, R2009b-R2014a) and on OS X (64-bit, R2013a-R2014b) and in all cases my test script succeeded. I wonder what’s different in your configuration. This is my minimal test (that has to be put in a script file):
James Tursa’s MEX implementation of TYPECASTX uses a shared data copy instead of a deep copy, which is much faster than the original Matlab version: http://www.mathworks.com/matlabcentral/fileexchange/17476-typecast-and-typecastx-c-mex-functions
@Jan – Thanks for the mention of Tursa’s TYPECASTX. I am of course familiar with it (and mention it in my upcoming book). Tursa’s TYPECASTX is indeed better than Matlab’s builtin typecast. However, in this specific case, type-casting usually takes only a small fraction of the overall time, so I felt it is better to use the built-in functions for simplicity.
Similarly, it would be great to zip/unzip data into/from a Matlab variable, e.g.
This could be an invaluable tool when we want to keep in memory large amount of redundant data for maximum speed data access, e.g. picking rapidly images from thousands-of-images-database, which exceed the size of memory when it’s uncompressed.
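One way this idea might be sketched with the building blocks from the article is shown below. It is only an illustration: the image stack is synthetic and deliberately redundant, and IOUtils is assumed to be on Matlab's Java classpath, as discussed in the comments above.

```matlab
% Hypothetical, highly redundant data (200 identical 16x16 "images")
images = repmat(uint8(magic(16)), [1 1 200]);

% Compress: serialize, then gzip into a byte array that stays in memory
rawBytes = getByteStreamFromArray(images);
baos = java.io.ByteArrayOutputStream;
zos  = java.util.zip.GZIPOutputStream(baos);
zos.write(rawBytes, 0, numel(rawBytes));
zos.finish;  zos.close;
packed = baos.toByteArray;          % compressed int8 vector, kept in a Matlab variable

% Decompress on demand: gunzip in memory, then deserialize
zis = java.util.zip.GZIPInputStream(java.io.ByteArrayInputStream(packed));
unpackedBytes = org.apache.commons.io.IOUtils.toByteArray(zis);
zis.close;
restored = getArrayFromByteStream(typecast(unpackedBytes, 'uint8')');

fprintf('serialized: %d bytes, compressed: %d bytes\n', numel(rawBytes), numel(packed));
```

Whether the decompression cost is acceptable for rapid access depends on the data and the platform, so it is worth timing on real images before committing to this design.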
Great post – thx all!
Would you know a way to write a table directly to a zip/gz-file instead of writetable() to csv and subsequently compressing it? As far as I know, in Java that’s possible.
Any help is greatly appreciated.
@Holger – read the article carefully: I explained how you can serialize any Matlab data (including tables) into a byte stream that can then be compressed into a zip/gz file. You can use the savezip utility for this as explained in the article.
Dear Yair,
thanks for the quick response. Your code works perfectly fine. My problem is, that the file written is a gzipped mat-file. At least, if I uncompress it with some other program than matlab, I cannot interpret it easily. For a specific application, I need to have a gzipped csv-file (table). That means, if I gunzip it with some other program, it should give a common csv-file. Now, I guess that is possible with your code. But I don’t understand how. Could you please give an example? This would really be helpful and boost my performance a lot.
Thanks again.
You will need to manually extract your data into CSV format from Matlab’s table object. There is no direct way of doing that other than using writetable.
Thanks for clarifying. That’s what I guessed. However, it is possible in Java to read and write into or out of a gzipped csv-file in the format of old-school spreadsheet (e.g. Excel-compatible). So shouldn’t it be possible with Matlab via Java too? Perhaps that would require mimicking “writetable” somehow. Unfortunately I currently don’t have the time to continue with this, but it would still be super great if someone figures this out.
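As a rough sketch of what such a manual route could look like, the CSV text can be built in memory and pushed through a GZIPOutputStream, so that gunzip-ing the result in any other program yields an ordinary CSV file. The table contents, column names and output path are hypothetical, and this simplistic formatting only handles numeric columns; it is not a replacement for writetable.

```matlab
% Hypothetical numeric table
T = table([1;2;3], [0.5;1.5;2.5], 'VariableNames', {'id','value'});

% Build the CSV text in memory (numeric columns only)
header  = strjoin(T.Properties.VariableNames, ',');
rows    = sprintf('%d,%g\n', [T.id, T.value]');   % one line per table row
csvText = sprintf('%s\n%s', header, rows);
csvBytes = unicode2native(csvText, 'UTF-8');      % uint8 byte array

% Write the bytes straight into a gzip file, with no intermediate CSV on disk
gzPath = [tempname '.csv.gz'];                    % placeholder output path
fos = java.io.FileOutputStream(gzPath);
zos = java.util.zip.GZIPOutputStream(fos);
zos.write(csvBytes, 0, numel(csvBytes));
zos.finish;  zos.close;
```

Text columns, quoting and missing values would each need their own handling, which is where most of writetable's work lies.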
Hello, thanks for sharing! I currently receive the following memory error message when using savezip with a big object (Matlab R2020b, Windows 10).
Would you have an idea how the internal java memory limit can be increased?
You can try to increase the Java heap memory size in the Matlab preferences (General => Java Heap Memory). Any changes to the settings will only take effect after restarting Matlab.
https://www.mathworks.com/help/matlab/matlab_external/java-heap-memory-preferences.html
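To check whether a changed preference actually took effect after the restart, the Java heap limits can be queried from within Matlab. This is a small sketch using the standard java.lang.Runtime API; the variable names are arbitrary.

```matlab
% Query the JVM that is embedded in Matlab
rt = java.lang.Runtime.getRuntime;
maxHeapMB   = rt.maxMemory   / 2^20;   % upper limit the heap may grow to
totalHeapMB = rt.totalMemory / 2^20;   % heap currently reserved by the JVM
freeHeapMB  = rt.freeMemory  / 2^20;   % unused part of the reserved heap

fprintf('Java heap: max %.0f MB, reserved %.0f MB, free %.0f MB\n', ...
        maxHeapMB, totalHeapMB, freeHeapMB);
```

Byte arrays that are passed to or returned from the Java stream classes have to fit within this limit, which is presumably why very large objects trigger the error.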
Thanks Yair, and sorry, I wasn’t notified of your answer.
I’m facing again the issue, even when increasing this Matlab setting to its maximum (8,159Mb). I guess the issue is completely independent from the savezip function right? Is there a way to use another type of memory allowing to save such a big file without relying on Matlab’s internal java memory limit?
Try to use a filename with .gz extension – this will use GZIPOutputStream instead of ZipOutputStream to compress the data, and perhaps it will solve the problem. If not, split your data into 2 separate parts and save/load them separately.