A few months ago I wrote about Matlab’s undocumented serialization/deserialization functions, getByteStreamFromArray and getArrayFromByteStream. These can be very useful both for sending Matlab data across a network (thus avoiding the need to use a shared file) and for much faster data-save using the -V6 MAT format (save -v6 …).
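As a quick reminder of how these two functions fit together, here is a minimal round-trip sketch. The sample struct is purely hypothetical, and since both functions are undocumented, their behavior may differ between Matlab releases:

```matlab
% Round-trip a Matlab value through the undocumented serialization functions
original   = struct('name','test', 'values',magic(4));  % hypothetical sample data
byteStream = getByteStreamFromArray(original);          % serialize into uint8 bytes
restored   = getArrayFromByteStream(byteStream);        % deserialize back into a Matlab value

% The restored value should match the original exactly
fprintf('Serialized into %d bytes; round-trip ok = %d\n', ...
        numel(byteStream), isequal(original, restored));
```

The byte stream is an ordinary uint8 array, so it can be written to a file, compressed, or sent over a socket like any other binary data.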
As a followup to that article, in some cases it might be useful to use ZIP/GZIP compression, rather than Matlab’s proprietary MAT format or an uncompressed byte-stream.
Unfortunately, Matlab’s compression functions zip, gzip and tar do not really help run-time performance, but rather hurt it. The reason is that we would be paying the I/O costs three times: first to write the original (uncompressed) file, then to have zip or its counterparts read it, and finally to save the compressed file. tar is worst in this respect, since it does both a GZIP compression and a simple tar concatenation to get a standard tar.gz file. Using zip/gzip/tar only makes sense if we need to pass the data file to some external program on some remote server, whereby compressing the file could save transfer time. But as far as our Matlab program’s performance is concerned, these functions bring little value.
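To make the triple I/O cost concrete, here is a minimal sketch of the file-based approach that the paragraph above argues against. The sample data and temporary file name are hypothetical; the point is only to show where each disk access happens:

```matlab
% File-based compression: the data touches the disk three times
data    = rand(500);              % hypothetical sample data
tmpFile = [tempname '.mat'];      % temporary uncompressed file

save(tmpFile, 'data', '-v6');     % I/O #1: write the uncompressed MAT file
gzFiles = gzip(tmpFile);          % I/O #2: gzip reads that file back; I/O #3: it writes the .gz file
delete(tmpFile);                  % clean up the intermediate file

disp(gzFiles{1});                 % full name of the resulting compressed file
```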
In contrast to file-system compression, which is what zip/gzip/tar do, on-the-fly (memory) compression makes more sense and can indeed help performance. In this case, we are compressing the data in memory, and directly saving to file the resulting (compressed) binary data. The following example compresses byte data, such as the uint8 output of our getByteStreamFromArray serialization:
    % Serialize the data into a 1D array of uint8 bytes
    dataInBytes = getByteStreamFromArray(data);

    % Parse the requested output filename (full path name)
    [fpath,fname,fext] = fileparts(filepath);

    % Compress in memory and save to the requested file in ZIP format
    fos = java.io.FileOutputStream(filepath);
    %fos = java.io.BufferedOutputStream(fos, 8*1024);  % not really important (*ZipOutputStream are already buffered), but doesn't hurt
    if strcmpi(fext,'.gz')
        % Gzip variant:
        zos = java.util.zip.GZIPOutputStream(fos);  % note the capitalization
    else
        % Zip variant:
        zos = java.util.zip.ZipOutputStream(fos);  % or: org.apache.tools.zip.ZipOutputStream as used by Matlab's zip.m
        ze  = java.util.zip.ZipEntry('data.dat');  % or: org.apache.tools.zip.ZipEntry as used by Matlab's zip.m
        ze.setSize(numel(dataInBytes));  % uncompressed size of the entry
        zos.setLevel(9);  % set the compression level (0=none, 9=max)
        zos.putNextEntry(ze);
    end
    zos.write(dataInBytes, 0, numel(dataInBytes));
    zos.finish;
    zos.close;
This will directly create a zip archive file at the requested location (filepath). The archive will contain a single entry (data.dat) that contains our original data. Note that data.dat is entirely virtual: it was never actually created, saving us its associated I/O costs. In fact we could have called it simply data, or any other valid file name.
Saving to a gzip file is even simpler, since GZIP files have single file entries. There is no use for a ZipEntry as in zip archives that may contain multiple file entries.
Note that while the resulting ZIP/GZIP file is often smaller in size than the corresponding MAT file generated by Matlab’s save, it is not necessarily faster. In fact, except on slow disks or network drives, save may well outperform this mechanism. However, in some cases, the reduced file size may save enough I/O to offset the extra processing time. Moreover, GZIP is typically much faster than either ZIP or Matlab’s save.
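Since the relative speed depends so heavily on the data and the disk, it is worth measuring on your own system. The following is only a rough timing sketch: the sample data and temporary file names are hypothetical, and a single tic/toc measurement is noisy, so repeat it several times before drawing conclusions:

```matlab
% Rough comparison of Matlab's save vs. in-memory GZIP compression
data    = repmat(magic(100), 10, 10);   % hypothetical, highly compressible sample data
matFile = [tempname '.mat'];
gzFile  = [tempname '.gz'];

% Time Matlab's builtin save (default MAT format)
tic;
save(matFile, 'data');
tSave = toc;

% Time serialization + in-memory GZIP compression + a single file write
tic;
dataInBytes = getByteStreamFromArray(data);
fos = java.io.FileOutputStream(gzFile);
zos = java.util.zip.GZIPOutputStream(fos);
zos.write(dataInBytes, 0, numel(dataInBytes));
zos.finish;
zos.close;
tGzip = toc;

% Report the elapsed times and resulting file sizes, then clean up
matInfo = dir(matFile);
gzInfo  = dir(gzFile);
fprintf('save: %.3f sec, %d bytes\n', tSave, matInfo.bytes);
fprintf('gzip: %.3f sec, %d bytes\n', tGzip, gzInfo.bytes);
delete(matFile);
delete(gzFile);
```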
Loading data from ZIP/GZIP
Similar logic applies to reading compressed data: We could indeed use unzip/gunzip/untar, but these would increase the I/O costs by reading the compressed file, saving the uncompressed version, and then reading that uncompressed file into Matlab.
A better solution would be to read the compressed file directly into Matlab. Unfortunately, the corresponding input-stream classes do not have a read() method that returns a byte array. We therefore use a small hack to copy the input stream into a ByteArrayOutputStream, using Matlab’s own stream-copier class that is used within all of Matlab’s compression and decompression functions:
    % Parse the requested input filename (full path name)
    [fpath,fname,fext] = fileparts(filepath);

    % Get the serialized data
    streamCopier = com.mathworks.mlwidgets.io.InterruptibleStreamCopier.getInterruptibleStreamCopier;
    baos = java.io.ByteArrayOutputStream;
    fis  = java.io.FileInputStream(filepath);
    if strcmpi(fext,'.gz')
        % Gzip variant:
        zis = java.util.zip.GZIPInputStream(fis);
    else
        % Zip variant:
        zis = java.util.zip.ZipInputStream(fis);

        % Note: although the ze & fileName variables are unused in the Matlab
        % ^^^^  code below, they are essential in order to read the ZIP!
        ze = zis.getNextEntry;
        fileName = char(ze.getName);  %#ok<NASGU> => 'data.dat' (virtual data file)
    end
    streamCopier.copyStream(zis,baos);
    zis.close;  % also closes the underlying file stream (fis)
    data = baos.toByteArray;  % array of Matlab int8

    % Deserialize the data back into the original Matlab data format
    % Note: the zipped data is int8 => need to convert into uint8:
    % Note2: see discussion with Martin in the comments section below
    if numel(data) < 1e5
        data = uint8(mod(int16(data),256))';
    else
        data = typecast(data, 'uint8');
    end
    data = getArrayFromByteStream(data);
Note that when we deserialize, we have to convert the unzipped int8 byte-stream into a uint8 byte-stream that getArrayFromByteStream can process (we don’t need to do this during serialization).
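The reason for the conversion is that Java bytes are signed, so any byte value above 127 comes back to Matlab as a negative int8. A small sketch with hand-picked sample values shows that both conversions used above give the same result, whereas a plain uint8() cast does not:

```matlab
% Signed bytes as returned by Java (sample values chosen to cover the edge cases)
javaBytes = int8([-128 -1 0 1 127]);

viaMod      = uint8(mod(int16(javaBytes), 256));  % arithmetic conversion  => [128 255 0 1 127]
viaTypecast = typecast(javaBytes, 'uint8');       % reinterpret the bits   => [128 255 0 1 127]
saturated   = uint8(javaBytes);                   % plain cast saturates   => [0 0 0 1 127]

fprintf('mod and typecast agree: %d\n', isequal(viaMod, viaTypecast));
fprintf('plain cast agrees:      %d\n', isequal(saturated, viaTypecast));
```

A plain cast silently clips all negative values to 0, corrupting the byte stream, which is why it must not be used here.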
The SAVEZIP utility
I have uploaded a utility called SAVEZIP to the Matlab File Exchange which includes the savezip and loadzip functions. These include the code above, plus some extra sanity checks and data processing. Usage is quite simple:
    savezip('myData', magic(4))           %save data to myData.zip in current folder
    savezip('myData', 'myVar')            %save myVar to myData.zip in current folder
    savezip('myData.gz', 'myVar')         %save data to myData.gz in current folder
    savezip('data\myData', magic(4))      %save data to .\data\myData.zip
    savezip('data\myData.gz', magic(4))   %save data to .\data\myData.gz

    myData = loadzip('myData');
    myData = loadzip('myData.zip');
    myData = loadzip('data\myData');
    myData = loadzip('data\myData.gz');
Jan Berling has written another variant of the idea of using getByteStreamFromArray and getArrayFromByteStream for saving/loading data from disk, in this case in an uncompressed manner. He put it all in his Bytestream Save Toolbox on the File Exchange.
Transmitting compressed data via the network
If instead of saving to a file we wish to transmit the compressed data to a remote process (or to save it ourselves later), we can simply have our ZipOutputStream wrap a ByteArrayOutputStream rather than a FileOutputStream. For example, on the way out:
    baos = java.io.ByteArrayOutputStream;
    dataInBytes = int8(data);  % or: getByteStreamFromArray(data)
    if isGzipVariant
        zos = java.util.zip.GZIPOutputStream(baos);
    else  % Zip variant
        zos = java.util.zip.ZipOutputStream(baos);
        ze  = java.util.zip.ZipEntry('data.dat');
        ze.setSize(numel(dataInBytes));
        zos.setLevel(9);
        zos.putNextEntry(ze);
    end
    zos.write(dataInBytes,0,numel(dataInBytes));
    zos.finish;
    zos.close;
    compressedDataArray = baos.toByteArray;  % array of Matlab int8
I leave it as an exercise to the reader to make the corresponding changes for the receiving end.
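For readers who want a starting point, here is one possible sketch of the receiving end. It assumes the GZIP variant and that the sender serialized with getByteStreamFromArray; the sender part is included only so that the example is self-contained, and the variable names are my own:

```matlab
% Sender side (included only to make the demo self-contained)
original = magic(4);                            % hypothetical sample data
bytesOut = getByteStreamFromArray(original);
baosOut  = java.io.ByteArrayOutputStream;
zos = java.util.zip.GZIPOutputStream(baosOut);
zos.write(bytesOut, 0, numel(bytesOut));
zos.finish;
zos.close;
compressedDataArray = baosOut.toByteArray;      % int8 array to transmit

% Receiving side: decompress directly from memory, no file involved
bais = java.io.ByteArrayInputStream(compressedDataArray);
zis  = java.util.zip.GZIPInputStream(bais);
baos = java.io.ByteArrayOutputStream;
streamCopier = com.mathworks.mlwidgets.io.InterruptibleStreamCopier.getInterruptibleStreamCopier;
streamCopier.copyStream(zis, baos);
zis.close;

% Java bytes are signed => reinterpret as uint8 before deserializing
bytesIn = typecast(baos.toByteArray, 'uint8')';
data = getArrayFromByteStream(bytesIn);
```

For the Zip variant, a ZipInputStream would replace the GZIPInputStream, and getNextEntry must be called before copying the stream, just as in the file-based loading code.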
New introductory Matlab book
Matlab has a plethora of introductory books. But I have a special affection for one that was released only a few days ago: MATLAB Succinctly by Dmitri Nesteruk, for which I was a technical editor/reviewer. It’s a very readable and easy-to-follow book, and it’s totally free, so go ahead and download.
This title adds to the large (and growing) set of free ~100-page introductory titles by Syncfusion, on a wide variety of programming languages and technologies. Go ahead and download these books as well. While you’re at it, take a look at Syncfusion’s set of professional components and spread the word. If Syncfusion gets enough income from such incoming traffic, they may continue to support their commendable project of similar free IT-related titles.
This may be a good place to update that my own [second] book, Accelerating MATLAB Performance, is nearing completion, and is currently in the advanced copy-editing stage. It turned out that there was a lot more to Matlab performance than I initially realized. The book size (and writing time) doubled, and it turned out to be a hefty ~750 pages packed full of performance improvement tips. I’m still considering the cover image (ideas anyone?) and working on the index, but the end is now in sight. The book should be in your favorite bookstore this December. If you can’t wait until then, and/or if you’d rather use the real McCoy, consider inviting me for a consulting session…
Addendum December 16, 2014: I am pleased to announce that my book, Accelerating MATLAB Performance, is now published (additional information).
I think you want to replace the relatively expensive uint8(mod(int16(data),256)) with a much faster typecast(data,'uint8'), because what you really want is just to interpret the bits of the data differently. (Assuming MATLAB is clever enough to optimize away duplicating data, this operation is just a matter of changing the type field. And at least in the prerelease of R2014b it is that clever.)
I’m still somewhat bothered by the fact that these functions (getArrayFromByteStream and getByteStreamFromArray) cannot handle arbitrarily large MATLAB variables, i.e. those exceeding two (or was it four?) gigabytes in size. This probably does not matter to most users, but it’s not uncommon to have such sizes in scientific applications.