savezip utility

September 4, 2014

A few months ago I wrote about Matlab’s undocumented serialization/deserialization functions ^{[1]}, *getByteStreamFromArray* and

As a followup to that article, in some cases it might be useful to use ZIP/GZIP compression, rather than Matlab’s proprietary MAT format or an uncompressed byte-stream.

Unfortunately, Matlab’s compression functions

In contrast to file-system compression, which is what

```
% Serialize the data into a 1D array of uint8 bytes
dataInBytes = getByteStreamFromArray(data);

% Parse the requested output filename (full path name)
[fpath,fname,fext] = fileparts(filepath);

% Compress in memory and save to the requested file in ZIP format
fos = java.io.FileOutputStream(filepath);
%fos = java.io.BufferedOutputStream(fos, 8*1024);  % not really important (*ZipOutputStream are already buffered), but doesn't hurt
if strcmpi(fext,'.gz')
    % Gzip variant:
    zos = java.util.zip.GZIPOutputStream(fos);  % note the capitalization
else
    % Zip variant:
    zos = java.util.zip.ZipOutputStream(fos);  % or: org.apache.tools.zip.ZipOutputStream as used by Matlab's zip.m
    ze  = java.util.zip.ZipEntry('data.dat');  % or: org.apache.tools.zip.ZipEntry as used by Matlab's zip.m
    ze.setSize(numel(dataInBytes));  % uncompressed size of the entry, in bytes
    zos.setLevel(9);  % set the compression level (0=none, 9=max)
    zos.putNextEntry(ze);
end
zos.write(dataInBytes, 0, numel(dataInBytes));
zos.finish;
zos.close;
```

This will directly create a zip archive file in the current folder. The archive will contain a single entry (*data.dat*) that contains our original data. Note that *data.dat* is entirely virtual: it was never actually created, saving us its associated I/O costs. In fact we could have called it simply *data*, or any other valid file name.
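To check that the archive really holds a single virtual entry, one option is to list its contents with Java's standard `java.util.zip.ZipFile` class. The following is only a sketch: it assumes a Matlab release that bundles a JVM and still provides the undocumented *getByteStreamFromArray*, and the temporary file name and sample data are made up for illustration.

```matlab
% Illustrative setup (hypothetical temporary file name and sample data)
filepath = [tempname '.zip'];
dataInBytes = getByteStreamFromArray(magic(4));

% Write a single-entry zip archive, as in the snippet above
zos = java.util.zip.ZipOutputStream(java.io.FileOutputStream(filepath));
zos.putNextEntry(java.util.zip.ZipEntry('data.dat'));
zos.write(dataInBytes, 0, numel(dataInBytes));
zos.close;  % close() finishes the archive and closes the file stream

% Reopen the archive and list its entries
zf = java.util.zip.ZipFile(filepath);
entries = zf.entries();
entryNames = {};
while entries.hasMoreElements()
    ze = entries.nextElement();
    entryNames{end+1} = char(ze.getName()); %#ok<SAGROW>
    fprintf('%s: %d bytes (%d compressed)\n', ...
        entryNames{end}, ze.getSize(), ze.getCompressedSize());
end
zf.close();
delete(filepath);  % clean up the temporary archive
```

The listing should report exactly one entry, named *data.dat*, even though no such file ever existed on disk.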

Saving to a gzip file is even simpler, since GZIP files have single file entries. There is no use for a `ZipEntry` as in zip archives that may contain multiple file entries.

Note that while the resulting ZIP/GZIP file is often smaller in size than the corresponding MAT file generated by Matlab’s *save*, it is not necessarily faster. In fact, except on slow disks or network drives,

Similar logic applies to reading compressed data: We could indeed use *unzip/gunzip/untar*, but these would increase the I/O costs by reading the compressed file, saving the uncompressed version, and then reading that uncompressed file into Matlab.

A better solution would be to read the compressed file directly into Matlab. Unfortunately, the corresponding input-stream classes do not have a method that returns all of the data at once, so we copy the input stream into a `ByteArrayOutputStream`, using Matlab’s own stream-copier class that is used within all of Matlab’s compression and decompression functions:

```
% Parse the requested input filename (full path name)
[fpath,fname,fext] = fileparts(filepath);

% Get the serialized data
streamCopier = com.mathworks.mlwidgets.io.InterruptibleStreamCopier.getInterruptibleStreamCopier;
baos = java.io.ByteArrayOutputStream;
fis  = java.io.FileInputStream(filepath);
if strcmpi(fext,'.gz')
    % Gzip variant:
    zis = java.util.zip.GZIPInputStream(fis);
else
    % Zip variant:
    zis = java.util.zip.ZipInputStream(fis);
    % Note: although the ze & fileName variables are unused in the Matlab
    % ^^^^  code below, they are essential in order to read the ZIP!
    ze = zis.getNextEntry;
    fileName = char(ze.getName);  %#ok => 'data.dat' (virtual data file)
end
streamCopier.copyStream(zis,baos);
zis.close;  % also closes the underlying fis
data = baos.toByteArray;  % array of Matlab int8

% Deserialize the data back into the original Matlab data format
% Note: the zipped data is int8 => need to convert into uint8:
% Note2: see discussion with Martin in the comments section below
if numel(data) < 1e5
    data = uint8(mod(int16(data),256))';
else
    data = typecast(data, 'uint8');
end
data = getArrayFromByteStream(data);
```

Note that when we deserialize, we have to convert the unzipped *int8* byte-stream into a *uint8* one before passing it to *getArrayFromByteStream*.
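As a small worked illustration of that conversion, the two variants used in the code (the arithmetic one and *typecast*) can be compared on a handful of made-up byte values. This is only a sketch; the sample values are hypothetical, and the column orientation simply mimics how a Java `byte[]` typically arrives in Matlab.

```matlab
% Bytes as they would arrive from Java: signed int8 values in a column
javaBytes = int8([0; 1; 127; -128; -1]);

% Variant 1: arithmetic conversion (the transpose yields a row vector)
viaMod = uint8(mod(int16(javaBytes), 256))';

% Variant 2: reinterpret the very same bits as unsigned bytes
viaTypecast = typecast(javaBytes, 'uint8');

% Both variants yield the byte values 0 1 127 128 255
disp(viaMod);
sameBytes = isequal(viaMod(:), viaTypecast(:));
```

Negative *int8* values map onto the upper half of the *uint8* range (-128 becomes 128, -1 becomes 255), which is exactly the two's-complement reinterpretation that *typecast* performs without any arithmetic.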

I have uploaded a utility called SAVEZIP ^{[2]} to the Matlab File Exchange which includes the *savezip* and

```
savezip('myData', magic(4)) %save data to myData.zip in current folder
savezip('myData', 'myVar') %save myVar to myData.zip in current folder
savezip('myData.gz', 'myVar') %save data to myData.gz in current folder
savezip('data\myData', magic(4)) %save data to .\data\myData.zip
savezip('data\myData.gz', magic(4)) %save data to .\data\myData.gz
myData = loadzip('myData');
myData = loadzip('myData.zip');
myData = loadzip('data\myData');
myData = loadzip('data\myData.gz');
```

Jan Berling has written another variant of the idea of using *getByteStreamFromArray* and

If instead of saving to a file we wish to transmit the compressed data to a remote process (or to save it ourselves later), we can simply have our `ZipOutputStream` write into a `ByteArrayOutputStream` rather than a `FileOutputStream`. For example, on the way out:

```
dataInBytes = int8(data);  % or: getByteStreamFromArray(data)
baos = java.io.ByteArrayOutputStream;
if isGzipVariant
    zos = java.util.zip.GZIPOutputStream(baos);
else  % Zip variant
    zos = java.util.zip.ZipOutputStream(baos);
    ze  = java.util.zip.ZipEntry('data.dat');
    ze.setSize(numel(dataInBytes));
    zos.setLevel(9);
    zos.putNextEntry(ze);
end
zos.write(dataInBytes,0,numel(dataInBytes));
zos.finish;
zos.close;
compressedDataArray = baos.toByteArray;  % array of Matlab int8
```

I leave it as an exercise to the reader to make the corresponding changes for the receiving end.
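For readers who want a starting point, here is one possible sketch of a full in-memory round trip, including a receiving end. It is not necessarily how *loadzip* does it: it assumes the sender serialized with *getByteStreamFromArray*, that both sides agree on the (hypothetical) `isGzipVariant` flag, and that the sample data is made up. Variable names such as `receivedData` are purely illustrative.

```matlab
% --- Sending side (illustrative setup; the sample data is made up) ---
data = magic(4);
isGzipVariant = true;  % hypothetical flag, mirroring the snippet above
dataInBytes = getByteStreamFromArray(data);
baos = java.io.ByteArrayOutputStream;
if isGzipVariant
    zos = java.util.zip.GZIPOutputStream(baos);
else
    zos = java.util.zip.ZipOutputStream(baos);
    zos.putNextEntry(java.util.zip.ZipEntry('data.dat'));
end
zos.write(dataInBytes, 0, numel(dataInBytes));
zos.close;  % close() also finishes the compressed stream
compressedDataArray = baos.toByteArray();  % int8 array, ready to transmit

% --- Receiving side ---
bais = java.io.ByteArrayInputStream(compressedDataArray);
if isGzipVariant
    zis = java.util.zip.GZIPInputStream(bais);
else
    zis = java.util.zip.ZipInputStream(bais);
    zis.getNextEntry;  % position the stream at the single entry
end
streamCopier = com.mathworks.mlwidgets.io.InterruptibleStreamCopier.getInterruptibleStreamCopier;
baos2 = java.io.ByteArrayOutputStream;
streamCopier.copyStream(zis, baos2);
zis.close;

% Java bytes arrive as int8; reinterpret them as uint8 before deserializing
receivedBytes = typecast(baos2.toByteArray(), 'uint8');
receivedData  = getArrayFromByteStream(receivedBytes);
```

The only structural change compared with the file-based reader is that a `ByteArrayInputStream` takes the place of the `FileInputStream`; the decompression and deserialization steps stay the same.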

Matlab has a plethora of introductory books. But I have a special affection for one that was released only a few days ago: *MATLAB Succinctly ^{[4]}* by Dmitri Nesteruk, for which I was a technical editor/reviewer. It's a very readable and easy-to-follow book, and it's totally free, so go ahead and download it.

This may be a good place to update that my own [second] book,

18 Comments To "savezip utility"

#1 Comment By Martin On September 4, 2014 @ 16:18

I think you want to replace the relatively expensive

`data = uint8(mod(int16(data),256))'`

with a much faster typecast

`data = typecast(data, 'uint8')`

because what you really want is just to interpret the bits of the `data` differently. (Assuming MATLAB is clever enough to optimize away duplicating `data`, this operation is just a matter of changing the type field. And at least in the prerelease of R2014b it is that clever.)

I’m still somewhat bothered by the fact that these functions (*getArrayFromByteStream* and *getByteStreamFromArray*) cannot handle arbitrarily large MATLAB variables, i.e. those exceeding two (or was it four?) gigabytes in size. This probably does not matter to most users, but it’s not uncommon to have such sizes in scientific applications.

#2 Comment By Yair Altman On September 5, 2014 @ 06:44

@Martin – thanks for the feedback. My method (`data = uint8(mod(int16(data),256))'`) has a linear time cost which is faster than *typecast*’s constant cost (1-2ms) for data arrays up to ~1e5 elements in size (=100KB). Of course, YMMV on your specific platform. In any case, for smaller arrays my method is faster, for larger arrays *typecast* is better. Memory is not much of an issue I think, since my method does the operation in-place I should think.

I hear you re the 2-4GB limit. Alas, I’m not familiar with any magic wand here.

#3 Comment By Jan Berling On September 11, 2014 @ 00:27

I guess the input variable size for *getByteStreamFromArray* is limited by the memory.

If there is a hard size-limit it would be good to know it, so that assertions can be inserted in using functions.

Is it possible to split arbitrary MATLAB types in a formal way? Splitting arbitrary cells and structs in a heuristic way would be possible but not beautiful at all.

#4 Comment By Yair Altman On September 11, 2014 @ 00:31

@Jan – I am not aware of a way to do this.

#5 Comment By Martin On September 12, 2014 @ 16:47

@Jan – The limitation is definitely not due to the available memory. (Nowadays it is relatively easy to verify this empirically, since most people have access to machines with 8 GB RAM or more.)

The limit is either 2 or 4 GB depending on the type of data, since the format uses 32-bit signed integers in some places and 32-bit unsigned integers in other places. If we stick to plain arrays, the limit is 2^32-1 bytes or entries in one dimension, i.e. this stays within the limit (and thus works)

while

both fail.

If you use aggregate data types, e.g. cells or structs, then the limit is 4 GB. You can put 3 arrays of size 1 GB into a cell array and then successfully serialize it.

(However, there’s an elegant way to get beyond this limitation. Watch out for one of my future comments.)

#6 Comment By Martin On September 12, 2014 @ 17:01

@Yair – That’s an interesting (and to me quite surprising) observation that *typecast* is slower than the computation you use(d). I have to admit that I never benchmarked anything to support my claim from above. I still believe *typecast* is better because it makes the intent much clearer (to a low-level coder like I am).

#7 Comment By Martin On September 4, 2014 @ 17:20

If you want to get rid of *copyStream* paired with the rather awkward *com.mathworks.mlwidgets.io.InterruptibleStreamCopier.getInterruptibleStreamCopier* and replace it with something more standard (though not part of the core JDK), you can use *IOUtils* from Apache Commons, which also ships with MATLAB:

You can even get rid of the intermediate *ByteArrayOutputStream* (possibly resulting in improved speed and lower memory consumption) and simply write:

I haven’t checked when this became available in MATLAB, so this might not be the best solution for ancient MATLAB releases.

#8 Comment By Jan Berling On September 11, 2014 @ 00:43

Thanks for mentioning my toolbox!

The suggestion from Martin does not work on my ancient R2011b.

Because of the calculation time of the *getByteStreamFromArray* function it could be beneficial to calculate the bytestream directly in the function call for the Java-Stream. Maybe along the lines of

Like this there is a double calculation of the bytestream, there should be another way to calculate the size of the bytestream…

#9 Comment By Yair Altman On September 11, 2014 @ 00:48

@Jan – you are not avoiding the Matlab call/memory by placing the *getByteStreamFromArray* twice in the `zos.write()` call – you are simply multiplying it. In this case it would be better to use a local variable to store the data:

#10 Comment By Martin On September 12, 2014 @ 18:05

@Jan – I’m surprised *org.apache.commons.io.IOUtils.toByteArray()* didn’t work for you. I checked R2009b to R2014b and they all bundle IO from Apache Commons (R2009b-R2012a has version 1.3.1, R2012b has 2.1, and R2013a-R2014b has 2.4). I tested on Linux (64-bit, R2009b-R2014a) and on OS X (64-bit, R2013a-R2014b) and in all cases my test script succeeded. I wonder what’s different in your configuration. This is my minimal test (that has to be put in a script file):

#11 Comment By Jan Simon On October 18, 2014 @ 13:23

James Tursa’s MEX implementation of TYPECASTX uses a shared data copy instead of a deep copy, which is much faster than the original Matlab version: ^{[15]}

#12 Comment By Yair Altman On October 18, 2014 @ 14:23

@Jan – Thanks for the mention of Tursa’s TYPECASTX. I am of course familiar with it (and mention it in my upcoming book). Tursa’s TYPECASTX is indeed better than Matlab’s builtin *typecast*. However, in this specific case, type-casting usually takes only a small fraction of the overall time, so I felt it is better to use the built-in functions for simplicity.

#13 Comment By Péter Masa On November 7, 2014 @ 00:42

Similarly, it would be great to zip/unzip data into/from a Matlab variable, e.g.

This could be an invaluable tool when we want to keep in memory large amount of redundant data for maximum speed data access, e.g. picking rapidly images from thousands-of-images-database, which exceed the size of memory when it’s uncompressed.

#14 Comment By Holger Hoffmann On August 30, 2015 @ 14:39

Great post – thx all!

Would you know a way to write a table directly to a zip/gz-file instead of writetable() to csv and subsequently compressing it? As far as I know, in Java that’s possible.

Any help is greatly appreciated.

#15 Comment By Yair Altman On August 31, 2015 @ 11:04

@Holger – read the article carefully: I explained how you can serialize any Matlab data (including tables) into a byte stream that can then be compressed into a zip/gz file. You can use the *savezip* utility for this as explained in the article.

#16 Comment By Holger Hoffmann On September 3, 2015 @ 01:50

Dear Yair,

thanks for the quick response. Your code works perfectly fine. My problem is, that the file written is a gzipped mat-file. At least, if I uncompress it with some other program than matlab, I cannot interpret it easily. For a specific application, I need to have a gzipped csv-file (table). That means, if I gunzip it with some other program, it should give a common csv-file. Now, I guess that is possible with your code. But I don’t understand how. Could you please give an example? This would really be helpful and boost my performance a lot.

Thanks again.

#17 CommentByYair AltmanOn September 3, 2015 @ 04:24You will need to manually extract your data into CSV format from Matlab’s table object. There is no direct way of doing that other than using

.writetable#18 CommentByHolger HoffmannOn October 20, 2015 @ 13:02Thanks for clarifying. That’s what I guessed. However, it is possible in Java to read and write into or out of a gzipped csv-file in the format of old-school spreadsheet (e.g. Excel-compatible). So shouldn’t it be possible with MatLab via Java too? Perhaps that would require to mimick the “writetable” somehow. Unfortunately I currently dont have the time to continue with this, but it would still be super great, if someone figures this out.