Serializing/deserializing Matlab data

Last year I wrote an article on improving the performance of the save function. The article discussed various ways by which we can store Matlab data on disk. However, in many cases we are interested in a byte-stream serialization, in order to transmit information to external processes.

The request to get a serialized byte-stream of Matlab data has been around for many years (example), but MathWorks has never released a documented way of serializing and unserializing data, except by storing onto a disk file and later loading it from file. Naturally, using a disk file significantly degrades performance. We could always use a RAM-disk or flash memory for improved performance, but in any case this seems like a major overkill to such a simple requirement.

In last year’s article, I presented a File Exchange utility for such generic serialization/deserialization. However, that utility is limited in the types of data that it supports, and while it is relatively fast, there is a much better, more generic and faster solution.

The solution appears to use the undocumented built-in functions getByteStreamFromArray and getArrayFromByteStream, which are apparently used internally by the save and load functions. The usage is very simple:

byteStream = getByteStreamFromArray(anyData);  % 1xN uint8 array
anyData = getArrayFromByteStream(byteStream);

Many Matlab functions, documented and undocumented alike, are defined in XML files within the %matlabroot%/bin/registry/ folder; our specific functions can be found in %matlabroot%/bin/registry/hgbuiltins.xml. While other functions include information about their location and number of input/output args, these functions do not. Their only XML attribute is type = ":all:", which seems to indicate that they accept all data types as input. Despite the fact that the functions are defined in hgbuiltins.xml, they are not limited to HG objects – we can serialize basically any Matlab data: structs, class objects, numeric/cell arrays, sparse data, Java handles, timers, etc. For example:

% Simple Matlab data
>> byteStream = getByteStreamFromArray(pi)  % 1x72 uint8 array
byteStream =
  Columns 1 through 19
    0    1   73   77    0    0    0    0   14    0    0    0   56    0    0    0    6    0    0
  Columns 20 through 38
    0    8    0    0    0    6    0    0    0    0    0    0    0    5    0    0    0    8    0
  Columns 39 through 57
    0    0    1    0    0    0    1    0    0    0    1    0    0    0    0    0    0    0    9
  Columns 58 through 72
    0    0    0    8    0    0    0   24   45   68   84  251   33    9   64
 
>> getArrayFromByteStream(byteStream)
ans =
          3.14159265358979
 
% A cell array of several data types
>> byteStream = getByteStreamFromArray({pi, 'abc', struct('a',5)});  % 1x312 uint8 array
>> getArrayFromByteStream(byteStream)
ans = 
    [3.14159265358979]    'abc'    [1x1 struct]
 
% A Java object
>> byteStream = getByteStreamFromArray(java.awt.Color.red);  % 1x408 uint8 array
>> getArrayFromByteStream(byteStream)
ans =
java.awt.Color[r=255,g=0,b=0]
 
% A Matlab timer
>> byteStream = getByteStreamFromArray(timer);  % 1x2160 uint8 array
>> getArrayFromByteStream(byteStream)
 
   Timer Object: timer-2
 
   Timer Settings
      ExecutionMode: singleShot
             Period: 1
           BusyMode: drop
            Running: off
 
   Callbacks
           TimerFcn: ''
           ErrorFcn: ''
           StartFcn: ''
            StopFcn: ''
 
% A Matlab class object
>> byteStream = getByteStreamFromArray(matlab.System);  % 1x1760 uint8 array
>> getArrayFromByteStream(byteStream)
ans = 
  System: matlab.System

Serializing HG objects

Of course, we can also serialize/deserialize also HG controls, plots/axes and even entire figures. When doing so, it is important to serialize the handle of the object, rather than its numeric handle, since we are interested in serializing the graphic object, not the scalar numeric value of the handle:

% Serializing a simple figure with toolbar and menubar takes almost 0.5 MB !
>> hFig = handle(figure);  % a new default Matlab figure
>> length(getByteStreamFromArray(hFig))
ans =
      479128
 
% Removing the menubar and toolbar removes much of this amount:
>> set(hFig, 'menuBar','none', 'toolbar','none')
>> length(getByteStreamFromArray(hFig))
ans =
       11848   %!!!
 
% Plot lines are not nearly as "expensive" as the toolbar/menubar
>> x=0:.01:5; hp=plot(x,sin(x));
>> byteStream = getByteStreamFromArray(hFig);
>> length(byteStream)
ans =
       33088
 
>> delete(hFig);
>> hFig2 = getArrayFromByteStream(byteStream)
hFig2 =
	figure

The interesting thing here is that when we deserialize a byte-stream of an HG object, it is automatically rendered onscreen. This could be very useful for persistence mechanisms of GUI applications. For example, we can save the figure handles in file so that if the application crashes and relaunches, it simply loads the file and we get exactly the same GUI state, complete with graphs and what-not, just as before the crash. Although the figure was deleted in the last example, deserializing the data caused the figure to reappear.

We do not need to serialize the entire figure. Instead, we could choose to serialize only a specific plot line or axes. For example:

>> x=0:0.01:5; hp=plot(x,sin(x));
>> byteStream = getByteStreamFromArray(handle(hp));  % 1x13080 uint8 array
>> hLine = getArrayFromByteStream(byteStream)
ans =
	graph2d.lineseries

This could also be used to easily clone (copy) any figure or other HG object, by simply calling getArrayFromByteStream (note the corresponding copyobj function, which I bet uses the same underlying mechanism).

Also note that unlike HG objects, deserialized timers are NOT automatically restarted; perhaps the Running property is labeled transient or dependent. Properties defined with these attributes are apparently not serialized.

Performance aspects

Using the builtin getByteStreamFromArray and getArrayFromByteStream functions can provide significant performance speedups when caching Matlab data. In fact, it could be used to store otherwise unsupported objects using the save -v6 or savefast alternatives, which I discussed in my save performance article. Robin Ince has shown how this can be used to reduce the combined caching/uncaching run-time from 115 secs with plain-vanilla save, to just 11 secs using savefast. Robin hasn’t tested this in his post, but since the serialized data is a simple uint8 array, it is intrinsically supported by the save -v6 option, which is the fastest alternative of all:

>> byteStream = getByteStreamFromArray(hFig);
>> tic, save('test.mat','-v6','byteStream'); toc
Elapsed time is 0.001924 seconds.
 
>> load('test.mat')
>> data = load('test.mat')
data = 
    byteStream: [1x33256 uint8]
>> getArrayFromByteStream(data.byteStream)
ans =
	figure

Moreover, we can now use java.util.Hashtable to store a cache map of any Matlab data, rather than use the much slower and more limited containers.Map class provided in Matlab.

Finally, note that as built-in functions, these functions could change without prior notice on any future Matlab release.

MEX interface – mxSerialize/mxDeserialize

To complete the picture, MEX includes a couple of undocumented functions mxSerialize and mxDeserialize, which correspond to the above functions. getByteStreamFromArray and getArrayFromByteStream apparently call them internally, since they provide the same results. Back in 2007, Brad Phelan wrote a MEX wrapper that could be used directly in Matlab (mxSerialize.c, mxDeserialize.c). The C interface was very simple, and so was the usage:

#include "mex.h"
 
EXTERN_C mxArray* mxSerialize(mxArray const *);
EXTERN_C mxArray* mxDeserialize(const void *, size_t);
 
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nlhs && nrhs) {
          plhs[0] = (mxArray *) mxSerialize(prhs[0]);
        //plhs[0] = (mxArray *) mxDeserialize(mxGetData(prhs[0]), mxGetNumberOfElements(prhs[0]));
    }
}

Unfortunately, MathWorks has removed the C interface for these functions from libmx in R2014a, keeping only their C++ interfaces:

mxArray* matrix::detail::noninlined::mx_array_api::mxSerialize(mxArray const *anyData)
mxArray* matrix::detail::noninlined::mx_array_api::mxDeserialize(void const *byteStream, unsigned __int64 numberOfBytes)
mxArray* matrix::detail::noninlined::mx_array_api::mxDeserializeWithTag(void const *byteStream, unsigned __int64 numberOfBytes, char const* *tagName)

These are not the only MEX functions that were removed from libmx in R2014a. Hundreds of other C functions were also removed with them, some of them quite important (e.g., mxCreateSharedDataCopy). A few hundred new C++ functions were added in their place, but I fear that these are not accessible to MEX users without a code change (see below). libmx has always changed between Matlab releases, but not so drastically for many years. If you rely on any undocumented MEX functions in your code, now would be a good time to recheck it, before R2014a is officially released.

Thanks to Bastian Ebeling, we can still use these interfaces in our MEX code by simply renaming the MEX file from .c to .cpp and modifying the code as follows:

#include "mex.h"
 
// MX_API_VER has unfortunately not changed between R2013b and R2014a,
// so we use the new MATRIX_DLL_EXPORT_SYM as an ugly hack instead
#if defined(__cplusplus) && defined(MATRIX_DLL_EXPORT_SYM)
    #define EXTERN_C extern
    namespace matrix{ namespace detail{ namespace noninlined{ namespace mx_array_api{
#endif
 
EXTERN_C mxArray* mxSerialize(mxArray const *);
EXTERN_C mxArray* mxDeserialize(const void *, size_t);
// and so on, for any other MEX C functions that migrated to C++ in R2014a
 
#if defined(__cplusplus) && defined(MATRIX_DLL_EXPORT_SYM)
    }}}}
    using namespace matrix::detail::noninlined::mx_array_api;
#endif
 
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nlhs && nrhs) {
        plhs[0] = (mxArray *) mxSerialize(prhs[0]);
      //plhs[0] = (mxArray *) mxDeserialize(mxGetData(prhs[0]), mxGetNumberOfElements(prhs[0]));
    }
}

Unfortunately, pre-R2014a code cannot coexist with R2014a code (since libmx is different), so separate MEX files need to be used depending on the Matlab version being used. This highlights the risk of using such unsupported functions.

The roundabout alternative is of course to use mexCallMATLAB to invoke getByteStreamFromArray and getArrayFromByteStream. This is actually rather silly, but it works…

p.s. – Happy 30th anniversary, MathWorks!

Addendum March 9, 2014

Now that the official R2014a has been released, I am happy to report that most of the important MEX functions that were removed in the pre-release have been restored in the official release. These include mxCreateSharedDataCopy, mxFastZeros, mxCreateUninitDoubleMatrix, mxCreateUninitNumericArray, mxCreateUninitNumericMatrix and mxGetPropertyShared. Unfortunately, mxSerialize and mxDeserialize remain among the functions that were left out, which is a real pity considering their usefulness, but we can use one of the workarounds mentioned above. At least those functions that were critical for in-place data manipulation and improved MATLAB performance have been restored, perhaps in some part due to lobbying by yours truly and by others.

MathWorks should be commended for their meaningful dialog with users and for making the fixes in such a short turn-around before the official release, despite the fact that they belong to the undocumented netherworld. MathWorks may appear superficially to be like any other corporate monolith, but when you scratch the surface you discover that there are people there who really care about users, not just the corporate bottom line. I must say that I really like this aspect of their corporate culture.

Related posts:

  1. Controlling plot data-tips Data-tips are an extremely useful plotting tool that can easily be controlled programmatically....
  2. Draggable plot data-tips Matlab's standard plot data-tips can be customized to enable dragging, without being limitted to be adjacent to their data-point. ...
  3. Accessing plot brushed data Plot data brushing can be accessed programmatically using very simple pure-Matlab code...
  4. Editbox data input validation Undocumented features of Matlab editbox uicontrols enable immediate user-input data validation...
  5. Sparse data math info Matlab contains multiple libraries for handling sparse data. These can report very detailed internal info. ...
  6. Undocumented Matlab MEX API Matlab's MEX API contains numerous undocumented functions, that can be extremely useful. ...

Categories: Handle graphics, High risk of breaking in future versions, Mex, Undocumented function

Tags: , , ,

Bookmark and SharePrint Print

10 Responses to Serializing/deserializing Matlab data

  1. Andre Kuehne says:

    There seems to be a ~4 GB size limitation for objects to be serialized with getByteStreamFromArray:

    a = rand(2^30,1,'single');
    aStream = getByteStreamFromArray(a);

    will result in

    Error using getByteStreamFromArray
    Error during serialization

    but using a slightly smaller array (haven’t found the exact overhead) will work

    a = rand(2^30-64,1,'single');
    aStream = getByteStreamFromArray(a);

    My data is usually larger than 4GB, so unfortunately I cannot use this method to quickly save it to files. Also, since I usually have complex (real/imag) data, I cannot use savefast either since it only supports real data. Do you have any ideas for accelerating saving in that scenario?

    • @Andre – you can split the data into <4GB chunks and/or into separate components for the real and imaginary portions.
      If you're using MEX and simple numeric arrays, then mxGetPr will get you a pointer to the real data and mxGetPi will get you a similar pointer to the imaginary data, that you can process separately.

  2. Dr. Bastian Ebeling says:

    As in your upper question – I’d like to tell you, I’ve successfully used some of those C++-Undocumented functions even in mex-files.
    Greets
    Bastian

  3. HexiuM says:

    I am not 100% sure that I understand the topic of data serialization, but while testing I found this strange (according to what I expected) behavior. If I give the following command:

    a = getByteStreamFromArray(2)

    I am getting a 72-column array. If I then issue:

    b = getByteStreamFromArray(3)

    I get the exact same array. Confirmed by:

    sum(abs(a-b))

    which gives 0.

    Then if I do:

    getArrayFromByteStream(a)

    I get 2
    and if I do

    getArrayFromByteStream(b)

    I get 3 (as expected). So the question comes down to (irrespective of what getArrayFromByteStream does) how can it give different results for the same input data (remember a==b)?

    Thanks!

  4. Ian says:

    Thanks Yair this is incredibly helpful, I’ve been wanting a better way to serialize matlab class objects to send across a TCP connection for ages, saving to disk was such a slow workaround…

  5. Andreas Martin says:

    Thank you, this would be very useful for me. Do you know if this is independent of platform and OS (32/64bit, little/big endian, win/linux)?

    • @Andreas – I don’t know. I assume so, because I believe that this is the underlying mechanism used by the save/load functions, but I cannot be certain since I do not have the source code. It should be easy enough to test, though.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

<pre lang="matlab">
a = magic(3);
sum(a)
</pre>