Last year I wrote an article on improving the performance of the save function. The article discussed various ways by which we can store Matlab data on disk. However, in many cases we are interested in a byte-stream serialization, in order to transmit information to external processes.
The request to get a serialized byte-stream of Matlab data has been around for many years (example), but MathWorks has never released a documented way of serializing and deserializing data, except by storing onto a disk file and later loading it from file. Naturally, using a disk file significantly degrades performance. We could always use a RAM-disk or flash memory for improved performance, but in any case this seems like major overkill for such a simple requirement.
In last year’s article, I presented a File Exchange utility for such generic serialization/deserialization. However, that utility is limited in the types of data that it supports, and while it is relatively fast, there is a much better, more generic and faster solution.
The solution appears to be the undocumented built-in functions getByteStreamFromArray and getArrayFromByteStream, which are apparently used internally by the save and load functions. The usage is very simple:
    byteStream = getByteStreamFromArray(anyData);  % 1xN uint8 array
    anyData = getArrayFromByteStream(byteStream);
Many Matlab functions, documented and undocumented alike, are defined in XML files within the %matlabroot%/bin/registry/ folder; our specific functions can be found in %matlabroot%/bin/registry/hgbuiltins.xml. While other functions include information about their location and number of input/output args, these functions do not. Their only XML attribute is type = ":all:", which seems to indicate that they accept all data types as input. Despite the fact that the functions are defined in hgbuiltins.xml, they are not limited to HG objects – we can serialize basically any Matlab data: structs, class objects, numeric/cell arrays, sparse data, Java handles, timers, etc. For example:
    % Simple Matlab data
    >> byteStream = getByteStreamFromArray(pi)  % 1x72 uint8 array
    byteStream =
      Columns 1 through 19
        0 1 73 77 0 0 0 0 14 0 0 0 56 0 0 0 6 0 0
      Columns 20 through 38
        0 8 0 0 0 6 0 0 0 0 0 0 0 5 0 0 0 8 0
      Columns 39 through 57
        0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 9
      Columns 58 through 72
        0 0 0 8 0 0 0 24 45 68 84 251 33 9 64

    >> getArrayFromByteStream(byteStream)
    ans =
          3.14159265358979

    % A cell array of several data types
    >> byteStream = getByteStreamFromArray({pi, 'abc', struct('a',5)});  % 1x312 uint8 array
    >> getArrayFromByteStream(byteStream)
    ans =
        [3.14159265358979]    'abc'    [1x1 struct]

    % A Java object
    >> byteStream = getByteStreamFromArray(java.awt.Color.red);  % 1x408 uint8 array
    >> getArrayFromByteStream(byteStream)
    ans =
    java.awt.Color[r=255,g=0,b=0]

    % A Matlab timer
    >> byteStream = getByteStreamFromArray(timer);  % 1x2160 uint8 array
    >> getArrayFromByteStream(byteStream)

       Timer Object: timer-2

       Timer Settings
          ExecutionMode: singleShot
                 Period: 1
               BusyMode: drop
                Running: off

       Callbacks
               TimerFcn: ''
               ErrorFcn: ''
               StartFcn: ''
                StopFcn: ''

    % A Matlab class object
    >> byteStream = getByteStreamFromArray(matlab.System);  % 1x1760 uint8 array
    >> getArrayFromByteStream(byteStream)
    ans =
      System: matlab.System
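A quick way to gain confidence in the round trip is to compare each deserialized value against its original. The following is only a minimal sketch: the list of sample values is arbitrary, and since the two functions are undocumented, their behavior may differ between Matlab releases.

```matlab
% Round-trip several kinds of Matlab data through the byte-stream functions
% and verify that the restored value matches the original in class and content.
% The sample values below are arbitrary illustrations.
samples = {pi, 'abc', uint8(1:10), sparse(eye(3)), {1,'two',[3 4]}, struct('a',5,'b','text')};
for k = 1:numel(samples)
    original   = samples{k};
    byteStream = getByteStreamFromArray(original);    % uint8 byte-stream
    restored   = getArrayFromByteStream(byteStream);  % back to Matlab data
    assert(isa(byteStream,'uint8'));
    assert(isequal(class(restored), class(original)));
    assert(isequal(restored, original));
    fprintf('%-8s -> %d bytes\n', class(original), numel(byteStream));
end
```

The printed sizes show the per-value overhead of the stream format, which is noticeable for small values (72 bytes for a scalar double in the example above).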
Serializing HG objects
Of course, we can also serialize/deserialize HG controls, plots/axes and even entire figures. When doing so, it is important to serialize the handle() object of the graphic component, rather than its numeric handle, since we are interested in serializing the graphic object itself, not the scalar numeric value of the handle:
    % Serializing a simple figure with toolbar and menubar takes almost 0.5 MB !
    >> hFig = handle(figure);  % a new default Matlab figure
    >> length(getByteStreamFromArray(hFig))
    ans =
          479128

    % Removing the menubar and toolbar removes much of this amount:
    >> set(hFig, 'menuBar','none', 'toolbar','none')
    >> length(getByteStreamFromArray(hFig))
    ans =
           11848   %!!!

    % Plot lines are not nearly as "expensive" as the toolbar/menubar
    >> x=0:.01:5; hp=plot(x,sin(x));
    >> byteStream = getByteStreamFromArray(hFig);
    >> length(byteStream)
    ans =
           33088

    >> delete(hFig);
    >> hFig2 = getArrayFromByteStream(byteStream)
    hFig2 =
        figure
The interesting thing here is that when we deserialize a byte-stream of an HG object, it is automatically rendered onscreen. This could be very useful for persistence mechanisms of GUI applications. For example, we can save the serialized figure handles in a file, so that if the application crashes and relaunches, it simply loads the file and we get exactly the same GUI state, complete with graphs and what-not, just as before the crash. Although the figure was deleted in the last example, deserializing the data caused the figure to reappear.
We do not need to serialize the entire figure. Instead, we could choose to serialize only a specific plot line or axes. For example:
    >> x=0:0.01:5; hp=plot(x,sin(x));
    >> byteStream = getByteStreamFromArray(handle(hp));  % 1x13080 uint8 array
    >> hLine = getArrayFromByteStream(byteStream)
    hLine =
        graph2d.lineseries
This could also be used to easily clone (copy) any figure or other HG object, by simply calling getArrayFromByteStream (note the corresponding copyobj function, which I bet uses the same underlying mechanism).
Also note that unlike HG objects, deserialized timers are NOT automatically restarted; perhaps the Running property is labeled transient or dependent. Properties defined with these attributes are apparently not serialized.
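The effect of such attributes can be checked with a small user-defined class. The sketch below uses a hypothetical class named SerialDemo (not part of Matlab or of the original article), and it assumes that Transient properties revert to their default value after deserialization, which is what the observation above suggests but is not documented anywhere.

```matlab
% File: SerialDemo.m  (hypothetical class name, used only for illustration)
%
% Usage, from the command prompt or from a separate script:
%   obj = SerialDemo;  obj.Value = 42;  obj.Scratch = rand(3);
%   copy = getArrayFromByteStream(getByteStreamFromArray(obj));
%   copy.Value             % expected to keep the stored value
%   isempty(copy.Scratch)  % expected to be true (reverted to the default)
classdef SerialDemo
    properties
        Value = 0       % ordinary property: expected to be serialized
    end
    properties (Transient)
        Scratch = []    % Transient property: expected to be skipped
    end
end
```

If this assumption holds, marking large temporary or cache properties as Transient is also a simple way to keep the resulting byte-stream small.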
Performance aspects
Using the builtin getByteStreamFromArray and getArrayFromByteStream functions can provide significant performance speedups when caching Matlab data. In fact, it could be used to store otherwise unsupported objects using the save -v6 or savefast alternatives, which I discussed in my save performance article. Robin Ince has shown how this can be used to reduce the combined caching/uncaching run-time from 115 secs with plain-vanilla save, to just 11 secs using savefast. Robin hasn’t tested this in his post, but since the serialized data is a simple uint8 array, it is intrinsically supported by the save -v6 option, which is the fastest alternative of all:
    >> byteStream = getByteStreamFromArray(hFig);
    >> tic, save('test.mat','-v6','byteStream'); toc
    Elapsed time is 0.001924 seconds.

    >> load('test.mat')
    >> data = load('test.mat')
    data =
        byteStream: [1x33256 uint8]

    >> getArrayFromByteStream(data.byteStream)
    ans =
        figure
Moreover, we can now use java.util.Hashtable to store a cache map of any Matlab data, rather than use the much slower and more limited containers.Map class provided in Matlab.
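A minimal sketch of such a cache follows. The key name and the sample data are arbitrary, and the explicit int8/uint8 typecasts are my own precaution: Matlab hands a Java byte[] back as an int8 column vector, so the stream is converted to int8 on the way in and back to a uint8 row vector on the way out.

```matlab
% Cache arbitrary Matlab data in a java.util.Hashtable as a serialized byte array.
% The key 'myKey' and the sample struct are arbitrary illustrations.
cache = java.util.Hashtable;
data  = struct('name','example', 'values',magic(4));

% Store: serialize, then reinterpret the bytes as int8 (Java's signed byte type)
byteStream = getByteStreamFromArray(data);
cache.put('myKey', typecast(byteStream,'int8'));

% Retrieve: Java byte[] comes back as an int8 column vector
stored   = cache.get('myKey');
restored = getArrayFromByteStream(typecast(stored,'uint8').');  % back to a 1xN uint8 row

assert(isequal(restored, data));
```

Since the Hashtable only ever sees a byte array, the same pattern should apply to any data type that the serialization functions accept.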
Finally, note that as undocumented built-in functions, these functions could change without prior notice in any future Matlab release.
MEX interface – mxSerialize/mxDeserialize
To complete the picture, MEX includes a couple of undocumented functions mxSerialize and mxDeserialize, which correspond to the above functions. getByteStreamFromArray and getArrayFromByteStream apparently call them internally, since they provide the same results. Back in 2007, Brad Phelan wrote a MEX wrapper that could be used directly in Matlab (mxSerialize.c, mxDeserialize.c). The C interface was very simple, and so was the usage:
    #include "mex.h"

    EXTERN_C mxArray* mxSerialize(mxArray const *);
    EXTERN_C mxArray* mxDeserialize(const void *, size_t);

    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
    {
        if (nlhs && nrhs) {
            plhs[0] = (mxArray *) mxSerialize(prhs[0]);
            //plhs[0] = (mxArray *) mxDeserialize(mxGetData(prhs[0]), mxGetNumberOfElements(prhs[0]));
        }
    }
Unfortunately, MathWorks has removed the C interface for these functions from libmx in R2014a, keeping only their C++ interfaces:
    mxArray* matrix::detail::noninlined::mx_array_api::mxSerialize(mxArray const *anyData)
    mxArray* matrix::detail::noninlined::mx_array_api::mxDeserialize(void const *byteStream, unsigned __int64 numberOfBytes)
    mxArray* matrix::detail::noninlined::mx_array_api::mxDeserializeWithTag(void const *byteStream, unsigned __int64 numberOfBytes, char const* *tagName)
These are not the only MEX functions that were removed from libmx in R2014a. Hundreds of other C functions were also removed with them, some of them quite important (e.g., mxCreateSharedDataCopy). A few hundred new C++ functions were added in their place, but I fear that these are not accessible to MEX users without a code change (see below). libmx has always changed between Matlab releases, but not so drastically for many years. If you rely on any undocumented MEX functions in your code, now would be a good time to recheck it, before R2014a is officially released.
Thanks to Bastian Ebeling, we can still use these interfaces in our MEX code by simply renaming the MEX file from .c to .cpp and modifying the code as follows:
    #include "mex.h"

    // MX_API_VER has unfortunately not changed between R2013b and R2014a,
    // so we use the new MATRIX_DLL_EXPORT_SYM as an ugly hack instead
    #if defined(__cplusplus) && defined(MATRIX_DLL_EXPORT_SYM)
        #define EXTERN_C extern
        namespace matrix{ namespace detail{ namespace noninlined{ namespace mx_array_api{
    #endif

    EXTERN_C mxArray* mxSerialize(mxArray const *);
    EXTERN_C mxArray* mxDeserialize(const void *, size_t);
    // and so on, for any other MEX C functions that migrated to C++ in R2014a

    #if defined(__cplusplus) && defined(MATRIX_DLL_EXPORT_SYM)
        }}}}
        using namespace matrix::detail::noninlined::mx_array_api;
    #endif

    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
    {
        if (nlhs && nrhs) {
            plhs[0] = (mxArray *) mxSerialize(prhs[0]);
            //plhs[0] = (mxArray *) mxDeserialize(mxGetData(prhs[0]), mxGetNumberOfElements(prhs[0]));
        }
    }
Unfortunately, pre-R2014a code cannot coexist with R2014a code (since libmx is different), so separate MEX files need to be used depending on the Matlab version being used. This highlights the risk of using such unsupported functions.
The roundabout alternative is of course to use mexCallMATLAB to invoke getByteStreamFromArray and getArrayFromByteStream. This is actually rather silly, but it works…
p.s. – Happy 30th anniversary, MathWorks!
Addendum March 9, 2014
Now that the official R2014a has been released, I am happy to report that most of the important MEX functions that were removed in the pre-release have been restored in the official release. These include mxCreateSharedDataCopy, mxFastZeros, mxCreateUninitDoubleMatrix, mxCreateUninitNumericArray, mxCreateUninitNumericMatrix and mxGetPropertyShared. Unfortunately, mxSerialize and mxDeserialize remain among the functions that were left out, which is a real pity considering their usefulness, but we can use one of the workarounds mentioned above. At least those functions that were critical for in-place data manipulation and improved MATLAB performance have been restored, perhaps in some part due to lobbying by yours truly and by others.
MathWorks should be commended for their meaningful dialog with users and for making the fixes in such a short turn-around before the official release, despite the fact that they belong to the undocumented netherworld. MathWorks may appear superficially to be like any other corporate monolith, but when you scratch the surface you discover that there are people there who really care about users, not just the corporate bottom line. I must say that I really like this aspect of their corporate culture.
There seems to be a ~4 GB size limitation for objects to be serialized with getByteStreamFromArray:
will result in
but using a slightly smaller array (haven’t found the exact overhead) will work
My data is usually larger than 4GB, so unfortunately I cannot use this method to quickly save it to files. Also, since I usually have complex (real/imag) data, I cannot use savefast either since it only supports real data. Do you have any ideas for accelerating saving in that scenario?
@Andre – you can split the data into <4GB chunks and/or into separate components for the real and imaginary portions.
If you're using MEX and simple numeric arrays, then mxGetPr will get you a pointer to the real data and mxGetPi will get you a similar pointer to the imaginary data, that you can process separately.
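On the Matlab side, the splitting idea can be sketched roughly as follows. The chunk budget used here is deliberately tiny so that the example runs quickly; for real data it would be set safely below the size limit. The variable names are illustrative only, and the sketch assumes a 2D double-precision complex array that is split by columns.

```matlab
% Serialize a complex array in column chunks, with separate real and imaginary parts.
% maxBytes is an illustrative budget; real code would use a value safely under the limit.
data     = complex(rand(1000,8), rand(1000,8));
maxBytes = 16*1024;

bytesPerCol  = size(data,1) * 8;                     % 8 bytes per double element
colsPerChunk = max(1, floor(maxBytes / bytesPerCol));
starts       = 1 : colsPerChunk : size(data,2);

% Serialize each chunk: column 1 holds the real parts, column 2 the imaginary parts
chunks = cell(numel(starts), 2);
for k = 1:numel(starts)
    cols = starts(k) : min(starts(k)+colsPerChunk-1, size(data,2));
    chunks{k,1} = getByteStreamFromArray(real(data(:,cols)));
    chunks{k,2} = getByteStreamFromArray(imag(data(:,cols)));
end

% Deserialize and reassemble the chunks in their original column order
parts = cell(1, numel(starts));
for k = 1:numel(starts)
    parts{k} = complex(getArrayFromByteStream(chunks{k,1}), ...
                       getArrayFromByteStream(chunks{k,2}));
end
restored = [parts{:}];

assert(isequal(restored, data));
```

Each entry of chunks is a plain real uint8 array, so the pieces can then be stored with save -v6 or savefast, as discussed in the article.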
@Andre – The limit is either 2 or 4 GB depending on the type of data, since the format uses 32-bit signed integers in some places and 32-bit unsigned integers in other places. If we stick to plain arrays, the limit is 2^32-1 bytes or entries in one dimension, i.e. this stays within the limit (and thus works)
while
both fail.
If you use aggregate data types, e.g. cells or structs, then the limit is 4 GB. You can put 3 arrays of size 1 GB into a cell array and then successfully serialize it.
(However, there’s an elegant way to get beyond this limitation. Watch out for one of my future comments.)
Regarding your question above: I’d like to tell you that I’ve successfully used some of those undocumented C++ functions even in MEX files.
Greets
Bastian
I am not 100% sure that I understand the topic of data serialization, but while testing I found this strange (according to what I expected) behavior. If I give the following command:
I am getting a 72-column array. If I then issue:
I get the exact same array. Confirmed by:
which gives 0.
Then if I do:
I get 2
and if I do
I get 3 (as expected). So the question comes down to (irrespective of what getArrayFromByteStream does) how can it give different results for the same input data (remember a==b)?
Thanks!
Oops…my mistake! sum(abs(a-b)) is obviously not right when handling uint8 data!
Sorry
@Hexium – you made a mistake. The two byte-streams differ in the 71st element: it is 8 for getByteStreamFromArray(3) and 0 for getByteStreamFromArray(2).
Try:
And you’ll get 8; this is because the data type is uint8, so a ~= b.
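The pitfall in this exchange is that uint8 arithmetic saturates: a negative difference is clipped to 0, so sum(abs(a-b)) can report 0 for streams that actually differ. A short sketch of safer comparisons follows (the variable names match the comments above; the exact position of the differing byte depends on the undocumented stream format):

```matlab
% Compare two byte-streams without being fooled by uint8 saturation
a = getByteStreamFromArray(2);
b = getByteStreamFromArray(3);

saturatedDiff = sum(abs(a - b));                  % uint8 math: negative differences clip to 0
trueDiff      = sum(abs(double(a) - double(b)));  % convert to double before subtracting
diffIdx       = find(a ~= b);                     % positions where the streams differ
sameStream    = isequal(a, b);                    % the simplest reliable equality test

fprintf('saturated=%d, true=%d, differing positions: %s\n', ...
        saturatedDiff, trueDiff, mat2str(diffIdx));
```

In general, isequal or an element-wise ~= comparison is the safer way to check whether two byte-streams are identical.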
Thanks Yair this is incredibly helpful, I’ve been wanting a better way to serialize matlab class objects to send across a TCP connection for ages, saving to disk was such a slow workaround…
Thank you, this would be very useful for me. Do you know if this is independent of platform and OS (32/64bit, little/big endian, win/linux)?
@Andreas – I don’t know. I assume so, because I believe that this is the underlying mechanism used by the save/load functions, but I cannot be certain since I do not have the source code. It should be easy enough to test, though.
As far as I know this is the same mechanism that is used to distribute data when using the Parallel Computing Toolbox or the MATLAB Distributed Computing Server (parfor, spmd, etc.). This works across multiple machines with different operating systems. I think the endianness is not really relevant, as all currently supported platforms are little endian. And since the format is 32-bit even on a 64-bit machine, I would be surprised if this were an issue. In short: I’m fairly certain the format is highly portable.
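One simple check that can be run on each platform of interest is to look at how the payload bytes are laid out. The sketch below assumes, based on the pi example in the article, that the last 8 bytes of the stream for a scalar double hold its raw value; this is an observation about an undocumented format, so it may not hold in every release.

```matlab
% Inspect the byte order of the serialized payload for a scalar double
[~, ~, endian] = computer;               % 'L' = little-endian host, 'B' = big-endian host
byteStream = getByteStreamFromArray(pi);
tail  = byteStream(end-7:end);           % assumed: raw 8-byte payload of the double
value = typecast(tail, 'double');        % interpreted in the host's native byte order

fprintf('host endianness: %s, payload decodes to pi: %d\n', endian, isequal(value, pi));
```

On a little-endian host, a result of 1 indicates that the payload is stored in little-endian order, which matches the bytes 24 45 68 84 251 33 9 64 shown in the article. Comparing streams generated on different platforms with isequal would be the more direct portability test.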
[…] A few months ago I wrote about Matlab’s undocumented serialization/deserialization functions, getByteStreamFromArray and getArrayFromByteStream […]
Thanks Yair, this is a great option for saving of customized objects that control graphics features (HG1 *sigh*). I have noticed though that it looks like listener objects created by addlistener cannot be cleanly recreated from a byte stream. The warning suggests the constructor needs a name. Potentially some other objects may have similar vulnerabilities
It’s worth noting that getByteStreamFromArray calls saveobj. For regular MATLAB variables, this won’t make any difference, but if you have a MATLAB class, you can overload saveobj, and then getByteStreamFromArray will be serializing the output of saveobj. (This is why Transient and Dependent properties are not serialized). Analogously, getArrayFromByteStream also calls loadobj.
[…] actually a very simple and robust built-in solution… as long as we’re comfortable with undocumented functionality. The function b=getByteStreamFromArray(v) converts a value to a uint8 array of […]
Thanks Yair, this would be very useful. However, there seems to be a bug when some objects are deserialized with getArrayFromByteStream in deployed mode (dll). Take AlexNet (available from the Add-On Explorer) for example: when we serialize and deserialize alexnet using the following code:
it performs well in both MATLAB and deployed mode.
But if we save netByte in a MAT file in advance, then it does not work in dll mode:
We cannot load net2.mat. It does not appear to be created correctly.
Do you have any ideas for solving this problem?
Thanks!
@Roc – try to convert your data to int16 before saving, and then convert back to uint8 after loading:
@Yair, thank you very much. I tried your suggestion, but the result is the same as before. It is noteworthy that the following code works fine in MATLAB:
but when I use the following command
mcc -W cpplib:testDeserialized -T link:lib testDeserialized.m -C;
to generate the deployed files (testDeserialized.dll, testDeserialized.lib, testDeserialized.ctf) and then call testDeserialized.dll from C++, an error occurs when I try to load D:\net00.mat. The error message is as follows:
When I double-click net00.mat, the error message changes to:
Do you have any good suggestions? Thanks.
I solved the problem. The problem is that when you compile to DLL, if you use the following command
mcc -W cpplib:testDeserialized -T link:lib testDeserialized.m -C;
the compiler cannot detect and package all the functions that the deserialization depends on. The fix is to create an empty object of the type that needs to be deserialized, save it to a MAT file (e.g. nullObj.mat), and then package that file with the -a option; the compiler will analyze the MAT file and automatically find all of the dependent functions. As shown below
Thanks again @Yair
@Roc – thanks for the follow-up for the benefit of other readers
This is a great article! Does anyone know the format of the header information in the converted bytestream?
Thanks!