Once again I’d like to welcome guest blogger Peter Li. Peter wrote about Matlab Mex in-place editing last month. Today, Peter pokes around in Matlab’s internal memory representation to the greater good and glory of Matlab Mex programming.
Disclaimer: The information in this article is provided for informational purposes only. Be aware that poking into Matlab’s internals is not condoned or supported by MathWorks, and is not recommended for any regular usage. Poking into memory has the potential to crash your computer so save your data! Moreover, be advised (as the text below will show) that the information is highly prone to change without any advance notice in future Matlab releases, which could lead to very adverse effects on any program that relies on it. On the scale of undocumented Matlab topics, this practically breaks the scale, so be EXTREMELY careful when using this.
A few weeks ago I discussed Matlab’s copy-on-write mechanism as part of my discussion of editing Matlab arrays in-place. Today I want to explore some behind-the-scenes details of how the copy-on-write mechanism is implemented. In the process we will learn a little about Matlab’s internal array representation. I will also introduce some simple tools you can use to explore more of Matlab’s internals. I will only cover basic information, so there are plenty more details left to be filled in by others who are interested.
Brief review of copy-on-write and mxArray
Copy-on-write is Matlab’s mechanism for avoiding unnecessary duplication of data in memory. To implement this, Matlab needs to keep track internally of which sets of variables are copies of each other. As described in MathWorks’s article, “the Matlab language works with a single object type: the Matlab array. All Matlab variables (including scalars, vectors, matrices, strings, cell arrays, structures, and objects) are stored as Matlab arrays. In C/C++, the Matlab array is declared to be of type mxArray”. This means that mxArray defines how Matlab lays out all the information about an array (its Matlab data type, its size, its data, etc.) in memory. So understanding Matlab’s internal array representation basically boils down to understanding mxArray.
Unfortunately, MathWorks also tells us that “mxArray is a C language opaque type”. This means that MathWorks does not expose the organization of mxArray to users (i.e. Matlab or Mex programmers). Instead, MathWorks defines mxArray internally, and allows users to interact with it only through an API, a set of functions that know how to handle mxArray in their back end. So, for example, a Mex programmer does not get the dimensions of an mxArray by directly accessing the relevant field in memory. Instead, the Mex programmer only has a pointer to the mxArray, and passes this pointer into an API function that knows where in memory to find the requested information and then passes the result back to the programmer.
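To make the idea of an opaque type concrete, here is a minimal C++ sketch of the general pattern. It is not MathWorks code: the names OpaqueArray, array_create, array_rows, array_cols and array_destroy are made up for illustration. Callers would only ever see the forward declaration and the accessor functions, while the field layout stays private to the library.

```cpp
#include <cassert>
#include <cstddef>

// --- What a public header would expose (all names are hypothetical) ---
struct OpaqueArray;   // declared only; callers never see the fields
OpaqueArray *array_create(std::size_t rows, std::size_t cols);
std::size_t array_rows(const OpaqueArray *a);
std::size_t array_cols(const OpaqueArray *a);
void array_destroy(OpaqueArray *a);

// --- What only the library's own source file would contain ---
struct OpaqueArray {
    std::size_t rows;
    std::size_t cols;
};

OpaqueArray *array_create(std::size_t rows, std::size_t cols) {
    return new OpaqueArray{rows, cols};
}

// Accessors know the layout, so callers do not have to.
std::size_t array_rows(const OpaqueArray *a) {
    assert(a != nullptr);
    return a->rows;
}

std::size_t array_cols(const OpaqueArray *a) {
    assert(a != nullptr);
    return a->cols;
}

void array_destroy(OpaqueArray *a) { delete a; }
```

If the library later reorders or adds fields, code that only calls the accessors keeps working after a recompile of the library, which is the property the API layer is meant to provide.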
This is generally a good thing: the API provides an abstraction layer between the programmer and the memory structures so that if MathWorks needs to change the back end organization (to add a new feature for example), we programmers do not need to modify our code; instead MathWorks just updates the API to reflect the new internal organization. On the other hand, being able to look into the internal structure of mxArray on occasion can help us understand how Matlab works, and can help us write more efficient code if we are careful, as in the example of editing arrays in-place.
So how do we get a glimpse inside mxArray? The first step is simply to find the region of memory where the mxArray lives: its beginning and end. Finding where in memory the mxArray begins is pretty easy: it is given by its pointer value. Here is a simple Mex function that takes a Matlab array as input and prints its memory address:
/* printaddr.cpp */
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs < 1) mexErrMsgTxt("One input required.");
    /* Print the address of the first input's mxArray header. */
    mexPrintf("%p\n", static_cast<const void *>(prhs[0]));
}
This function is nice as it prints the address in a standard hexadecimal format. The same information can also be received directly in Matlab (i.e., without needing printaddr), using the undocumented format debug command (here’s another reference):
>> format debug
>> A = 1:10
A =
Structure address = 7fc3b8869ae0
m = 1
n = 10
pr = 7fc44922c890
pi = 0
     1     2     3     4     5     6     7     8     9    10
>> printaddr(A)
7fc3b8869ae0
To play with this further from within Matlab however, it’s nice to have the address returned to us as a 64-bit unsigned integer; here’s a Mex function that does that:
/* getaddr.cpp */
#include <stdint.h>
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs < 1) mexErrMsgTxt("One input required.");
    plhs[0] = mxCreateNumericMatrix(1, 1, mxUINT64_CLASS, mxREAL);
    /* Use a fixed 64-bit type: unsigned long is only 32 bits on 64-bit Windows. */
    uint64_t *out = static_cast<uint64_t *>(mxGetData(plhs[0]));
    out[0] = reinterpret_cast<uint64_t>(prhs[0]);
}
Here’s getaddr in action:
>> getaddr(A)
ans =
      139870853618400

% And using pure Matlab:
>> hex2dec('7f36388b5ae0')  % output of printaddr or format debug
ans =
      139870853618400
So now we know where to find our array in memory. With this information we can already learn a lot. To make our exploration a little cleaner though, it would be nice to know where the array ends in memory too; in other words, we would like to know the size of the mxArray.
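The two address formats used above (the hexadecimal string from printaddr and the uint64 value from getaddr) are the same number in different notations. As a rough standalone illustration outside of Mex, the following C++ sketch converts between them; the helper names address_of and parse_hex_address are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Convert a pointer to a 64-bit unsigned integer, the way getaddr reports it.
std::uint64_t address_of(const void *p) {
    return static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p));
}

// Parse a hexadecimal address string such as the one printed by printaddr.
// A leading "0x" is accepted but not required.
std::uint64_t parse_hex_address(const std::string &hex) {
    assert(!hex.empty());
    return std::stoull(hex, nullptr, 16);
}
```

With these helpers, parse_hex_address("7f36388b5ae0") yields 139870853618400, the same value that hex2dec returned above.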
Finding the structure of mxArray
The first thing to understand is that the amount of memory taken by an mxArray does not have anything to do with the dimensions of the array in Matlab. So a 1×1 Matlab array and a 100×100 Matlab array have the same size mxArray representation in memory. As you will know if you have experience programming in Mex, this is simply because the Matlab array’s data contents are not stored directly within mxArray. Instead, mxArray only stores a pointer to another memory location where the actual data reside. This is fine; the internal information we want to poke into is all still in mxArray, and it is easy to get the pointer to the array’s data contents using the API functions mxGetData or mxGetPr.
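The separation between a fixed-size header and separately stored data can be sketched in a few lines of C++. This is only an analogy: ArrayHeader and describe are hypothetical names, and the real mxArray has many more fields.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an array header: sizes plus a pointer to the data.
struct ArrayHeader {
    std::size_t rows;
    std::size_t cols;
    double *data;   // the element storage lives elsewhere
};

// Build a header that describes storage owned by the caller's vector.
ArrayHeader describe(std::vector<double> &storage, std::size_t rows, std::size_t cols) {
    assert(storage.size() == rows * cols);
    return ArrayHeader{rows, cols, storage.data()};
}
```

A header describing a 1×1 array and one describing a 100×100 array have the same sizeof, because only the pointed-to storage grows with the number of elements.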
So we are still left with trying to figure out the size of mxArray. There are a couple of paths forward. First I want to talk about a historical tool that used to make a lot of this internal information easily available. This was a function called headerdump, by Peter Boettcher (described here and here). headerdump was created for exactly the goal we are currently working towards: to understand Matlab’s copy-on-write mechanism. Unfortunately, as Matlab has evolved, newer versions have incrementally broken this useful tool. So our goal here is to create a replacement. Still, we can learn a lot from the earlier work.
One of the things that helped people figure out Matlab’s internals in the past is that in older versions of Matlab mxArray is not a completely opaque type. Even in recent versions up through at least R2010a, if you look into $MATLAB/extern/include/matrix.h you can find a definition of mxArray_tag that looks something like this:
/* R2010a */
struct mxArray_tag {
    void *reserved;
    int reserved1[2];
    void *reserved2;
    size_t number_of_dims;
    unsigned int reserved3;
    struct {
        unsigned int flag0 : 1;
        unsigned int flag1 : 1;
        unsigned int flag2 : 1;
        unsigned int flag3 : 1;
        unsigned int flag4 : 1;
        unsigned int flag5 : 1;
        unsigned int flag6 : 1;
        unsigned int flag7 : 1;
        unsigned int flag7a : 1;
        unsigned int flag8 : 1;
        unsigned int flag9 : 1;
        unsigned int flag10 : 1;
        unsigned int flag11 : 4;
        unsigned int flag12 : 8;
        unsigned int flag13 : 8;
    } flags;
    size_t reserved4[2];
    union {
        struct {
            void *pdata;
            void *pimag_data;
            void *reserved5;
            size_t reserved6[3];
        } number_array;
    } data;
};
This is what you could call murky or obfuscated, but not completely opaque. The fields mostly have unhelpful names like “reserved”, but on the other hand we at least have a sense for what fields there are and their layout.
A more informative (yet unofficial) definition was provided by James Tursa and Peter Boettcher:
#include "mex.h"

/* Definition of structure mxArray_tag for debugging purposes. Might not be fully correct
 * for Matlab 2006b or 2007a, but the important things are. Thanks to Peter Boettcher. */
struct mxArray_tag {
    const char *name;
    mxClassID class_id;
    int vartype;
    mxArray *crosslink;
    int number_of_dims;
    int refcount;
    struct {
        unsigned int scalar_flag : 1;
        unsigned int flag1 : 1;
        unsigned int flag2 : 1;
        unsigned int flag3 : 1;
        unsigned int flag4 : 1;
        unsigned int flag5 : 1;
        unsigned int flag6 : 1;
        unsigned int flag7 : 1;
        unsigned int private_data_flag : 1;
        unsigned int flag8 : 1;
        unsigned int flag9 : 1;
        unsigned int flag10 : 1;
        unsigned int flag11 : 4;
        unsigned int flag12 : 8;
        unsigned int flag13 : 8;
    } flags;
    int rowdim;
    int coldim;
    union {
        struct {
            double *pdata;       // original: void*
            double *pimag_data;  // original: void*
            void *irptr;
            void *jcptr;
            int nelements;
            int nfields;
        } number_array;
        struct {
            mxArray **pdata;
            char *field_names;
            void *dummy1;
            void *dummy2;
            int dummy3;
            int nfields;
        } struct_array;
        struct {
            void *pdata; /*mxGetInfo*/
            char *field_names;
            char *name;
            int checksum;
            int nelements;
            int reserved;
        } object_array;
    } data;
};
For comparison, here is another definition from an earlier version of Matlab.
/* R11 aka Matlab 5.3 (1999) */
struct mxArray_tag {
    char name[mxMAXNAM];
    int class_id;
    int vartype;
    mxArray *crosslink;
    int number_of_dims;
    int nelements_allocated;
    int dataflags;
    int rowdim;
    int coldim;
    union {
        struct {
            void *pdata;
            void *pimag_data;
            void *irptr;
            void *jcptr;
            int reserved;
            int nfields;
        } number_array;
    } data;
};
I took this R11 definition from the source code to headerdump (specifically, from mxinternals.h, which also has mxArray_tag definitions for R12 (Matlab 6.0) and R13 (Matlab 6.5)), and you can see that it is much more informative, because many fields have been given useful names thanks to the work of Peter Boettcher and others. Note also that the definition from this old version of Matlab is quite different from the version from R2010a.
At this point, if you are running a much earlier version of Matlab like R11 or R13, you can break off from the current article and start playing around with headerdump directly to try to understand Matlab’s internals. For more recent versions of Matlab, we have more work to do. Getting back to our original goal, if we take the mxArray_tag definition from R2010a and run sizeof on a 64-bit system, we get an answer for the amount of memory taken up by an mxArray in R2010a: 104 bytes.
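The sizeof measurement can be reproduced without Matlab installed. The sketch below copies the R2010a field layout quoted earlier into a struct with a different name (R2010aLayout, chosen here so it cannot clash with the real mxArray_tag) and reports its size and a few field offsets. The numbers in check_layout_on_64bit assume a typical 64-bit ABI with 8-byte pointers and size_t and 4-byte int; other platforms will give different values.

```cpp
#include <cassert>
#include <cstddef>

// Copy of the R2010a field layout quoted above, renamed for illustration only.
struct R2010aLayout {
    void *reserved;
    int reserved1[2];
    void *reserved2;
    std::size_t number_of_dims;
    unsigned int reserved3;
    struct {
        unsigned int flag0 : 1;
        unsigned int flag1 : 1;
        unsigned int flag2 : 1;
        unsigned int flag3 : 1;
        unsigned int flag4 : 1;
        unsigned int flag5 : 1;
        unsigned int flag6 : 1;
        unsigned int flag7 : 1;
        unsigned int flag7a : 1;
        unsigned int flag8 : 1;
        unsigned int flag9 : 1;
        unsigned int flag10 : 1;
        unsigned int flag11 : 4;
        unsigned int flag12 : 8;
        unsigned int flag13 : 8;
    } flags;
    std::size_t reserved4[2];
    union {
        struct {
            void *pdata;
            void *pimag_data;
            void *reserved5;
            std::size_t reserved6[3];
        } number_array;
    } data;
};

std::size_t header_size()  { return sizeof(R2010aLayout); }
std::size_t ndims_offset() { return offsetof(R2010aLayout, number_of_dims); }
std::size_t dims_offset()  { return offsetof(R2010aLayout, reserved4); }
std::size_t data_offset()  { return offsetof(R2010aLayout, data); }

// Checks the expected numbers, but only on a typical 64-bit ABI.
void check_layout_on_64bit() {
    if (sizeof(void *) != 8 || sizeof(std::size_t) != 8 || sizeof(int) != 4)
        return;   // other ABIs lay the struct out differently
    assert(header_size() == 104);
    assert(ndims_offset() == 24);
    assert(dims_offset() == 40);
    assert(data_offset() == 56);
}
```

The offsets 24, 40 and 56 are worth keeping in mind: they are where the number of dimensions, the two size fields and the data pointer sit in a 104-byte header laid out this way.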
Determining the size of mxArray
It was nice to derive the size of mxArray from actual MathWorks code, but unfortunately this information is no longer available as of R2011a. Somewhere between R2010a and R2011a, MathWorks stepped up their efforts to make mxArray completely opaque. So we should find another way to get the size of mxArray for current and future Matlab versions.
One ugly trick that works is to create many new arrays quickly and see where their starting points end up in memory:
>> A = num2cell(1:100)';
>> addrs = sort(cellfun(@getaddr, A));
What we did here is create 100 new arrays, and then get all their memory addresses in sorted order. Now we can take a look at how far apart these new arrays ended up in memory:
>> semilogy(diff(addrs));
The resulting plot will look different each time you run this; it is not really predictable where Matlab will put new arrays into memory. Here is an example from my system:
Your results may look different, and you might have to increase the number of new arrays from 100 to 1000 to get the same qualitative result, but the important feature of this plot is that there is a minimum distance between new arrays of about 10^2 bytes. In fact, if we just go straight for this minimum distance:
>> min(diff(addrs))
ans =
   104
we find that although mxArray has gone completely opaque from R2010a to R2011a, the full size of mxArray in memory has stayed the same: 104 bytes.
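The spacing trick is just min(diff(sort(addrs))). For readers who want to apply the same measurement to addresses collected some other way, here is a small C++ sketch; min_spacing is a name chosen for this example. Strictly speaking, the smallest gap is an upper bound on the header size, since an allocator may pad each block.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Smallest gap between any two addresses: min(diff(sort(addrs))) in Matlab terms.
std::uint64_t min_spacing(std::vector<std::uint64_t> addrs) {
    assert(addrs.size() >= 2);
    std::sort(addrs.begin(), addrs.end());
    std::uint64_t best = addrs[1] - addrs[0];
    for (std::size_t i = 2; i < addrs.size(); ++i)
        best = std::min(best, addrs[i] - addrs[i - 1]);
    return best;
}
```

Feeding it a handful of made-up addresses that contain a few back-to-back 104-byte blocks returns 104, regardless of how far apart the remaining addresses are.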
Dumping mxArray from memory
We now have all the information we need to start looking into Matlab’s array representation. There are many tools available that allow you to browse memory locations or dump memory contents to disk. For our purposes though, it is nice to be able to do everything from within Matlab. Therefore I introduce a simple tool that prints memory locations into the Matlab console:
/* printmem.cpp */
#include <stdint.h>
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs < 1 || !mxIsUint64(prhs[0]) || mxIsEmpty(prhs[0]))
        mexErrMsgTxt("First argument must be a uint64 memory address.");
    /* Fixed 64-bit type: unsigned long is only 32 bits on 64-bit Windows. */
    uint64_t *addr = static_cast<uint64_t *>(mxGetData(prhs[0]));
    unsigned char *mem = reinterpret_cast<unsigned char *>(addr[0]);

    if (nrhs < 2 || !mxIsDouble(prhs[1]) || mxIsEmpty(prhs[1]))
        mexErrMsgTxt("Second argument must be a double-type integer byte size.");
    unsigned int nbytes = static_cast<unsigned int>(mxGetScalar(prhs[1]));

    /* Print the bytes in hexadecimal, 16 per row. */
    for (unsigned int i = 0; i < nbytes; i++) {
        mexPrintf("%.2x ", mem[i]);
        if ((i + 1) % 16 == 0) mexPrintf("\n");
    }
    mexPrintf("\n");
}
Here is how you use it in Matlab:
>> A = 0;
>> printmem(getaddr(A), 104)
00 00 00 00 00 00 00 00 06 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00
00 00 00 00 01 02 00 00 01 00 00 00 00 00 00 00
01 00 00 00 00 00 00 00 70 fa 33 df 6f 7f 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
And there you have it: the inner guts of mxArray laid bare. I have printed each byte as a two character hexadecimal value, as is standard, so there are 16 bytes printed per row.
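The same formatting can be tried on ordinary memory in a standalone C++ program, which is a safer way to experiment than pointing printmem at arbitrary addresses. The following is a sketch, not part of the Mex tools above; hex_dump is a hypothetical helper that returns the text instead of printing it, and it ends every row (including a partial last row) with a newline rather than a trailing space.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Format nbytes starting at `start` as two-digit hex values, 16 per row.
std::string hex_dump(const void *start, std::size_t nbytes) {
    assert(start != nullptr || nbytes == 0);
    const unsigned char *mem = static_cast<const unsigned char *>(start);
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < nbytes; ++i) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned int>(mem[i]));
        out += buf;
        // End the row after 16 bytes or after the final byte; otherwise add a space.
        out += ((i + 1) % 16 == 0 || i + 1 == nbytes) ? '\n' : ' ';
    }
    return out;
}
```

Only memory that the program itself owns should be passed in; reading past the end of an object is undefined behaviour, just as reading a bad address from printmem can crash Matlab.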
What does it mean?
So now we have 104 bytes of Matlab internals to dig into. We can start playing with this with a few simple examples:
>> A = 0; B = 1;
>> printmem(getaddr(A), 104)
00 00 00 00 00 00 00 00 06 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00
00 00 00 00 01 02 00 00 01 00 00 00 00 00 00 00
01 00 00 00 00 00 00 00 c0 b0 27 df 6f 7f 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
>> printmem(getaddr(B), 104)
00 00 00 00 00 00 00 00 06 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00
00 00 00 00 01 02 00 00 01 00 00 00 00 00 00 00
01 00 00 00 00 00 00 00 70 fa 33 df 6f 7f 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
In this and subsequent examples, I will point out bytes that differ or are otherwise of interest. What we can see from this example is that although arrays A and B have different content, almost nothing is different between their mxArray representations. What is different is the memory address stored in bytes 56 to 63 (the second half of the fourth row). This confirms our earlier assertion that mxArray does not store the array contents, but only a pointer to the content location.
Now let us try to figure out some of the other fields:
>> A = 1:3; B = 1:10; C = (1:10)';
>> printmem(getaddr(A), 64)
00 00 00 00 00 00 00 00 06 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00
00 00 00 00 00 02 00 00 01 00 00 00 00 00 00 00
03 00 00 00 00 00 00 00 60 80 22 df 6f 7f 00 00
>> printmem(getaddr(B), 64)
00 00 00 00 00 00 00 00 06 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00
00 00 00 00 00 02 00 00 01 00 00 00 00 00 00 00
0a 00 00 00 00 00 00 00 80 83 29 df 6f 7f 00 00
>> printmem(getaddr(C), 64)
00 00 00 00 00 00 00 00 06 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00
00 00 00 00 00 02 00 00 0a 00 00 00 00 00 00 00
01 00 00 00 00 00 00 00 80 83 29 df 6f 7f 00 00
(Note that this time I only printed the first four lines of each array as this is where the interesting differences are for this example.)
The bytes starting at offsets 40 and 48 (the second half of the third row and the first half of the fourth) give each array’s number of rows and columns (note that hexadecimal 0a is 10 in decimal). The value “02” appears at offsets 24 and 37, either of which could be the location for storing the number of dimensions. Let us look into this more:
>> A = rand([3 3 3]);
>> printmem(getaddr(A), 64)
00 00 00 00 00 00 00 00 06 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00
00 00 00 00 00 02 00 00 30 4a 3f df 6f 7f 00 00
09 00 00 00 00 00 00 00 b0 d3 24 df 6f 7f 00 00
Two interesting results here: the value at offset 24 changed from 02 to 03, so this must be the place where mxArray indicates a 3-dimensional array rather than 2D. Another important thing also changed though: at offset 40 there is now a memory address stored where we used to find the number of rows. And at offset 48 we now have the number 09 instead of the number of columns.
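Clearly, Matlab has a different way of representing a 2D matrix versus arrays of higher dimension such as 3D. In the 2D case, mxArray simply holds the nrows and ncols directly, but for a higher dimension case it holds the number of dimensions (03), the value 09, and a pointer to another memory location (0x7f6fdf3f4a30) which holds the array of sizes for each dimension. Note that 09 is not the total number of elements, which is 27 for a 3×3×3 array; it matches the product of the second and third dimensions (3×3), which is also what mxGetN is documented to return for arrays with more than two dimensions.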
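Reading these fields by eye gets tedious, so here is a small C++ sketch that decodes them from a dump. It hard-codes the 64 bytes printed above for rand([3 3 3]) and reads 64-bit little-endian values at offsets 24, 40 and 48. The helper name read_le64 and the constant kDump3D are made up for this example, and the offsets are simply the ones observed in this article for this Matlab version and platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Read an unsigned 64-bit little-endian value starting at byte `offset`.
std::uint64_t read_le64(const unsigned char *bytes, std::size_t offset) {
    assert(bytes != nullptr);
    std::uint64_t value = 0;
    for (int i = 7; i >= 0; --i)   // most significant byte is stored last
        value = (value << 8) | bytes[offset + i];
    return value;
}

// The 64 bytes printed above for A = rand([3 3 3]).
const unsigned char kDump3D[64] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x30, 0x4a, 0x3f, 0xdf, 0x6f, 0x7f, 0x00, 0x00,
    0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xb0, 0xd3, 0x24, 0xdf, 0x6f, 0x7f, 0x00, 0x00
};
```

Calling read_le64(kDump3D, 24) gives 3, read_le64(kDump3D, 40) gives 0x7f6fdf3f4a30 and read_le64(kDump3D, 48) gives 9, matching the values read off by hand.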
The copy-on-write mechanism
Finally, we are in a position to understand how Matlab internally implements copy-on-write:
>> A = 1:10;
>> printmem(getaddr(A), 64);
00 00 00 00 00 00 00 00 06 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00
00 00 00 00 00 02 00 00 01 00 00 00 00 00 00 00
0a 00 00 00 00 00 00 00 90 f3 24 df 6f 7f 00 00
>> B = A;
>> printaddr(B);
0x7f6f4c7b6810
>> printmem(getaddr(A), 64);
10 68 7b 4c 6f 7f 00 00 06 00 00 00 00 00 00 00
10 68 7b 4c 6f 7f 00 00 02 00 00 00 00 00 00 00
00 00 00 00 00 02 00 00 01 00 00 00 00 00 00 00
0a 00 00 00 00 00 00 00 90 f3 24 df 6f 7f 00 00
What we see is that by setting B = A, we change the internal representation of A itself. Two new memory address pointers are added to the mxArray for A. As it turns out, both of these point to the address for array B, which makes sense; this is how Matlab internally keeps track of arrays that are copies of each other. Note that because byte order is little-endian, the memory addresses from printmem are byte-wise, i.e. every two characters, reversed relative to the address from printaddr.
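The byte reversal is easy to check mechanically. The C++ sketch below splits an address into the byte order a little-endian dump shows and compares it against eight bytes from a dump; to_little_endian and points_to are names invented for this example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Split a 64-bit address into the byte order that a little-endian dump shows.
std::array<unsigned char, 8> to_little_endian(std::uint64_t address) {
    std::array<unsigned char, 8> bytes{};
    for (std::size_t i = 0; i < bytes.size(); ++i)
        bytes[i] = static_cast<unsigned char>((address >> (8 * i)) & 0xff);
    return bytes;
}

// True if the eight bytes at `dump` are the little-endian form of `address`.
bool points_to(const unsigned char *dump, std::uint64_t address) {
    assert(dump != nullptr);
    const std::array<unsigned char, 8> expected = to_little_endian(address);
    return std::equal(expected.begin(), expected.end(), dump);
}
```

For instance, the first eight bytes of A after B = A (10 68 7b 4c 6f 7f 00 00) match the address 0x7f6f4c7b6810 that printaddr reported for B.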
We can also look into array B:
>> printmem(getaddr(B), 64);
f0 41 7a 4c 6f 7f 00 00 06 00 00 00 00 00 00 00
f0 41 7a 4c 6f 7f 00 00 02 00 00 00 00 00 00 00
00 00 00 00 00 02 00 00 01 00 00 00 00 00 00 00
0a 00 00 00 00 00 00 00 90 f3 24 df 6f 7f 00 00
>> printaddr(A);
0x7f6f4c7a41f0
There are two interesting points here. First, the pointers at offsets 0 and 16 show that array B has pointers back to array A. Second, the data pointer at offset 56 shows that the Matlab data for array B actually just points back to the same memory as the data for array A (the values 1:10).
Finally, we would like to understand why there are two pointers added. Let us see what happens if we add a third linked variable:
>> C = B;
>> printaddr(A); printaddr(B); printaddr(C);
0x7f6f4c7a41f0
0x7f6f4c7b6810
0x7f6f4c7b69b0
>> printmem(getaddr(A), 32)
b0 69 7b 4c 6f 7f 00 00 06 00 00 00 00 00 00 00
10 68 7b 4c 6f 7f 00 00 02 00 00 00 00 00 00 00
>> printmem(getaddr(B), 32)
f0 41 7a 4c 6f 7f 00 00 06 00 00 00 00 00 00 00
b0 69 7b 4c 6f 7f 00 00 02 00 00 00 00 00 00 00
>> printmem(getaddr(C), 32)
10 68 7b 4c 6f 7f 00 00 06 00 00 00 00 00 00 00
f0 41 7a 4c 6f 7f 00 00 02 00 00 00 00 00 00 00
So it turns out that Matlab keeps track of a set of linked variables with a kind of cyclical, doubly-linked list structure; array A is linked to B in the forward direction and is also linked to C in the reverse direction by looping back around, etc. The cyclical nature of this makes sense, as we need to be able to start from any of A, B, or C and find all the linked arrays. But it is still not entirely clear why the list needs to be cyclical AND linked in both directions. In fact, in earlier versions of Matlab this cyclical list was only singly-linked.
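The link pattern seen in these dumps can be reproduced with a toy model. The C++ sketch below is an assumption-laden illustration, not Matlab's actual code: SharedNode stands in for the two pointer fields at offsets 0 and 16, and link_copy models an assignment such as B = A by splicing the new node into the ring right after the source, which is the one insertion rule that happens to reproduce the pointers printed above.

```cpp
#include <cassert>

// Hypothetical stand-in for the two link fields seen in the dumps.
struct SharedNode {
    SharedNode *prev;   // reverse link (offset 0 in the dumps)
    SharedNode *next;   // forward link (offset 16 in the dumps)
    SharedNode() : prev(nullptr), next(nullptr) {}
};

// Model "copy = source": splice `copy` into the ring right after `source`.
void link_copy(SharedNode &source, SharedNode &copy) {
    assert(copy.prev == nullptr && copy.next == nullptr);
    if (source.next == nullptr) {
        // Source was unshared (null links): treat it as a ring of one.
        source.prev = source.next = &source;
    }
    copy.prev = &source;
    copy.next = source.next;
    source.next->prev = &copy;
    source.next = &copy;
}

// Count the members of a ring by following forward links from any member.
int ring_size(const SharedNode &start) {
    if (start.next == nullptr) return 1;   // unshared node
    int n = 1;
    for (const SharedNode *p = start.next; p != &start; p = p->next) ++n;
    return n;
}
```

Running link_copy(A, B) followed by link_copy(B, C) leaves A's forward link at B and its reverse link at C, B's links at C and A, and C's links at A and B, the same arrangement as in the three 32-byte dumps.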
Conclusions
Obviously there is a lot more to mxArray and Matlab internals than what we have delved into here. Still, with this basic introduction I hope to have whetted your appetite for understanding more about Matlab internals, and provided some simple tools to help you explore. I want to reiterate that in general MathWorks’s approach of an opaque mxArray type with access abstracted through an API layer is a good policy. The last thing you would want to do is take the information here and write a bunch of code that relies on the structure of mxArray to work; next time MathWorks needs to add a new feature and change mxArray, all your code will break. So in general we are all better off playing within the API that MathWorks provides. And remember: poking into memory can crash your computer, so save your data!
On the other hand, occasionally there are cases, like in-place editing, where it is useful to push the capabilities of Matlab a little beyond what MathWorks envisioned. In these cases, having an understanding of Matlab’s internals can be critical, for example in understanding how to avoid conflicting with copy-on-write. Therefore I hope the information presented here will prove useful. Ideally, someone will be motivated to take this starting point and repair some of the tools like headerdump that made Matlab’s internal workings more transparent in the past. I believe that having more of this information out in the Matlab community is good for the community as a whole.
Hi Peter,
Thanks for this nifty bit of detective work. You wondered why the cycle of identical arrays is stored as a doubly linked list. My guess is that it is probably done for time-efficiency. Suppose for some reason you create a million copies of a single array, and then modify a single one. In order to re-close the loop, you would have to iterate over the entire cycle to get both loose ends. With a doubly-linked list this can be done in constant time.
Cheers,
Laurens
Thanks Laurens. Yes what you say makes sense. For some reason I was thinking about this in terms of traversing the cycle, but of course when we modify one element we must delete it from the cycle and that is the operation where the double linking will save you time (if your cycle is irrationally large :)).
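The point made in this exchange can be sketched in C++. With both links available, a node can be removed from the ring by touching only its two neighbours, no matter how many copies exist. This is a toy model with made-up names (RingNode, make_ring, unlink); clearing the links of a node that ends up alone is an assumption based on the zeroed link fields seen for unshared arrays in the dumps above.

```cpp
#include <cassert>

struct RingNode {
    RingNode *prev;
    RingNode *next;
};

// Link `count` nodes of an array into one circular, doubly-linked ring.
void make_ring(RingNode *nodes, int count) {
    assert(nodes != nullptr && count >= 2);
    for (int i = 0; i < count; ++i) {
        nodes[i].next = &nodes[(i + 1) % count];
        nodes[i].prev = &nodes[(i + count - 1) % count];
    }
}

// Remove `node` from its ring in constant time: only its neighbours change.
void unlink(RingNode &node) {
    assert(node.prev != nullptr && node.next != nullptr);
    RingNode *before = node.prev;
    RingNode *after = node.next;
    before->next = after;
    after->prev = before;
    node.prev = node.next = nullptr;
    // If a single node remains, it is no longer shared: clear its links too.
    if (before == after)
        before->prev = before->next = nullptr;
}
```

With only forward links, finding `before` would require walking the whole ring from `node` until arriving back at it, which is the linear-time cost described above.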
In fact I have something like headerdump which works well with recent versions of MATLAB. I created it in order to reverse-engineer the internal representation of mxArray. My original motivation for doing this was to be able to track MATLAB’s memory allocation for big arrays because I’m on a NUMA system and I wanted to be able to check/enforce that often used variables reside on node-local memory.
For example, MATLAB distinguishes between normal variables (as shown in this post), temporary variables (when you pass immediate values to a function), global/persistent variables, and variables embedded in cell arrays and structs. Concerning the latter, given this post you could assume that
creates 100 doubly-linked instances of mxArray (sharing the same data — a 200×200 matrix) for the entries of x, but in fact there will be only one mxArray with a reference count set to 100.
If there’s sufficient interest I could imagine sharing my code (after some cleanup). Maybe I could also write a blog post similar to this one, but digging deeper into the internals (as far as I managed to understand them).
Sounds very interesting Martin. Indeed, the “reference” field is another particularly important subject that I believe was covered in part for older Matlab versions in a post by Benjamin Schubert: http://www.mk.tu-berlin.de/Members/Benjamin/mex_sharedArrays
I’m not aware of any more up-to-date write-up of this however. I would be particularly interested in better understanding how this interacts with mxUnshareArray and other copy-on-write mechanisms, some of which Benjamin’s article goes into.
If my goal here was successful, perhaps you will find some of the simple tools here useful either in your investigations or else just in demonstrating your findings succinctly.
I stumbled upon this blog when trying to find out how property ‘set’ methods are impacted if the property is an array. For example:
‘set’ is called for the indexed assignment, with an argument ‘s’ which is a ‘copy’ of This.array with exactly one element that is different. I was wondering whether Matlab could implement this sort of thing with a basic Copy on Write without causing some terrible inefficiencies. But reading the above I see that a more sophisticated COW could work reasonably.
Thanks.
Dear Mr. Altman,
First of all, I wish to congratulate you for the excellent work done with this “Undocumented Matlab” portal. There are lots of things actually undocumented in Matlab that need to be clarified. Currently, I am an engineering student working with coders (H.264-HDTV) and therefore working at the bit level with Matlab. Bit-level operations, and especially algorithms that have to consider byte alignment, are quite difficult to speed up. While reading this article I was encouraged to consider “breaking” Matlab’s FILE structure. Matlab has an option in the fread function for bit-level reading called “ubit”. This feature is not seen in C, C++, or Octave. I wanted to ask everyone involved in this blog whether you believe that there is a special pointer “hidden” in Matlab’s “fread” structure in order to save the bit position. This is a topic that I have been trying to figure out over the last few months and on which I was not able to find information. If you know something about this, I would appreciate your help.
Once more, congratulations for the great work done.
Best regards,
Christian Di
[…] it is possible to see high level C++ classes MathWorks developers use for work. Peter Li has posted an article about mxArray_tag‘s evolution and internal structure in 2012. […]
Starting the reversing work from the address returned by mxGetData caused some of the interesting mxArray fields to be missed (eg classID). I recently uploaded a similar work of mine to github, hopefully it adds value to anyone trying to watch mxArray’s (at least from visual studio).
Finding the size of the mxArray header is much simpler than what is shown here: