<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>JIT &#8211; Undocumented Matlab</title>
	<atom:link href="https://undocumentedmatlab.com/articles/tag/jit/feed" rel="self" type="application/rss+xml" />
	<link>https://undocumentedmatlab.com</link>
	<description>Professional Matlab consulting, development and training</description>
	<lastBuildDate>Wed, 09 Sep 2015 23:36:25 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.7.2</generator>
	<item>
		<title>Callback functions performance</title>
		<link>https://undocumentedmatlab.com/articles/callback-functions-performance?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=callback-functions-performance</link>
					<comments>https://undocumentedmatlab.com/articles/callback-functions-performance#comments</comments>
		
		<dc:creator><![CDATA[Yair Altman]]></dc:creator>
		<pubDate>Wed, 09 Sep 2015 23:36:25 +0000</pubDate>
				<category><![CDATA[GUI]]></category>
		<category><![CDATA[Handle graphics]]></category>
		<category><![CDATA[Low risk of breaking in future versions]]></category>
		<category><![CDATA[Callbacks]]></category>
		<category><![CDATA[JIT]]></category>
		<category><![CDATA[LXE]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Pure Matlab]]></category>
		<guid isPermaLink="false">http://undocumentedmatlab.com/?p=5996</guid>

					<description><![CDATA[<p>Using anonymous functions in Matlab callbacks can be very painful for performance. Today's article explains how this can be avoided. </p>
<p>The post <a rel="nofollow" href="https://undocumentedmatlab.com/articles/callback-functions-performance">Callback functions performance</a> appeared first on <a rel="nofollow" href="https://undocumentedmatlab.com">Undocumented Matlab</a>.</p>
<div class='yarpp-related-rss'>
<h3>Related posts:</h3><ol>
<li><a href="https://undocumentedmatlab.com/articles/matlab-java-memory-leaks-performance" rel="bookmark" title="Matlab-Java memory leaks, performance">Matlab-Java memory leaks, performance </a> <small>Internal fields of Java objects may leak memory - this article explains how to avoid this without sacrificing performance. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/controlling-callback-re-entrancy" rel="bookmark" title="Controlling callback re-entrancy">Controlling callback re-entrancy </a> <small>Callback reentrancy is a major problem for frequently-fired events. Luckily, it can easily be solved....</small></li>
<li><a href="https://undocumentedmatlab.com/articles/continuous-slider-callback" rel="bookmark" title="Continuous slider callback">Continuous slider callback </a> <small>Matlab slider uicontrols do not enable a continuous-motion callback by default. This article explains how this can be achieved using undocumented features....</small></li>
<li><a href="https://undocumentedmatlab.com/articles/undocumented-mouse-pointer-functions" rel="bookmark" title="Undocumented mouse pointer functions">Undocumented mouse pointer functions </a> <small>Matlab contains several well-documented functions and properties for the mouse pointer. However, some very-useful functions have remained undocumented and unsupported. This post details their usage....</small></li>
</ol>
</div>
]]></description>
										<content:encoded><![CDATA[<p>Matlab enables a variety of ways to define callbacks for asynchronous events (such as interactive GUI actions or timer invocations). We can provide a function handle, a cell-array (of a function handle and extra parameters), and in some cases also a string that will be <i><b>eval</b></i>&#8217;ed at run-time. For example:</p>
<pre lang='matlab'>
hButton = uicontrol(..., 'Callback', @myCallbackFunc);  % function handle
hButton = uicontrol(..., 'Callback', {@myCallbackFunc,data1,data2});  % cell-array
hButton = uicontrol(..., 'Callback', 'disp clicked!');  % string to eval
</pre>
<p>The first format, a function handle, is by far the most common in Matlab code. This format has two variants: we can specify a direct handle to the function (as in <code>@myCallbackFunc</code>), or we can use an anonymous function, like this:</p>
<pre lang='matlab'>hButton = uicontrol(..., 'Callback', @(h,e) myCallbackFunc(h,e));  % anonymous function handle</pre>
<p>All Matlab callbacks accept two input args by default: the control&#8217;s handle (hButton in this example), and a struct or object that contains the event&#8217;s data in internal fields. In our anonymous-function variant, we therefore defined a function that accepts two input args (h,e) and calls myCallbackFunc(h,e).<br />
These two variants are functionally equivalent:</p>
<pre lang='matlab'>
hButton = uicontrol(..., 'Callback', @myCallbackFunc);             % direct function handle
hButton = uicontrol(..., 'Callback', @(h,e) myCallbackFunc(h,e));  % anonymous function handle
</pre>
<p>In my experience, the anonymous-function variant is widely used &#8211; I see it extensively in many of my consulting clients&#8217; code. Unfortunately, this variant can carry a huge performance penalty compared to a direct function handle, of which many people are simply unaware. I believe that even many MathWorkers are not aware of this, judging from a recent conversation I had with someone in the know, as well as from the numerous usage examples in internal Matlab code: see the screenshot below for some examples; there are numerous others scattered throughout the Matlab code corpus.<span id="more-5996"></span><br />
Part of the reason for this penalty not being well known may be that Matlab&#8217;s Profiler does not directly attribute the overheads. Here is a typical screenshot:<br />
<center><figure style="width: 395px" class="wp-caption aligncenter"><img fetchpriority="high" decoding="async" alt="Profiling anonymous callback function performance" src="https://undocumentedmatlab.com/images/anon_func_performance.jpg" title="Profiling anonymous callback function performance" width="395" height="437" /><figcaption class="wp-caption-text">Profiling anonymous callback function performance</figcaption></figure></center><br />
In this example, a heavily-laden GUI figure window was closed, triggering multiple cleanup callbacks, most of them belonging to internal Matlab code. Closing the figure took a whopping 8 secs. As can be seen from the screenshot, the callbacks themselves only take ~0.66 secs, and an additional 7.4 secs (92% of the total) is unattributed to any specific line. Think about it for a moment: we can only really see what&#8217;s happening in 8% of the time &#8211; the Profiler provides no clue about the root cause of the remaining 92%.<br />
The solution in this case was to notice that the callback was defined using an anonymous function, <code>@(h,e)obj.tableDeletedCallbackFcn(e)</code>. Changing all such instances to <code>@obj.tableDeletedCallbackFcn</code> (the function interface naturally needed to change to accept h as the first input arg) drastically cut the processing time, since direct function handles do not carry the same performance overheads as anonymous functions. In this specific example, closing the figure window became almost instantaneous (&lt;1 sec).</p>
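<p>The overhead difference is easy to reproduce with a minimal benchmark sketch (absolute timings are of course machine- and release-dependent):</p>
<pre lang='matlab'>
f1 = @sin;         % direct function handle
f2 = @(x) sin(x);  % anonymous wrapper around the same function
tic, for k = 1:1e5, f1(0.5); end, toc  % direct handle
tic, for k = 1:1e5, f2(0.5); end, toc  % anonymous handle - noticeably slower
</pre>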


<h3 id="conclusions">Conclusions</h3>
<p>There are several morals that I think can be gained from this:</p>
<ol>
<li>When we see unattributed time in the Profiler summary report, odds are high that this is due to function-call overheads. MathWorks significantly reduced such overheads in the new R2015b (released last week), but anonymous functions (and to some degree also class methods) still carry non-negligible invocation overhead that should be avoided where possible, by using direct (possibly non-MCOS) functions.</li>
<li>Use direct function handles rather than anonymous function handles, wherever possible.</li>
<li>In the future, MathWorks will hopefully improve Matlab&#8217;s new engine (&#8220;LXE&#8221;) to automatically identify cases of <code>@(h,e)func(h,e)</code> and replace them with faster calls to <code>@func</code>, but in any case it would be wise to manually make this change in our code today. It would immediately improve readability, maintainability and performance, while still being entirely future-compatible.</li>
<li>In the future, MathWorks may also reduce the overheads of anonymous function invocations in general. This is trickier than the straightforward lexical substitution above, because anonymous functions need to carry their creation-time workspace with them. That workspace-capture feature is little known and very little used in practice, which means that most usage patterns of anonymous functions could be statically analyzed and converted into much faster direct function handles that carry no run-time workspace info. This is indeed tricky, but it could directly improve the performance of many Matlab programs that naively use anonymous functions.</li>
<li>Matlab&#8217;s Profiler should really be improved to provide more information about unattributed time spent in internal Matlab code, to provide users clues that would help them reduce it. Some information could be gained by using the <a target="_blank" href="/articles/undocumented-profiler-options-part-4">Profiler&#8217;s <code>-detail builtin</code> input args</a> (which was documented until several releases ago, but then apparently became unsupported). I think that the Profiler should still be made to provide better insights in such cases.</li>
</ol>
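<p>The workspace-capture behavior mentioned above can be seen in a minimal example: an anonymous function snapshots the variables it uses at creation time, so later changes to those variables have no effect on it:</p>
<pre lang='matlab'>
a = 5;
f = @(x) a*x;  % f captures the current value of a (5) in its workspace
a = 100;       % modifying a later does not affect f
f(2)           % returns 10, not 200
</pre>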
<p>Oh, and did I already mention the nice work MathWorks did with 15b&#8217;s LXE? Matlab&#8217;s JIT replacement was many years in the making, possibly since the mid-2000s. We currently see just the tip of the iceberg of this new engine; I hope that additional benefits will become apparent in future releases.<br />
For a definitive benchmark of Matlab&#8217;s function-call overheads in various variants, readers are referred to <a target="_blank" rel="nofollow" href="https://github.com/apjanke/matlab-bench">Andrew Janke&#8217;s excellent utility</a> (with some pre-15b <a target="_blank" rel="nofollow" href="http://stackoverflow.com/questions/1693429/is-matlab-oop-slow-or-am-i-doing-something-wrong#1745686">usage results and analysis</a>). Running this benchmark on my machine shows significant overhead reduction in function-call overheads in 15b in many (but not all) invocation types.<br />
For those people wondering, 15b&#8217;s LXE does improve HG2&#8217;s performance, but just by a small bit &#8211; still not enough to offset the large performance hit of HG2 vs. HG1 in several key aspects. MathWorks is <a target="_blank" rel="nofollow" href="https://www.mathworks.com/matlabcentral/newsreader/view_thread/337755#937008">actively working</a> to improve HG2&#8217;s performance, but unfortunately there is still no breakthrough as of 15b.<br />
Additional details on various performance issues related to Matlab function calls (and graphics and anything else in Matlab) can be found in my recent book, <a target="_blank" href="/books/matlab-performance"><b>Accelerating MATLAB Performance</b></a>.</p>
<p>The post <a rel="nofollow" href="https://undocumentedmatlab.com/articles/callback-functions-performance">Callback functions performance</a> appeared first on <a rel="nofollow" href="https://undocumentedmatlab.com">Undocumented Matlab</a>.</p>
<div class='yarpp-related-rss'>
<h3>Related posts:</h3><ol>
<li><a href="https://undocumentedmatlab.com/articles/matlab-java-memory-leaks-performance" rel="bookmark" title="Matlab-Java memory leaks, performance">Matlab-Java memory leaks, performance </a> <small>Internal fields of Java objects may leak memory - this article explains how to avoid this without sacrificing performance. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/controlling-callback-re-entrancy" rel="bookmark" title="Controlling callback re-entrancy">Controlling callback re-entrancy </a> <small>Callback reentrancy is a major problem for frequently-fired events. Luckily, it can easily be solved....</small></li>
<li><a href="https://undocumentedmatlab.com/articles/continuous-slider-callback" rel="bookmark" title="Continuous slider callback">Continuous slider callback </a> <small>Matlab slider uicontrols do not enable a continuous-motion callback by default. This article explains how this can be achieved using undocumented features....</small></li>
<li><a href="https://undocumentedmatlab.com/articles/undocumented-mouse-pointer-functions" rel="bookmark" title="Undocumented mouse pointer functions">Undocumented mouse pointer functions </a> <small>Matlab contains several well-documented functions and properties for the mouse pointer. However, some very-useful functions have remained undocumented and unsupported. This post details their usage....</small></li>
</ol>
</div>
]]></content:encoded>
					
					<wfw:commentRss>https://undocumentedmatlab.com/articles/callback-functions-performance/feed</wfw:commentRss>
			<slash:comments>12</slash:comments>
		
		
			</item>
		<item>
		<title>Internal Matlab memory optimizations</title>
		<link>https://undocumentedmatlab.com/articles/internal-matlab-memory-optimizations?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=internal-matlab-memory-optimizations</link>
					<comments>https://undocumentedmatlab.com/articles/internal-matlab-memory-optimizations#comments</comments>
		
		<dc:creator><![CDATA[Yair Altman]]></dc:creator>
		<pubDate>Wed, 30 May 2012 12:09:16 +0000</pubDate>
				<category><![CDATA[Low risk of breaking in future versions]]></category>
		<category><![CDATA[Memory]]></category>
		<category><![CDATA[Stock Matlab function]]></category>
		<category><![CDATA[JIT]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Pure Matlab]]></category>
		<guid isPermaLink="false">http://undocumentedmatlab.com/?p=2952</guid>

					<description><![CDATA[<p>Copy-on-write and in-place data manipulations are very useful Matlab performance improvement techniques. </p>
<p>The post <a rel="nofollow" href="https://undocumentedmatlab.com/articles/internal-matlab-memory-optimizations">Internal Matlab memory optimizations</a> appeared first on <a rel="nofollow" href="https://undocumentedmatlab.com">Undocumented Matlab</a>.</p>
<div class='yarpp-related-rss'>
<h3>Related posts:</h3><ol>
<li><a href="https://undocumentedmatlab.com/articles/matlabs-internal-memory-representation" rel="bookmark" title="Matlab&#039;s internal memory representation">Matlab&#039;s internal memory representation </a> <small>Matlab's internal memory structure is explored and discussed. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/profiling-matlab-memory-usage" rel="bookmark" title="Profiling Matlab memory usage">Profiling Matlab memory usage </a> <small>mtic and mtoc were a couple of undocumented features that enabled users of past Matlab releases to easily profile memory usage. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/matlab-java-memory-leaks-performance" rel="bookmark" title="Matlab-Java memory leaks, performance">Matlab-Java memory leaks, performance </a> <small>Internal fields of Java objects may leak memory - this article explains how to avoid this without sacrificing performance. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/couple-of-bugs-and-workarounds" rel="bookmark" title="A couple of internal Matlab bugs and workarounds">A couple of internal Matlab bugs and workarounds </a> <small>A couple of undocumented Matlab bugs have simple workarounds. ...</small></li>
</ol>
</div>
]]></description>
										<content:encoded><![CDATA[<p>Yesterday I attended a seminar on developing trading strategies using Matlab. This is of interest to me because of my <a target="_blank" href="/ib-matlab/">IB-Matlab</a> product, and since many of my clients are traders in the financial sector. In the seminar, the issue of memory and performance naturally arose. It seemed to me that there was some confusion with regards to Matlab&#8217;s built-in memory optimizations. Since I discussed related topics in the past two weeks (<a target="_blank" href="/articles/preallocation-performance/">preallocation performance</a>, <a target="_blank" href="/articles/array-resizing-performance/">array resizing performance</a>), these internal optimizations seemed a natural topic for today&#8217;s article.<br />
The specific mechanisms I&#8217;ll describe today are <i><b>Copy on Write</b></i> (aka <i>COW</i> or <i>Lazy Copying</i>) and <i><b>in-place data manipulations</b></i>. Both mechanisms were already documented (for example, on <a target="_blank" rel="nofollow" href="http://blogs.mathworks.com/loren/2006/05/10/memory-management-for-functions-and-variables/">Loren&#8217;s blog</a> or on <a target="_blank" href="/articles/matlab-mex-in-place-editing/#COW">this blog</a>). But apparently, they are still not well known. Understanding them could help Matlab users modify their code to improve performance and reduce memory consumption. So although this article is not entirely &#8220;undocumented&#8221;, I&#8217;ll give myself some slack today.</p>
<h3 id="COW">Copy on Write (COW, Lazy Copy)</h3>
<p>Matlab implements an automatic <a target="_blank" rel="nofollow" href="http://blogs.mathworks.com/loren/2006/05/10/memory-management-for-functions-and-variables/">copy-on-write</a> (sometimes called <i>copy-on-update</i> or <i>lazy copying</i>) mechanism, which transparently allocates a temporary copy of the data only when it sees that the input data is modified. This improves run-time performance by delaying actual memory block allocation until absolutely necessary. COW has two variants: during regular variable copy operations, and when passing data as input parameters into a function:</p>
<h4 id="COW-vars">1. Regular variable copies</h4>
<p>When a variable is copied, as long as the data is not modified, both variables actually use the same shared memory block. The data is only copied onto a newly-allocated memory block when one of the variables is modified. The modified variable is assigned the newly-allocated block of memory, which is initialized with the values in the shared memory block before being updated:</p>
<pre lang='matlab'>
data1 = magic(5000);  % 5Kx5K elements = 191 MB
data2 = data1;        % data1 & data2 share memory; no allocation done
data2(1,1) = 0;       % data2 allocated, copied and only then modified
</pre>
<p>If we profile our code using any of <a target="_blank" href="/articles/profiling-matlab-memory-usage/">Matlab&#8217;s memory-profiling options</a>, we will see that the copy operation <code>data2=data1</code> takes negligible time to run and allocates no memory. On the other hand, the simple update operation <code>data2(1,1)=0</code>, which we could otherwise have assumed to take minimal time and memory, actually takes a relatively long time and allocates 191 MB of memory.<br />
<center><figure style="width: 449px" class="wp-caption aligncenter"><img decoding="async" alt="Copy-on-write effect monitored using the Profiler's -memory option" src="https://undocumentedmatlab.com/images/Copy-on-Write1c.png" title="Copy-on-write effect monitored using the Profiler's -memory option" width="449" height="96"/><figcaption class="wp-caption-text">Copy-on-write effect monitored using the Profiler's -memory option</figcaption></figure><br />
<figure style="width: 439px" class="wp-caption aligncenter"><img decoding="async" alt="Copy-on-write effect monitored using Windows Process Explorer" src="https://undocumentedmatlab.com/images/Copy-on-Write1a.png" title="Copy-on-write effect monitored using Windows Process Explorer" width="439" height="515"/><figcaption class="wp-caption-text">Copy-on-write effect monitored using Windows Process Explorer</figcaption></figure></center><br />
We first see a memory spike (used during the computation of the magic square data), closely followed by a leveling off at 190.7MB above the baseline (this is due to the allocation of data1). Copying <code>data2=data1</code> has no discernible effect on either CPU or memory. Only when we set <code>data2(1,1)=0</code> does CPU activity spike again, as the extra 190MB is allocated for data2. When we exit the test function, data1 and data2 are both deallocated, returning the Matlab process memory to its baseline level.<br />
There are several lessons that we can draw from this simple example:<br />
Firstly, creating copies of data does not necessarily or immediately impact memory and performance. Rather, it is the update of these copies which may be problematic. If we can modify our code to use more read-only data and less updated data copies, then we would improve performance. The Profiler report will show us exactly where in our code we have memory and CPU hotspots – these are the places we should consider optimizing.<br />
Secondly, when we see such odd behavior in our Profiler reports (i.e., memory and/or CPU spikes that occur on seemingly innocent code lines), we should be aware of the copy-on-write mechanism, which could be the cause for the behavior.</p>
<h4 id="COW-functions">2. Function input parameters</h4>
<p>The copy-on-write mechanism behaves similarly for function input parameters: whenever a function is invoked (called) with input data, the caller&#8217;s memory block is shared up until the point that one of its copies is modified. At that point the copies diverge: a new memory block is allocated, populated with data from the shared memory block, and assigned to the modified variable. Only then is the update done on the new memory block.</p>
<pre lang='matlab'>
data1 = magic(5000);      % 5Kx5K elements = 191 MB
data2 = perfTest(data1);
function outData = perfTest(inData)
   outData = inData;   % inData & outData share memory; no allocation
   outData(1,1) = 0;   % outData allocated, copied and only then modified
end
</pre>
<p><center><figure style="width: 534px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" alt="Copy-on-write effect monitored using the Profiler's -memory option" src="https://undocumentedmatlab.com/images/Copy-on-Write2a.png" title="Copy-on-write effect monitored using the Profiler's -memory option" width="534" height="82"/><figcaption class="wp-caption-text">Copy-on-write effect monitored using the Profiler's -memory option</figcaption></figure></center><br />
One lesson that can be drawn from this is that whenever possible we should attempt to use functions that do not modify their input data. This is particularly true if the modified input data is very large. Read-only functions will be faster than functions that do even the simplest of data updates.<br />
Another lesson is that, perhaps counter-intuitively, passing read-only data to functions as input parameters makes no difference from a performance standpoint. We might think that passing large data objects around as function parameters involves multiple memory allocations and deallocations of the data. In fact, it is only the data&#8217;s reference (or more precisely, its <a target="_blank" href="/articles/matlabs-internal-memory-representation/">mxArray structure</a>) which is passed around and placed on the function&#8217;s call stack. Since this reference/structure is quite small, there is no real performance penalty, and passing data this way benefits code clarity and maintainability.<br />
The only case where we may wish to use other means of passing data to functions is when a large data object needs to be updated. In such cases, the updated copy will be allocated to a new memory block with an associated performance cost.</p>
<h3 id="inplace">In-place data manipulation</h3>
<p>Matlab&#8217;s interpreter, at least in recent releases, has a very sophisticated algorithm for using in-place data manipulation (<a target="_blank" rel="nofollow" href="http://blogs.mathworks.com/loren/2007/03/22/in-place-operations-on-data">report</a>). Modifying data in-place means that the original data block is modified, rather than creating a new block with the modified data, thus saving any memory allocations and deallocations.<br />
For example, let us manipulate a simple 4Kx4K (122MB) numeric array:</p>
<pre lang='matlab'>
>> m = magic(4000);   % 4Kx4K = 122MB
>> memory
Maximum possible array:            1022 MB (1.072e+09 bytes)
Memory available for all arrays:   1218 MB (1.278e+09 bytes)
Memory used by MATLAB:              709 MB (7.434e+08 bytes)
Physical Memory (RAM):             3002 MB (3.148e+09 bytes)
% In-place array data manipulation: no memory allocated
>> m = m * 0.5;
>> memory
Maximum possible array:            1022 MB (1.072e+09 bytes)
Memory available for all arrays:   1214 MB (1.273e+09 bytes)
Memory used by MATLAB:              709 MB (7.434e+08 bytes)
Physical Memory (RAM):             3002 MB (3.148e+09 bytes)
% New variable allocated, taking an extra 122MB of memory
>> m2 = m * 0.5;
>> memory
Maximum possible array:            1022 MB (1.072e+09 bytes)
Memory available for all arrays:   1092 MB (1.145e+09 bytes)
Memory used by MATLAB:              831 MB (8.714e+08 bytes)
Physical Memory (RAM):             3002 MB (3.148e+09 bytes)
</pre>
<p>The extra memory allocation of the not-in-place manipulation naturally translates into a performance loss:</p>
<pre lang='matlab'>
% In-place data manipulation, no memory allocation
>> tic, m = m * 0.5; toc
Elapsed time is 0.056464 seconds.
% Regular data manipulation (122MB allocation) – 50% slower
>> clear m2; tic, m2 = m * 0.5; toc;
Elapsed time is 0.084770 seconds.
</pre>
<p>The difference may not seem large, but placed in a loop it could become significant indeed, and might be much more important if virtual memory swapping comes into play, or when Matlab&#8217;s memory space is exhausted (out-of-memory error).<br />
Similarly, when returning data from a function, we should try to update the original data variable whenever possible, <a target="_blank" rel="nofollow" href="http://www.mathworks.com/company/newsletters/news_notes/june07/patterns.html">avoiding the need for allocation</a> of a new variable:</p>
<pre lang='matlab'>
% In-place data manipulation, no memory allocation
>> d=0:1e-7:1; tic, d = sin(d); toc
Elapsed time is 0.083397 seconds.
% Regular data manipulation (76MB allocation) – 50% slower
>> clear d2, d=0:1e-7:1; tic, d2 = sin(d); toc
Elapsed time is 0.121415 seconds.
</pre>
<p>Within the function itself we should ensure that we return the modified input variable, and not assign the output to a new variable, so that in-place optimization can also be applied within the function. The in-place optimization mechanism is smart enough to override Matlab&#8217;s default copy-on-write mechanism, which automatically allocates a new copy of the data when it sees that the input data is modified:</p>
<pre lang='matlab'>
% Suggested practice: use in-place optimization within functions
function x = function1(x)
   x = someOperationOn(x);   % temporary variable x is NOT allocated
end
% Standard practice: prevents future use of in-place optimizations
function y = function2(x)
   y = someOperationOn(x);   % new temporary variable y is allocated
end
</pre>
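<p>For completeness, here is a hypothetical caller sketch (demoInPlace is an assumed name) that combines both requirements &#8211; assigning the result back to the same variable, and calling an in-place-optimizable function such as function1 above:</p>
<pre lang='matlab'>
function demoInPlace()
   x = rand(4000);    % large 4Kx4K array (122MB)
   x = function1(x);  % same variable on both sides: eligible for in-place
end
</pre>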
<p>In order to benefit from in-place optimization of function results, we must both use the same variable in the caller workspace (x = function1(x)) and ensure that the called function is optimizable (e.g., function x = function1(x)) &#8211; if either of these two requirements is not met, the in-place function-call optimization is not performed.<br />
Also, for the in-place optimization to be active, we need to call the in-place function from within another function, not from a script or the Matlab Command Window.<br />
A related performance trick is to use masks on the original data rather than temporary data copies. For example, suppose we wish to get the result of a function that acts on only a portion of some large data. If we create a temporary variable that holds the data subset and then process it, it would create an unnecessary copy of the original data:</p>
<pre lang='matlab'>
% Original data
data = 0 : 1e-7 : 1;     % 10^7 elements, 76MB allocated
% Unnecessary copy of part of data into data2 (extra 7.6MB allocated)
data2 = data(data<0.1);  % 10^6 elements, 7.6MB allocated
results = sin(data2);    % another 10^6 elements, 7.6MB allocated
% Using a mask on the original data obviates the temporary variable data2:
results = sin(data(data<0.1));  % no need for the data2 allocation
</pre>
<p>A note of caution: we should not invest undue effort in in-place data manipulation if the overall benefit would be negligible. It only really helps when we face a genuine memory limitation and the data array is very large.<br />
Matlab in-place optimization is a topic of continuous development. Code which is not in-place optimized today (for example, in-place manipulation on class object properties) may possibly be optimized in next year&#8217;s release. For this reason, it is important to write the code in a way that would facilitate the future optimization (for example, obj.x=2*obj.x rather than y=2*obj.x).<br />
Some in-place optimizations were added to the JIT Accelerator as early as Matlab 6.5 R13, but Matlab 7.3 R2006b saw a major boost. As Matlab&#8217;s JIT Accelerator improves from release to release, we should expect in-place data manipulations to be automatically applied in an increasingly larger number of code cases.<br />
In some older Matlab releases, and in some complex data manipulations where the JIT Accelerator cannot implement in-place processing, temporary storage is allocated and then assigned to the original variable when the computation is done. In such cases we could implement in-place data manipulation by developing an external function (e.g., <a target="_blank" href="/articles/matlab-mex-in-place-editing/">using Mex</a>) that works directly on the original data block. Note that the officially supported mex update method is to always create deep copies of the data using <i>mxDuplicateArray()</i> and then modify the new array rather than the original; modifying the original data directly is both <a target="_blank" rel="nofollow" href="http://stackoverflow.com/questions/1708433/matlab-avoiding-memory-allocation-in-mex">discouraged</a> and <a target="_blank" rel="nofollow" href="http://blogs.mathworks.com/loren/2007/03/22/in-place-operations-on-data/#comment-16202">not officially supported</a>. Doing it incorrectly can easily crash Matlab. If you do directly overwrite the original input data, at least ensure that you <a target="_blank" rel="nofollow" href="http://www.mk.tu-berlin.de/Members/Benjamin/mex_sharedArrays">unshare any variables</a> that share the same data memory block, thus mimicking the copy-on-write mechanism.<br />
Using Matlab&#8217;s internal in-place data manipulation is very useful, especially since it is done automatically without need for any major code changes on our part. But sometimes we need certainty of actually processing the original data variable without having to guess or check whether the automated in-place mechanism will be activated or not. This can be achieved using several alternatives:</p>
<ul>
<li>Using global or persistent variable</li>
<li>Using a parent-scope variable within a nested function</li>
<li>Modifying a reference (handle class) object&#8217;s internal properties</li>
</ul>
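<p>For example, the nested-function alternative can be sketched as follows (outerFunc and scaleData are hypothetical names); the nested function updates the parent-scope variable directly, without the data ever being passed as a parameter:</p>
<pre lang='matlab'>
function outerFunc()
   data = rand(4000);  % parent-scope variable shared with nested functions
   scaleData(0.5);     % updates data without passing a copy
   function scaleData(factor)
      data = data * factor;  % modifies the shared parent-scope variable
   end
end
</pre>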
<p>The post <a rel="nofollow" href="https://undocumentedmatlab.com/articles/internal-matlab-memory-optimizations">Internal Matlab memory optimizations</a> appeared first on <a rel="nofollow" href="https://undocumentedmatlab.com">Undocumented Matlab</a>.</p>
<div class='yarpp-related-rss'>
<h3>Related posts:</h3><ol>
<li><a href="https://undocumentedmatlab.com/articles/matlabs-internal-memory-representation" rel="bookmark" title="Matlab&#039;s internal memory representation">Matlab&#039;s internal memory representation </a> <small>Matlab's internal memory structure is explored and discussed. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/profiling-matlab-memory-usage" rel="bookmark" title="Profiling Matlab memory usage">Profiling Matlab memory usage </a> <small>mtic and mtoc were a couple of undocumented features that enabled users of past Matlab releases to easily profile memory usage. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/matlab-java-memory-leaks-performance" rel="bookmark" title="Matlab-Java memory leaks, performance">Matlab-Java memory leaks, performance </a> <small>Internal fields of Java objects may leak memory - this article explains how to avoid this without sacrificing performance. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/couple-of-bugs-and-workarounds" rel="bookmark" title="A couple of internal Matlab bugs and workarounds">A couple of internal Matlab bugs and workarounds </a> <small>A couple of undocumented Matlab bugs have simple workarounds. ...</small></li>
</ol>
</div>
]]></content:encoded>
					
					<wfw:commentRss>https://undocumentedmatlab.com/articles/internal-matlab-memory-optimizations/feed</wfw:commentRss>
			<slash:comments>7</slash:comments>
		
		
			</item>
		<item>
		<title>Array resizing performance</title>
		<link>https://undocumentedmatlab.com/articles/array-resizing-performance?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=array-resizing-performance</link>
					<comments>https://undocumentedmatlab.com/articles/array-resizing-performance#comments</comments>
		
		<dc:creator><![CDATA[Yair Altman]]></dc:creator>
		<pubDate>Wed, 23 May 2012 20:43:03 +0000</pubDate>
				<category><![CDATA[Low risk of breaking in future versions]]></category>
		<category><![CDATA[Memory]]></category>
		<category><![CDATA[Stock Matlab function]]></category>
		<category><![CDATA[Undocumented feature]]></category>
		<category><![CDATA[JIT]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Pure Matlab]]></category>
		<guid isPermaLink="false">http://undocumentedmatlab.com/?p=2949</guid>

					<description><![CDATA[<p>Several alternatives are explored for dynamic array growth performance in Matlab loops. </p>
<p>The post <a rel="nofollow" href="https://undocumentedmatlab.com/articles/array-resizing-performance">Array resizing performance</a> appeared first on <a rel="nofollow" href="https://undocumentedmatlab.com">Undocumented Matlab</a>.</p>
<div class='yarpp-related-rss'>
<h3>Related posts:</h3><ol>
<li><a href="https://undocumentedmatlab.com/articles/preallocation-performance" rel="bookmark" title="Preallocation performance">Preallocation performance </a> <small>Preallocation is a standard Matlab speedup technique. Still, it has several undocumented aspects. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/performance-accessing-handle-properties" rel="bookmark" title="Performance: accessing handle properties">Performance: accessing handle properties </a> <small>Handle object property access (get/set) performance can be significantly improved using dot-notation. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/convolution-performance" rel="bookmark" title="Convolution performance">Convolution performance </a> <small>Matlab's internal implementation of convolution can often be sped up significantly using the Convolution Theorem. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/zero-testing-performance" rel="bookmark" title="Zero-testing performance">Zero-testing performance </a> <small>Subtle changes in the way that we test for zero/non-zero entries in Matlab can have a significant performance impact. ...</small></li>
</ol>
</div>
]]></description>
<content:encoded><![CDATA[<p>As I <a target="_blank" href="/articles/preallocation-performance/">explained</a> last week, the best way to avoid the performance penalties associated with dynamic array resizing (typically, growth) in Matlab is to pre-allocate the array to its expected final size. I showed different alternatives for such preallocation, but in all cases the performance is much better than with a naïve dynamic resize.<br />
Unfortunately, such simple preallocation is not always possible. Fortunately, all is not lost: there are still a few things we can do to mitigate the performance pain. As with last week&#8217;s article, there is more here than meets the eye at first sight.<br />
The interesting <a target="_blank" rel="nofollow" href="https://www.mathworks.com/matlabcentral/newsreader/view_thread/102704">newsgroup thread from 2005</a> about this issue that I mentioned last week contains two main solutions to this problem. The effect of these solutions is negligible for small data sizes and/or few loop iterations (i.e., few memory reallocations), but could be dramatic for large data arrays and/or a large number of memory reallocations. It could well mean the difference between a usable and an unusable (&#8220;hang&#8221;) program:</p>
<h3 id="Chunks">Factor growth: dynamic allocation by chunks</h3>
<p>The idea is to dynamically grow the array by a certain percentage factor each time. When the array first needs to grow by a single element, we would in fact grow it by a larger chunk (say 40% of the current array size, for example by using the <i><b>repmat</b></i> function, or by concatenating a specified number of <i><b>zeros</b></i>, or by setting some way-forward index to 0), so that it would take the program some time before it needs to reallocate memory.<br />
This method has a theoretical cost of N&middot;log(N), which is nearly linear in N for most practical purposes. It is similar to preallocation in the sense that we are preparing a chunk of memory for future array use in advance. You might say that this is on-the-fly preallocation.</p>
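<p>This chunked-growth idea can be sketched as follows (a minimal illustration; the 40% growth factor, loop bound and variable names are arbitrary choices, not part of any standard utility):</p>
<pre lang='matlab'>
% Grow the data array by ~40% whenever it runs out of space
data = zeros(1,1000);     % initial capacity
numUsed = 0;
for idx = 1 : 50000
   if numUsed == numel(data)
      data(ceil(1.4*numel(data))) = 0;   % single reallocation, ~40% headroom
   end
   numUsed = numUsed + 1;
   data(numUsed) = idx;   % the newly-computed value
end
data = data(1:numUsed);   % finally, trim the unused tail
</pre>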
<h3 id="Cells">Using cell arrays</h3>
<p>The idea here is to use cell arrays to store and grow the data, then use cell2mat to convert the resulting cell array to a regular numeric array. Cell elements are <a target="_blank" rel="nofollow" href="http://www.mathworks.com/help/techdoc/matlab_prog/brh72ex-25.html#brh72ex-38">implemented as references</a> to distinct memory blocks, so concatenating an object to a cell array merely concatenates its reference; when a cell array is reallocated, only its internal references (not the referenced data) are moved. Note that this relies on the internal implementation of cell arrays in Matlab, and may possibly change in some future release.<br />
Like factor growth, using cell arrays is faster than quadratic behavior (although <a target="_blank" rel="nofollow" href="http://abandonmatlab.wordpress.com/2009/07/28/no-lists/#comment-113">not quite as fast</a> as we would have liked, of course). Different situations may favor either the cell-arrays method or the factor-growth mechanism.</p>
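<p>A minimal sketch of the cell-array approach (assuming, purely for illustration, that each iteration produces a 1&#215;5 row of data):</p>
<pre lang='matlab'>
dataCells = {};
for idx = 1 : 10000
   newRow = rand(1,5);          % some newly-computed data
   dataCells{end+1} = newRow;   % only the small reference is stored/moved
end
data = cell2mat(dataCells');    % convert into a regular 10000x5 numeric array
</pre>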
<h3 id="Growdata">The <i><b>growdata</b></i> utility</h3>
<p>John D&#8217;Errico has posted a well-researched utility called <a target="_blank" rel="nofollow" href="http://www.mathworks.com/matlabcentral/fileexchange/8334-incremental-growth-of-an-array-revisited"><i><b>growdata</b></i></a> that optimizes dynamic array growth for maximal performance. It is based in part on ideas mentioned in the aforementioned 2005 newsgroup thread, where <i><b>growdata</b></i> is also discussed in detail.<br />
As an interesting side-note, John D&#8217;Errico also recently posted an extremely fast <a target="_blank" rel="nofollow" href="http://www.mathworks.com/matlabcentral/fileexchange/34766-the-fibonacci-sequence">implementation</a> of the Fibonacci function. The source code may seem complex, but the resulting performance gain is well worth the extra complexity. I believe that readers who study this utility&#8217;s source code and understand its underlying logic will gain insight into several performance tricks that are very useful in general.</p>
<h3 id="JIT">Effects of incremental JIT improvements</h3>
<p>The introduction of JIT Acceleration in Matlab 6.5 (R13) caused a dramatic boost in performance (there is an internal distinction between the Accelerator and JIT: JIT is <a target="_blank" rel="nofollow" href="http://www.mathworks.com/matlabcentral/fileexchange/18510-matlab-performance-measurement/content/Documents/MATLABperformance/configinfo.m">apparently</a> only part of the Matlab Accelerator, but this distinction appears to have no practical impact on the discussion here).<br />
Over the years, MathWorks has consistently improved the efficiency of its computational engine, and the JIT Accelerator in particular, giving a small improvement with each new Matlab release. In Matlab 7.11 (R2010b), the short Fibonacci snippet used in last week&#8217;s article executed about 30% faster than on Matlab 7.1 (R14 SP3). The behavior was still quadratic in nature, so in those releases using any of the above-mentioned solutions could prove very beneficial.<br />
In Matlab 7.12 (R2011a), a major improvement was made to the Matlab engine (JIT?). Execution run-times improved significantly and, moreover, became linear in nature. This means that multiplying the array size by N only degrades performance by N, not N<sup>2</sup> – an impressive achievement:</p>
<pre lang='matlab'>
% This was run on Matlab 7.14 (R2012a):
clear f, tic, f=[0,1]; for idx=3:10000, f(idx)=f(idx-1)+f(idx-2); end, toc
   => Elapsed time is 0.004924 seconds.  % baseline loop size, & exec time
clear f, tic, f=[0,1]; for idx=3:20000, f(idx)=f(idx-1)+f(idx-2); end, toc
   => Elapsed time is 0.009971 seconds.  % x2 loop size, x2 execution time
clear f, tic, f=[0,1]; for idx=3:40000, f(idx)=f(idx-1)+f(idx-2); end, toc
   => Elapsed time is 0.019282 seconds.  % x4 loop size, x4 execution time
</pre>
<p>In fact, it turns out that using either the cell arrays method or the factor growth mechanism is much slower in R2011a than using the naïve dynamic growth!<br />
This teaches us a very important lesson: it is not wise to program against a specific implementation of the engine, at least not in the long run. While doing so may yield performance benefits on some Matlab releases, the situation may well be reversed in a future release. This would force us to retest, re-profile and potentially rewrite significant portions of code for each new release, which is obviously not maintainable. In practice, most code written on an old Matlab release is likely to be carried over with minimal changes to newer releases. If this code contains release-specific tuning, we could be shooting ourselves in the foot in the long run.<br />
MathWorks strongly <a target="_blank" rel="nofollow" href="http://blogs.mathworks.com/loren/2008/06/25/speeding-up-matlab-applications/#comment-29607">advises</a> (and <a target="_blank" rel="nofollow" href="https://www.mathworks.com/matlabcentral/newsreader/view_thread/284759#784131">again</a>, and once <a target="_blank" href="/articles/undocumented-profiler-options/#comment-64">again</a>), and I concur, to program in a natural manner, rather than in a way that is tailored to a particular Matlab release (unless of course we can be certain that we shall only be using that release and none other). This will improve development time, maintainability and in the long run also performance.<br />
<i>(and of course you could say that a corollary lesson is to hurry up and get the latest Matlab release&#8230;)</i></p>
<h3 id="Variants">Variants for array growth</h3>
<p>If we are left with using a naïve dynamic resize, there are several equivalent alternatives for doing this, having significantly different performances:</p>
<pre lang='matlab'>
% This was run on Matlab 7.12 (R2011a):
% Variant #1: direct assignment into a specific out-of-bounds index
data=[]; tic, for idx=1:100000; data(idx)=1; end, toc
   => Elapsed time is 0.075440 seconds.
% Variant #2: direct assignment into an index just outside the bounds
data=[]; tic, for idx=1:100000; data(end+1)=1; end, toc
   => Elapsed time is 0.241466 seconds.    % 3 times slower
% Variant #3: concatenating a new value to the array
data=[]; tic, for idx=1:100000; data=[data,1]; end, toc
   => Elapsed time is 22.897688 seconds.   % 300 times slower!!!
</pre>
<p>As can be seen, it is much faster to directly index an out-of-bounds element as a means to force Matlab to enlarge a data array, rather than using the end+1 notation, which needs to recalculate the value of end each time.<br />
In any case, try to avoid using the concatenation variant, which is significantly slower than either of the other two alternatives (300 times slower in the above example!). In this respect, there is no discernible difference between using the [] operator or the <i><b>cat</b>()</i> function for the concatenation.<br />
Apparently, the JIT performance boost gained in Matlab R2011a does not work for concatenation. Future JIT improvements may possibly also improve the performance of concatenations, but in the meantime it is better to use direct indexing instead.<br />
The effect of the JIT performance boost is easily seen when we run the same variants on pre-R2011a Matlab releases. The corresponding values are 30.9, 34.8 and 34.3 seconds. Using direct indexing is still the fastest approach, but concatenation is now only 10% slower, not 300 times slower.<br />
When we need to append a non-scalar element (for example, a 2D matrix) to the end of an array, we might think that we have no choice but to use the slow concatenation method. This assumption is incorrect: we can still use the much faster direct-indexing method, as shown below (notice the non-linear growth in execution time for the concatenation variant):</p>
<pre lang='matlab'>
% This was run on Matlab 7.12 (R2011a):
matrix = magic(3);
% Variant #1: direct assignment – fast and linear cost
data=[]; tic, for idx=1:10000; data(:,(idx*3-2):(idx*3))=matrix; end, toc
   => Elapsed time is 0.969262 seconds.
data=[]; tic, for idx=1:100000; data(:,(idx*3-2):(idx*3))=matrix; end, toc
   => Elapsed time is 9.558555 seconds.
% Variant #2: concatenation – much slower, quadratic cost
data=[]; tic, for idx=1:10000; data=[data,matrix]; end, toc
   => Elapsed time is 2.666223 seconds.
data=[]; tic, for idx=1:100000; data=[data,matrix]; end, toc
   => Elapsed time is 356.567582 seconds.
</pre>
<p>As the size of the array enlargement element (in this case, a 3&#215;3 matrix) increases, the computer needs to allocate more memory space more frequently, thereby increasing execution time and the importance of preallocation. Even if the system has an internal memory-management mechanism that enables it to expand into adjacent (contiguous) empty memory space, as the size of the enlargement grows the empty space will run out sooner and a new larger memory block will need to be allocated more frequently than in the case of small incremental enlargements of a single 8-byte double.</p>
<h3 id="Alternatives">Other alternatives</h3>
<p>If preallocation is not possible, JIT is not very helpful, vectorization is out of the question, and rewriting the problem so that it doesn&#8217;t need dynamic array growth is impossible &#8211; if all these are not an option, then consider using one of the following alternatives for array growth (read again the interesting <a target="_blank" rel="nofollow" href="https://www.mathworks.com/matlabcentral/newsreader/view_thread/102704">newsgroup thread from 2005</a> about this issue):</p>
<ul>
<li>Dynamically grow the array by a certain percentage factor each time the array runs out of space (on-the-fly preallocation)</li>
<li>Use John D&#8217;Errico&#8217;s <i><b>growdata</b></i> utility</li>
<li>Use cell arrays to store and grow the data, then use cell2mat to convert the resulting cell array to a regular numeric array</li>
<li>Reuse an existing data array that has the necessary storage space</li>
<li>Wrap the data in a referential object (a class object that inherits from handle), then append the reference handle rather than the original data (<a target="_blank" rel="nofollow" href="http://stackoverflow.com/questions/276198/matlab-class-array">ref</a>). Note that if your class object does not inherit from handle, it is not a referential object but rather a value object, and as such it will be appended in its entirety to the array data, losing any performance benefits. Of course, it may not always be possible to wrap our class objects as a handle.<br />
References have a much smaller memory footprint than the objects that they reference. The objects themselves remain somewhere in memory and do not need to be moved whenever the data array is enlarged and reallocated – only the small-footprint references are moved, which is much faster. This is also the reason that cell concatenation is faster than array concatenation for large objects.</li>
</ul>
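<p>The referential-object alternative can be sketched as follows (<i>DataWrapper</i> is a hypothetical class name used here purely for illustration):</p>
<pre lang='matlab'>
% In DataWrapper.m - a minimal referential (handle) wrapper class:
classdef DataWrapper < handle
   properties
      data
   end
   methods
      function obj = DataWrapper(data)
         obj.data = data;
      end
   end
end

% Elsewhere: growing an array of wrappers only moves the small handles;
% the wrapped 100x100 matrices are never copied when the array is reallocated
items = DataWrapper.empty;
for idx = 1 : 1000
   items(idx) = DataWrapper(magic(100));
end
</pre>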
<p>The post <a rel="nofollow" href="https://undocumentedmatlab.com/articles/array-resizing-performance">Array resizing performance</a> appeared first on <a rel="nofollow" href="https://undocumentedmatlab.com">Undocumented Matlab</a>.</p>
<div class='yarpp-related-rss'>
<h3>Related posts:</h3><ol>
<li><a href="https://undocumentedmatlab.com/articles/preallocation-performance" rel="bookmark" title="Preallocation performance">Preallocation performance </a> <small>Preallocation is a standard Matlab speedup technique. Still, it has several undocumented aspects. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/performance-accessing-handle-properties" rel="bookmark" title="Performance: accessing handle properties">Performance: accessing handle properties </a> <small>Handle object property access (get/set) performance can be significantly improved using dot-notation. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/convolution-performance" rel="bookmark" title="Convolution performance">Convolution performance </a> <small>Matlab's internal implementation of convolution can often be sped up significantly using the Convolution Theorem. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/zero-testing-performance" rel="bookmark" title="Zero-testing performance">Zero-testing performance </a> <small>Subtle changes in the way that we test for zero/non-zero entries in Matlab can have a significant performance impact. ...</small></li>
</ol>
</div>
]]></content:encoded>
					
					<wfw:commentRss>https://undocumentedmatlab.com/articles/array-resizing-performance/feed</wfw:commentRss>
			<slash:comments>7</slash:comments>
		
		
			</item>
		<item>
		<title>Preallocation performance</title>
		<link>https://undocumentedmatlab.com/articles/preallocation-performance?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=preallocation-performance</link>
					<comments>https://undocumentedmatlab.com/articles/preallocation-performance#comments</comments>
		
		<dc:creator><![CDATA[Yair Altman]]></dc:creator>
		<pubDate>Wed, 16 May 2012 12:14:46 +0000</pubDate>
				<category><![CDATA[Low risk of breaking in future versions]]></category>
		<category><![CDATA[Memory]]></category>
		<category><![CDATA[Stock Matlab function]]></category>
		<category><![CDATA[Undocumented feature]]></category>
		<category><![CDATA[JIT]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Pure Matlab]]></category>
		<guid isPermaLink="false">http://undocumentedmatlab.com/?p=2940</guid>

					<description><![CDATA[<p>Preallocation is a standard Matlab speedup technique. Still, it has several undocumented aspects. </p>
<p>The post <a rel="nofollow" href="https://undocumentedmatlab.com/articles/preallocation-performance">Preallocation performance</a> appeared first on <a rel="nofollow" href="https://undocumentedmatlab.com">Undocumented Matlab</a>.</p>
<div class='yarpp-related-rss'>
<h3>Related posts:</h3><ol>
<li><a href="https://undocumentedmatlab.com/articles/zero-testing-performance" rel="bookmark" title="Zero-testing performance">Zero-testing performance </a> <small>Subtle changes in the way that we test for zero/non-zero entries in Matlab can have a significant performance impact. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/performance-scatter-vs-line" rel="bookmark" title="Performance: scatter vs. line">Performance: scatter vs. line </a> <small>In many circumstances, the line function can generate visually-identical plots as the scatter function, much faster...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/array-resizing-performance" rel="bookmark" title="Array resizing performance">Array resizing performance </a> <small>Several alternatives are explored for dynamic array growth performance in Matlab loops. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/allocation-performance-take-2" rel="bookmark" title="Allocation performance take 2">Allocation performance take 2 </a> <small>The clear function has some non-trivial effects on Matlab performance. ...</small></li>
</ol>
</div>
]]></description>
										<content:encoded><![CDATA[<p>Array <a target="_blank" rel="nofollow" href="http://www.mathworks.com/help/techdoc/matlab_prog/f8-784135.html#f8-793781">preallocation</a> is a standard and quite well-known technique for improving Matlab loop run-time performance. Today&#8217;s article will show that there is more than meets the eye for even such a simple coding technique.<br />
A note of caution: in the examples that follow, don&#8217;t take any speedup as the value to expect on your own system &#8211; the actual value may well be different. Your mileage may vary. I only mean to show the relative differences between the alternatives.</p>
<h3 id="problem">The underlying problem</h3>
<p>Memory management has a direct influence on performance. I have already shown <a target="_blank" href="/articles/matlab-java-memory-leaks-performance/">some examples of this</a> in past articles here.<br />
Preallocation solves a basic problem in simple program loops, where an array is iteratively enlarged with new data (dynamic array growth). Unlike other programming languages (such as C, C++, C# or Java) that use static typing,  Matlab uses dynamic typing. This means that it is natural and easy to modify array size dynamically during program execution. For example:</p>
<pre lang='matlab'>
fibonacci = [0, 1];
for idx = 3 : 100
   fibonacci(idx) = fibonacci(idx-1) + fibonacci(idx-2);
end
</pre>
<p>While this may be simple to program, it is not wise with regards to performance. The reason is that whenever an array is resized (typically enlarged), Matlab allocates an entirely new contiguous block of memory for the array, copying the old values from the previous block to the new, then releasing the old block for potential reuse. This operation takes time to execute. In some cases, this reallocation might require accessing virtual memory and page swaps, which would take an even longer time to execute. If the operation is done in a loop, then performance could quickly drop off a cliff.<br />
The cost of such naïve array growth is theoretically quadratic. This means that multiplying the number of elements by N multiplies the execution time by about N<sup>2</sup>. The reason for this is that Matlab needs to reallocate N times more than before, and each time takes N times longer due to the larger allocation size (the average block size multiplies by N), and N times more data elements to copy from the old to the new memory blocks.<br />
A very interesting discussion of this phenomenon and various solutions can be found in a <a target="_blank" rel="nofollow" href="https://www.mathworks.com/matlabcentral/newsreader/view_thread/102704">newsgroup thread from 2005</a>. Three main solutions were presented: preallocation, selective dynamic growth (<i>allocating headroom</i>) and using cell arrays. The best solution among these in terms of ease of use and performance is preallocation.</p>
<h3 id="basics">The basics of pre-allocation</h3>
<p>The basic idea of preallocation is to create a data array in the final expected size before actually starting the processing loop. This saves any reallocations within the loop, since all the data array elements are already available and can be accessed. This solution is useful when the final size is known in advance, as the following snippet illustrates:</p>
<pre lang='matlab'>
% Regular dynamic array growth:
tic
fibonacci = [0,1];
for idx = 3 : 40000
   fibonacci(idx) = fibonacci(idx-1) + fibonacci(idx-2);
end
toc
   => Elapsed time is 0.019954 seconds.
% Now use preallocation – 5 times faster than dynamic array growth:
tic
fibonacci = zeros(40000,1);
fibonacci(1)=0; fibonacci(2)=1;
for idx = 3 : 40000,
   fibonacci(idx) = fibonacci(idx-1) + fibonacci(idx-2);
end
toc
   => Elapsed time is 0.004132 seconds.
</pre>
<p>On pre-R2011a releases the effect of preallocation is even more pronounced: I got a 35-times speedup on the same machine using Matlab 7.1 (R14 SP3). R2011a (Matlab 7.12) had a dramatic performance boost for such cases in the internal accelerator, so newer releases are much faster in dynamic allocations, but preallocation is still 5 times faster even on R2011a.</p>
<h3 id="nondeterministic">Non-deterministic pre-allocation</h3>
<p>Because the effect of preallocation is so dramatic on all Matlab releases, it makes sense to utilize it even in cases where the data array&#8217;s final size is not known in advance. We can do this by estimating an upper bound to the array&#8217;s size, preallocate this large size, and when we&#8217;re done remove any excess elements:</p>
<pre lang='matlab'>
% The final array size is unknown – assume 1Kx3K upper bound (~23MB)
data = zeros(1000,3000);  % estimated maximal size
numRows = 0;
numCols = 0;
while (someCondition)
   colIdx = someValue1;   numCols = max(numCols,colIdx);
   rowIdx = someValue2;   numRows = max(numRows,rowIdx);
   data(rowIdx,colIdx) = someOtherValue;
end
% Now remove any excess elements
data(:,numCols+1:end) = [];   % remove excess columns
data(numRows+1:end,:) = [];   % remove excess rows
</pre>
<h3 id="variants">Variants for pre-allocation</h3>
<p>It turns out that MathWorks&#8217; <a target="_blank" rel="nofollow" href="http://www.mathworks.com/help/techdoc/matlab_prog/f8-784135.html#f8-793795">official suggestion</a> for preallocation, namely using the <i><b>zeros</b></i> function, is not the most efficient:</p>
<pre lang='matlab'>
% MathWorks suggested variant
clear data1, tic, data1 = zeros(1000,3000); toc
   => Elapsed time is 0.016907 seconds.
% A much faster alternative - 500 times faster!
clear data1, tic, data1(1000,3000) = 0; toc
   => Elapsed time is 0.000034 seconds.
</pre>
<p>The reason the second variant is so much faster is that it only allocates the memory, without worrying about the internal values (they get a default of 0, <i>false</i> or '' (the empty string), in case you wondered). On the other hand, <i><b>zeros</b></i> has to place a value in each of the allocated locations, which takes precious time.<br />
In most cases the differences are immaterial since the preallocation code would only run once in the program, and an extra 17ms isn&#8217;t such a big deal. But in some cases we may have a need to periodically refresh our data, where the extra run-time could quickly accumulate.<br />
<b><u>Update (October 27, 2015)</u></b>: As Marshall <a href="/articles/preallocation-performance#comment-359956">notes below</a>, this behavior changed in R2015b when <a target="_blank" href="/articles/callback-functions-performance">the new LXE</a> (Matlab&#8217;s new execution engine) replaced the previous engine. In R2015b, the <i><b>zeros</b></i> function is faster than the alternative of just setting the last array element to 0. Similar changes may also have occurred to the following post content, so if you are using R2015b onward, be sure to test carefully on your specific system.</p>
<h3 id="non-default">Pre-allocating non-default values</h3>
<p>When we need to preallocate a specific value into every data array element, we cannot use the faster variant above (setting only the last array element), since it assigns all other array elements the default value (0, '' or false, depending on the array&#8217;s data type). In this case, we can use one of the following alternatives (with their associated timings for a 1000&#215;3000 data array):</p>
<pre lang='matlab'>
scalar = pi;  % for example...
data = scalar(ones(1000,3000));           % Variant A: 87.680 msecs
data(1:1000,1:3000) = scalar;             % Variant B: 28.646 msecs
data = repmat(scalar,1000,3000);          % Variant C: 17.250 msecs
data = scalar + zeros(1000,3000);         % Variant D: 17.168 msecs
data(1000,3000) = 0; data = data+scalar;  % Variant E: 16.334 msecs
</pre>
<p>As can be seen, Variants C-E are about twice as fast as Variant B, and 5 times faster than Variant A.</p>
<h3 id="non-double">Pre-allocating non-double data</h3>
<p>
When preallocating an array of a type that is not <i><b>double</b></i>, we should be careful to create it using the desired type, to prevent memory and/or performance inefficiencies. For example, if we need to process a large array of small integers (<i><b>int8</b></i>), it would be inefficient to preallocate an array of doubles and type-convert to/from int8 within every loop iteration. Similarly, it would be inefficient to preallocate the array as a double type and then convert it to int8. Instead, we should create the array as an int8 array in the first place:</p>
<pre lang='matlab'>
% Bad idea: allocates 8MB double array, then converts to 1MB int8 array
data = int8(zeros(1000,1000));   % 1M elements
   => Elapsed time is 0.008170 seconds.
% Better: directly allocate the array as a 1MB int8 array – x80 faster
data = zeros(1000,1000,'int8');
   => Elapsed time is 0.000095 seconds.
</pre>
<h3 id="cells">Pre-allocating cell arrays</h3>
<p>To preallocate a cell array we can use the <i><b>cell</b></i> function (explicit preallocation), or assign into the maximal cell index (implicit preallocation). Explicit preallocation is faster than implicit preallocation, but functionally equivalent (note that this is the opposite of the behavior seen above for numeric arrays and below for struct arrays):</p>
<pre lang='matlab'>
% Variant #1: Explicit preallocation of a 1Kx3K cell array
data = cell(1000,3000);
   => Elapsed time is 0.004637 seconds.
% Variant #2: Implicit preallocation – x3 slower than explicit
clear('data'), data{1000,3000} = [];
   => Elapsed time is 0.012873 seconds.
</pre>
<h3 id="structs">Pre-allocating arrays of structs</h3>
<p>To preallocate an array of structs or class objects, we can use the <i><b>repmat</b></i> function to replicate copies of a single data element (explicit preallocation), or just use the maximal data index (implicit preallocation). In this case, unlike the case of cell arrays, implicit preallocation is much faster than explicit preallocation, since the single element does not actually need to be copied multiple times (<a target="_blank" rel="nofollow" href="http://www.mathworks.com/support/solutions/en/data/1-7S1YKO/">ref</a>):</p>
<pre lang='matlab'>
% Variant #1: Explicit preallocation of a 100x300 struct array
element = struct('field1',magic(2), 'field2',{[]});
data = repmat(element, 100, 300);
   => Elapsed time is 0.002804 seconds.
% Variant #2: Implicit preallocation – x7 faster than explicit
element = struct('field1',magic(2), 'field2',{[]});
clear('data'), data(100,300) = element;
   => Elapsed time is 0.000429 seconds.
</pre>
<p>When preallocating structs, we can also use a third variant: the built-in behavior of the <i><b>struct</b></i> function, which replicates the struct when it is passed a cell array. For example, <code>struct('field1',cell(100,1), 'field2',5)</code> creates 100 structs, each having an empty field <i>field1</i> and another field <i>field2</i> whose value is 5. Unfortunately, this variant is slower than both of the previous variants.</p>
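<p>For completeness, this third variant looks as follows (using a 100&#215;300 cell array to match the dimensions of the previous variants; the size of the cell array determines the size of the resulting struct array):</p>
<pre lang='matlab'>
% Variant #3: replication via a cell-array input to the struct function
% (each cell element becomes the field1 value of a separate struct element)
data = struct('field1',cell(100,300), 'field2',5);
</pre>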
<h3 id="objects">Pre-allocating class objects</h3>
<p>When preallocating in general, ensure that you are using the maximal expected array size. There is no point in preallocating an empty array or an array smaller than the expected maximum, since dynamic memory reallocation will automatically kick in within the processing loop. For this reason, <a target="_blank" rel="nofollow" href="http://stackoverflow.com/questions/2510427/how-to-preallocate-an-array-of-class-in-matlab">do not use</a> the <i>empty()</i> method of class objects to preallocate, but rather <i><b>repmat</b></i> as explained above.<br />
When using <i><b>repmat</b></i> to replicate class objects, always note whether you are replicating the object itself (this happens if your class does NOT derive from <i><b>handle</b></i>) or its reference handle (which happens if it does derive from <i><b>handle</b></i>). If you are replicating objects, then you can safely edit any of their properties independently of each other; but if you are replicating references, you merely get multiple copies of the same reference, so modifying the object referenced by element #1 automatically affects all the other elements. This may or may not suit your particular program's requirements, so check carefully. If you actually need independent object copies, you will <a target="_blank" rel="nofollow" href="http://stackoverflow.com/questions/591495/matlab-preallocate-a-non-numeric-vector#591788">need to call</a> the class constructor multiple times, once for each new independent object.</p>
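<p>The difference can be illustrated with a simple hypothetical handle class (the class name and property below are just for illustration):</p>
<pre lang='matlab'>
% Hypothetical minimal handle class, defined in MyHandleClass.m:
%    classdef MyHandleClass &lt; handle
%        properties
%            value = 0;
%        end
%    end

objs = repmat(MyHandleClass(), 1, 100);  % 100 copies of the SAME handle
objs(1).value = 5;
objs(2).value    % => 5: all array elements reference the same object

% For independent objects, call the constructor separately for each element
% (looping backward implicitly preallocates the full array on iteration #1):
for idx = 100 : -1 : 1
    objs2(idx) = MyHandleClass();
end
objs2(1).value = 5;
objs2(2).value   % => 0: independent objects
</pre>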
<p />
Next week: what if we can&#8217;t avoid dynamic array resizing? &#8211; apparently, all is not lost. Stay tuned&#8230;<br />
<i><br />
Do you have any similar allocation-related tricks you&#8217;re using? or unexpected differences such as the ones shown above? If so, then please do <a href="/articles/preallocation-performance/#respond">post a comment</a>.<br />
</i> </p>
<p>The post <a rel="nofollow" href="https://undocumentedmatlab.com/articles/preallocation-performance">Preallocation performance</a> appeared first on <a rel="nofollow" href="https://undocumentedmatlab.com">Undocumented Matlab</a>.</p>
<div class='yarpp-related-rss'>
<h3>Related posts:</h3><ol>
<li><a href="https://undocumentedmatlab.com/articles/zero-testing-performance" rel="bookmark" title="Zero-testing performance">Zero-testing performance </a> <small>Subtle changes in the way that we test for zero/non-zero entries in Matlab can have a significant performance impact. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/performance-scatter-vs-line" rel="bookmark" title="Performance: scatter vs. line">Performance: scatter vs. line </a> <small>In many circumstances, the line function can generate visually-identical plots as the scatter function, much faster...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/array-resizing-performance" rel="bookmark" title="Array resizing performance">Array resizing performance </a> <small>Several alternatives are explored for dynamic array growth performance in Matlab loops. ...</small></li>
<li><a href="https://undocumentedmatlab.com/articles/allocation-performance-take-2" rel="bookmark" title="Allocation performance take 2">Allocation performance take 2 </a> <small>The clear function has some non-trivial effects on Matlab performance. ...</small></li>
</ol>
</div>
]]></content:encoded>
					
					<wfw:commentRss>https://undocumentedmatlab.com/articles/preallocation-performance/feed</wfw:commentRss>
			<slash:comments>30</slash:comments>
		
		
			</item>
	</channel>
</rss>
