As I have explained last week, the best way to avoid the performance penalties associated with dynamic array resizing (typically, growth) in Matlab is to pre-allocate the array to its expected final size. I have shown different alternatives for such preallocation, but in all cases the performance is much better than using a naïve dynamic resize.
Unfortunately, such simple preallocation is not always possible. Apparently, all is not lost. There are still a few things we can do to mitigate the performance pain. As in last week, there is much more here than meets the eye at first sight.
The interesting newsgroup thread from 2005 about this issue that I mentioned last week contains two main solutions to this problem. The effects of these solutions is negligible for small data sizes and/or loop iterations (i.e., number of memory reallocations), but could be dramatic for large data arrays and/or a large number of memory reallocations. The difference could well mean the difference between a usable and an unusable (“hang”) program:
Factor growth: dynamic allocation by chunks
The idea is to dynamically grow the array by a certain percentage factor each time. When the array first needs to grow by a single element, we would in fact grow it by a larger chunk (say 40% of the current array size, for example by using the repmat function, or by concatenating a specified number of zeros, or by setting some way-forward index to 0), so that it would take the program some time before it needs to reallocate memory.
This method has a theoretical cost of N·log(N), which is nearly linear in N for most practical purposes. It is similar to preallocation in the sense that we are preparing a chunk of memory for future array use in advance. You might say that this is on-the-fly preallocation.
Using cell arrays
The idea here is to use cell arrays to store and grow the data, then use cell2mat to convert the resulting cell array to a regular numeric array. Cell elements are implemented as references to distinct memory blocks, so concatenating an object to a cell array merely concatenates its reference; when a cell array is reallocated, only its internal references (not the referenced data) are moved. Note that this relies on the internal implementation of cell arrays in Matlab, and may possibly change in some future release.
Like factor growth, using cell arrays is faster than quadratic behavior (although not quite as fast enough as we would have liked, of course). Different situations may favor using either the cell arrays method or the factor growth mechanism.
The growdata utility
John D’Errico has posted a well-researched utility called growdata that optimizes dynamic array growth for maximal performance. It is based in part on ideas mentioned in the aforementioned 2005 newsgroup thread, where growdata is also discussed in detail.
As an interesting side-note, John D’Errico also recently posted an extremely fast implementation of the Fibonacci function. The source code may seem complex, but the resulting performance gain is well worth the extra complexity. I believe that readers who will read this utility’s source code and understand its underlying logic will gain insight into several performance tricks that could be very useful in general.
Effects of incremental JIT improvements
The introduction of JIT Acceleration in Matlab 6.5 (R13) caused a dramatic boost in performance (there is an internal distinction between the Accelerator and JIT: JIT is apparently only part of the Matlab Accelerator, but this distinction appears to have no practical impact on the discussion here).
Over the years, MathWorks has consistently improved the efficiency of its computational engine and the JIT Accelerator in particular. JIT was consistently improved since that release, giving a small improvement with each new Matlab release. In Matlab 7.11 (R2010b), the short Fibonacci snippet used in last week’s article showed executed about 30% faster compared to Matlab 7.1 R14SP3. The behavior was still quadratic in nature, and so in these releases, using any of the above-mentioned solutions could prove very beneficial.
In Matlab 7.12 (R2011a), a major improvement was done in the Matlab engine (JIT?). The execution run-times improved significantly, and in addition have become linear in nature. This means that multiplying the array size by N only degrades performance by N, not N2 – an impressive achievement:
% This is ran on Matlab 7.14 (R2012a): clear f, tic, f=[0,1]; for idx=3:10000, f(idx)=f(idx-1)+f(idx-2); end, toc => Elapsed time is 0.004924 seconds. % baseline loop size, & exec time clear f, tic, f=[0,1]; for idx=3:20000, f(idx)=f(idx-1)+f(idx-2); end, toc => Elapsed time is 0.009971 seconds. % x2 loop size, x2 execution time clear f, tic, f=[0,1]; for idx=3:40000, f(idx)=f(idx-1)+f(idx-2); end, toc => Elapsed time is 0.019282 seconds. % x4 loop size, x4 execution time |
In fact, it turns out that using either the cell arrays method or the factor growth mechanism is much slower in R2011a than using the naïve dynamic growth!
This teaches us a very important lesson: It is not wise to program against a specific implementation of the engine, at least not in the long run. While this may yield performance benefits on some Matlab releases, the situation may well be reversed on some future release. This might force us to retest, reprofile and potentially rewrite significant portions of code for each new release. Obviously this is not a maintainable solution. In practice, most code that is written on some old Matlab release would likely we carried over with minimal changes to the newer releases. If this code has release-specific tuning, we could be shooting ourselves in the leg in the long run.
MathWorks strongly advises (and again, and once again), and I concur, to program in a natural manner, rather than in a way that is tailored to a particular Matlab release (unless of course we can be certain that we shall only be using that release and none other). This will improve development time, maintainability and in the long run also performance.
(and of course you could say that a corollary lesson is to hurry up and get the latest Matlab release…)
Variants for array growth
If we are left with using a naïve dynamic resize, there are several equivalent alternatives for doing this, having significantly different performances:
% This is ran on Matlab 7.12 (R2011a): % Variant #1: direct assignment into a specific out-of-bounds index data=[]; tic, for idx=1:100000; data(idx)=1; end, toc => Elapsed time is 0.075440 seconds. % Variant #2: direct assignment into an index just outside the bounds data=[]; tic, for idx=1:100000; data(end+1)=1; end, toc => Elapsed time is 0.241466 seconds. % 3 times slower % Variant #3: concatenating a new value to the array data=[]; tic, for idx=1:100000; data=[data,1]; end, toc => Elapsed time is 22.897688 seconds. % 300 times slower!!! |
As can be seen, it is much faster to directly index an out-of-bounds element as a means to force Matlab to enlarge a data array, rather than using the end+1 notation, which needs to recalculate the value of end each time.
In any case, try to avoid using the concatenation variant, which is significantly slower than either of the other two alternatives (300 times slower in the above example!). In this respect, there is no discernible difference between using the [] operator or the cat() function for the concatenation.
Apparently, the JIT performance boost gained in Matlab R2011a does not work for concatenation. Future JIT improvements may possibly also improve the performance of concatenations, but in the meantime it is better to use direct indexing instead.
The effect of the JIT performance boost is easily seen when we run the same variants on pre-R2011a Matlab releases. The corresponding values are 30.9, 34.8 and 34.3 seconds. Using direct indexing is still the fastest approach, but concatenation is now only 10% slower, not 300 times slower.
When we need to append a non-scalar element (for example, a 2D matrix) to the end of an array, we might think that we have no choice but to use the slow concatenation method. This assumption is incorrect: we can still use the much faster direct-indexing method, as shown below (notice the non-linear growth in execution time for the concatenation variant):
% This is ran on Matlab 7.12 (R2011a): matrix = magic(3); % Variant #1: direct assignment – fast and linear cost data=[]; tic, for idx=1:10000; data(:,(idx*3-2):(idx*3))=matrix; end, toc => Elapsed time is 0.969262 seconds. data=[]; tic, for idx=1:100000; data(:,(idx*3-2):(idx*3))=matrix; end, toc => Elapsed time is 9.558555 seconds. % Variant #2: concatenation – much slower, quadratic cost data=[]; tic, for idx=1:10000; data=[data,matrix]; end, toc => Elapsed time is 2.666223 seconds. data=[]; tic, for idx=1:100000; data=[data,matrix]; end, toc => Elapsed time is 356.567582 seconds. |
As the size of the array enlargement element (in this case, a 3×3 matrix) increases, the computer needs to allocate more memory space more frequently, thereby increasing execution time and the importance of preallocation. Even if the system has an internal memory-management mechanism that enables it to expand into adjacent (contiguous) empty memory space, as the size of the enlargement grows the empty space will run out sooner and a new larger memory block will need to be allocated more frequently than in the case of small incremental enlargements of a single 8-byte double.
Other alternatives
If preallocation is not possible, JIT is not very helpful, vectorization is out of the question, and rewriting the problem so that it doesn’t need dynamic array growth is impossible – if all these are not an option, then consider using one of the following alternatives for array growth (read again the interesting newsgroup thread from 2005 about this issue):
- Dynamically grow the array by a certain percentage factor each time the array runs out of space (on-the-fly preallocation)
- Use John D’Errico’s growdata utility
- Use cell arrays to store and grow the data, then use cell2mat to convert the resulting cell array to a regular numeric array
- Reuse an existing data array that has the necessary storage space
- Wrap the data in a referential object (a class object that inherits from handle), then append the reference handle rather than the original data (ref). Note that if your class object does not inherit from handle, it is not a referential object but rather a value object, and as such it will be appended in its entirety to the array data, losing any performance benefits. Of course, it may not always be possible to wrap our class objects as a handle.
References have a much small memory footprint than the objects that they reference. The objects themselves will remain somewhere in memory and will not need to be moved whenever the data array is enlarged and reallocated – only the small-footprint reference will be moved, which is much faster. This is also the reason that cell concatenation is faster than array concatenations for large objects.
Very interesting post! I wonder though how I will ever remember those issues that only arise so infrequently. I bet by the next time I need to grow an array dynamically I will have forgotten all about this. But maybe the Mathworks JIT team will remember.
The JIT discussion reminds me of a very important question I have had for many years regarding Matlab’s function “warm-up” phenomenon. Everyone knows, I’m sure, that the first time you run a function in Matlab it takes a lot longer than the next run through. There are some obvious reasons for this and quite likely other not obvious reasons. But we all know to “warm up” a function before benchmarking it. The question I have not been able to answer is do we need to worry about warm-up before our production runs on large datasets too. In other words, is the first-run slow-down simply constant overhead or does it grow with problem size is some way? If so, what constitutes a “first” run? First time in a Matlab session? After a
clear all
command? Something else?I guess it’s not very directly related to this post, but I thought Yair might know something about this.
Thanks,
-n
@Naor – sorry, I don’t know for sure. I assume that at least part of the warm-up time is indeed related to the data size, because the data needs to be allocated in memory and then loaded onto the CPU cache. Once it’s in the cache, data access times will be much faster.
Other warm-up factors include function compilation time (which is more-or-less constant), disk caching for the relevant m-files, and Matlab engine being swapped-in from virtual memory and other similar aspects that are dynamic in nature. If there’s any I/O involved in the function, then I/O caching may also come into play here.
Former MathWorker Bill McKeeman says in his Performance Tuning document that he usually throws the initial 3 measurements away when profiling – read his document, it’s illuminating.
@Yair, thanks for the link.
I think there is a small but important difference between what employees of TMW advise and what TMW advises. In general the employees of TMW are clear when they are and are not speaking for TMW. I wish TMW would make more “statements” since employees are obviously very conservative about not revealing IP.
Anyone offering a benchmark suite to check the different gains of all mentioned methods? As explained the actual benefit depends on the Matlab version and I have to use Matlab 2007b …
I tried some on my own and it seems there are differences to the published results.
(I have to use 2007b because the symbolic toolbox changed and I only get errors now with newer versions 🙁 ).
Hello Yair,
I was reading through your posts about memory management and pre-allocation of arrays and found them very interesting. Currently I am stuck in a very peculiar situation. My input data file is a .mat file of size 52 MB. I load the file in MATLAB using the load command. I scan through the file and extract my data points out of this. I get 11000 rows and each row has 4090 columns. I preallocate a large array using zeros command. I know my data type is double and it uses 8 bytes. Total bytes used for above array is 200 MB. For further processing of my data I have to allocate 2 more arrays of the same previous rows and columns.
So while doing this I run out of memory how can I free up the previous memory allocated and how can I solve this problem.
@Ricky – 11K x 4090 x 8 = ~343 MB (not 200 MB as you said). And this memory needs to be contiguous, since its a regular numeric matrix. Even if you have this much memory available, it is quite possible that the largest contiguous block of memory is smaller than 343 MB. If you need to do this for 2 other arrays of the same size, you only make things worse…