Array preallocation is a standard and quite well-known technique for improving Matlab loop run-time performance. Today’s article will show that there is more than meets the eye for even such a simple coding technique.
A note of caution: in the examples that follow, don’t take any speedup as an expected actual value – the actual value may well be different on your system. Your mileage may vary. I only mean to display the relative differences between different alternatives.
The underlying problem
Memory management has a direct influence on performance. I have already shown some examples of this in past articles here.
Preallocation solves a basic problem in simple program loops, where an array is iteratively enlarged with new data (dynamic array growth). Unlike other programming languages (such as C, C++, C# or Java) that use static typing, Matlab uses dynamic typing. This means that it is natural and easy to modify array size dynamically during program execution. For example:
fibonacci = [0, 1]; for idx = 3 : 100 fibonacci(idx) = fibonacci(idx-1) + fibonacci(idx-2); end |
While this may be simple to program, it is not wise with regards to performance. The reason is that whenever an array is resized (typically enlarged), Matlab allocates an entirely new contiguous block of memory for the array, copying the old values from the previous block to the new, then releasing the old block for potential reuse. This operation takes time to execute. In some cases, this reallocation might require accessing virtual memory and page swaps, which would take an even longer time to execute. If the operation is done in a loop, then performance could quickly drop off a cliff.
The cost of such naïve array growth is theoretically quadratic. This means that multiplying the number of elements by N multiplies the execution time by about N2. The reason for this is that Matlab needs to reallocate N times more than before, and each time takes N times longer due to the larger allocation size (the average block size multiplies by N), and N times more data elements to copy from the old to the new memory blocks.
A very interesting discussion of this phenomenon and various solutions can be found in a newsgroup thread from 2005. Three main solutions were presented: preallocation, selective dynamic growth (allocating headroom) and using cell arrays. The best solution among these in terms of ease of use and performance is preallocation.
The basics of pre-allocation
The basic idea of preallocation is to create a data array in the final expected size before actually starting the processing loop. This saves any reallocations within the loop, since all the data array elements are already available and can be accessed. This solution is useful when the final size is known in advance, as the following snippet illustrates:
% Regular dynamic array growth: tic fibonacci = [0,1]; for idx = 3 : 40000 fibonacci(idx) = fibonacci(idx-1) + fibonacci(idx-2); end toc => Elapsed time is 0.019954 seconds. % Now use preallocation – 5 times faster than dynamic array growth: tic fibonacci = zeros(40000,1); fibonacci(1)=0; fibonacci(2)=1; for idx = 3 : 40000, fibonacci(idx) = fibonacci(idx-1) + fibonacci(idx-2); end toc => Elapsed time is 0.004132 seconds. |
On pre-R2011a releases the effect of preallocation is even more pronounced: I got a 35-times speedup on the same machine using Matlab 7.1 (R14 SP3). R2011a (Matlab 7.12) had a dramatic performance boost for such cases in the internal accelerator, so newer releases are much faster in dynamic allocations, but preallocation is still 5 times faster even on R2011a.
Non-deterministic pre-allocation
Because the effect of preallocation is so dramatic on all Matlab releases, it makes sense to utilize it even in cases where the data array’s final size is not known in advance. We can do this by estimating an upper bound to the array’s size, preallocate this large size, and when we’re done remove any excess elements:
% The final array size is unknown – assume 1Kx3K upper bound (~23MB) data = zeros(1000,3000); % estimated maximal size numRows = 0; numCols = 0; while (someCondition) colIdx = someValue1; numCols = max(numCols,colIdx); rowIdx = someValue2; numRows = max(numRows,rowIdx); data(rowIdx,colIdx) = someOtherValue; end % Now remove any excess elements data(:,numCols+1:end) = []; % remove excess columns data(numRows+1:end,:) = []; % remove excess rows |
Variants for pre-allocation
It turns out that MathWorks’ official suggestion for preallocation, namely using the zeros function, is not the most efficient:
% MathWorks suggested variant clear data1, tic, data1 = zeros(1000,3000); toc => Elapsed time is 0.016907 seconds. % A much faster alternative - 500 times faster! clear data1, tic, data1(1000,3000) = 0; toc => Elapsed time is 0.000034 seconds. |
The reason for the second variant being so much faster is because it only allocates the memory, without worrying about the internal values (they get a default of 0, false or ”, in case you wondered). On the other hand, zeros has to place a value in each of the allocated locations, which takes precious time.
In most cases the differences are immaterial since the preallocation code would only run once in the program, and an extra 17ms isn’t such a big deal. But in some cases we may have a need to periodically refresh our data, where the extra run-time could quickly accumulate.
Update (October 27, 2015): As Marshall notes below, this behavior changed in R2015b when the new LXE (Matlab’s new execution engine) replaced the previous engine. In R2015b, the zeros function is faster than the alternative of just setting the last array element to 0. Similar changes may also have occurred to the following post content, so if you are using R2015b onward, be sure to test carefully on your specific system.
Pre-allocating non-default values
When we need to preallocate a specific value into every data array element, we cannot use Variant #2. The reason is that Variant #2 only sets the very last data element, and all other array elements get assigned the default value (0, ‘’ or false, depending on the array’s data type). In this case, we can use one of the following alternatives (with their associated timings for a 1000×3000 data array):
scalar = pi; % for example... data = scalar(ones(1000,3000)); % Variant A: 87.680 msecs data(1:1000,1:3000) = scalar; % Variant B: 28.646 msecs data = repmat(scalar,1000,3000); % Variant C: 17.250 msecs data = scalar + zeros(1000,3000); % Variant D: 17.168 msecs data(1000,3000) = 0; data = data+scalar; % Variant E: 16.334 msecs |
As can be seen, Variants C-E are about twice as fast as Variant B, and 5 times faster than Variant A.
Pre-allocating non-double data
7.4.5 Preallocating non-double data
When preallocating an array of a type that is not double, we should be careful to create it using the desired type, to prevent memory and/or performance inefficiencies. For example, if we need to process a large array of small integers (int8), it would be inefficient to preallocate an array of doubles and type-convert to/from int8 within every loop iteration. Similarly, it would be inefficient to preallocate the array as a double type and then convert it to int8. Instead, we should create the array as an int8 array in the first place:
% Bad idea: allocates 8MB double array, then converts to 1MB int8 array data = int8(zeros(1000,1000)); % 1M elements => Elapsed time is 0.008170 seconds. % Better: directly allocate the array as a 1MB int8 array – x80 faster data = zeros(1000,1000,'int8'); => Elapsed time is 0.000095 seconds. |
Pre-allocating cell arrays
To preallocate a cell-array we can use the cell function (explicit preallocation), or the maximal cell index (implicit preallocation). Explicit preallocation is faster than implicit preallocation, but functionally equivalent (Note: this is contrary to the experience with allocation of numeric arrays and other arrays):
% Variant #1: Explicit preallocation of a 1Kx3K cell array data = cell(1000,3000); => Elapsed time is 0.004637 seconds. % Variant #2: Implicit preallocation – x3 slower than explicit clear('data'), data{1000,3000} = []; => Elapsed time is 0.012873 seconds. |
Pre-allocating arrays of structs
To preallocate an array of structs or class objects, we can use the repmat function to replicate copies of a single data element (explicit preallocation), or just use the maximal data index (implicit preallocation). In this case, unlike the case of cell arrays, implicit preallocation is much faster than explicit preallocation, since the single element does not actually need to be copied multiple times (ref):
% Variant #1: Explicit preallocation of a 100x300 struct array element = struct('field1',magic(2), 'field2',{[]}); data = repmat(element, 100, 300); => Elapsed time is 0.002804 seconds. % Variant #2: Implicit preallocation – x7 faster than explicit element = struct('field1',magic(2), 'field2',{[]}); clear('data'), data(100,300) = element; => Elapsed time is 0.000429 seconds. |
When preallocating structs, we can also use a third variant, using the built-in struct feature of replicating the struct when the struct function is passed a cell array. For example, struct('field1',cell(100,1), 'field2',5)
will create 100 structs, each of them having the empty field field1 and another field called field2 with value 5. Unfortunately, this variant is slower than both of the previous variants.
Pre-allocating class objects
When preallocating in general, ensure that you are using the maximal expected array size. There is no point in preallocating an empty array or an array having a smaller size than the expected maximum, since dynamic memory reallocation will automatically kick-in within the processing-loop. For this reason, do not use the empty() method of class objects to preallocate, but rather repmat as explained above.
When using repmat to replicate class objects, always be careful to note whether you are replicating the object itself (this happens if your class does NOT derive from handle) or its reference handle (which happens if you derive the class from handle). If you are replicating objects, then you can safely edit any of their properties independently of each other; but if you replicate references, you are merely using multiple copies of the same reference, so that modifying referenced object #1 will also automatically affect all the other referenced objects. This may or may not be suitable for your particular program requirements, so be careful to check carefully. If you actually need to use independent object copies, you will need to call the class constructor multiple times, once for each new independent object.
Do you have any similar allocation-related tricks you’re using? or unexpected differences such as the ones shown above? If so, then please do post a comment.
It turns out that MathWorks’ official suggestion for preallocation, namely using the zeros function, is not the most efficient:
Doesn’t work for me (Matlab 2011a on Mac OS 10.6.8):
For “Pre-allocating non-default values”, my timings mostly agree, except that C and D are the other way round:
A 0.076208 seconds.
B 0.010146 seconds.
C 0.024495 seconds.
D 0.014459 seconds.
E 0.012212 seconds.
The other examples all worked the same for me as in the post.
This is strange! Can you explain why the 2nd variant is only 100 faster the first time around? Only after a ‘clear all’ you get a fast execution again.
@Chris & Richard: perhaps this is platform or situation dependent. On my Windows XP running both R2010b and R2012a,
data1=zeros(1000,3000)
is consistently hundreds of times slower thandata1(1000,3000)=0
.Perhaps you are running this on a “warm” Matlab session that is running on fragmented memory? Try running on a new Matlab session that does not clash for system memory with other heavy processes (e.g. a multi-tab browser).
greetings!
“@Chris & Richard: perhaps this is platform or situation dependent. On my Windows XP running both R2010b and R2012a, data1=zeros(1000,3000) is consistently hundreds of times faster than data1(1000,3000)=0.”
now I’m confused: did you mean slower, by any chance?
cheers,
pedro
@Pedro – you’re right, it was a typo (now corrected) – thanks 🙂
I can recreate Yair’s differences between
data1 = zeros(1000,3000);
anddata2(1000,3000) = 0;
on 32-bit Windows XP 32 with R2010a, but not on 64-bit Linux with R2011a. So it looks like it might be OS or 32/64 bit dependent.But I wanted to explore a little more so I decided to up the size of the matrix so it would take more time to compute in Linux
It appears it is faster to preallocate a 1e4 x 1e4 (800MB) matrix than a 1e3 x 1e3 (8MB) matrix.
Of course it’s good to profile first, then use this technique if warranted. Otherwise, write clear, simple code that’s easy to reason about. I usually shut off this particular Code Analyzer warning because it’s almost never been an important bottleneck for me. YMMV
I run on windows 7, Matlab R2010a. My Matlab has been doing things for 2 days and was release from heavy duty 5 minutes ago. This are my results:
Dear Yair,
in your nice post you used the code
I’ve never seen this way of using a scalar, mapping it to an array. Is there any passage in the documentation you could cite or explanation why and how precisely this works?
Alexander
@Alexander – using indexing to repeat array elements is such a standard technique for me that I didn’t even think twice about the fact that it may not be well known… This technique is used numerous times in the Matlab code corpus, although now that you asked I couldn’t find a direct doc reference for it (there probably is some reference, I just couldn’t find it).
Here is an unofficial article that explains the use.
Dear Yair,
thanks for the link. Indeed I’ve never seen this way of indexing befor and I’m sure I’ll use it in many places.
The main step to understand seems to be to think of scalars being implemented as arrays with a singleton dimension. Then everything becomes clear. Thinks could be so easy …
Thanks
[…] Since I discussed related topics in the past two weeks (preallocation performance, array resizing performance), these internal optimizations seemed a natural topic for today […]
[…] from UndocumentedMatlab recently introduced a very nice trick to boost this even further. It turns out that most of the time spent in the function zeros is […]
For “Variants for pre-allocation”:
Using zeros is about as fast for me, as long as you specify the class, e.g.:
is comparable on my machine to setting z(10000,10000) = 0
(Elapsed time is 0.000030 seconds.)
Thanks for putting this information up. To comment on the timings, I think they are too small to be measured correctly this way. You should run these commands in a loop for a couple thousand times to get timings into seconds, otherwise they will be influenced by context switching of your operating system.
I doubt it matters if you use the zeroes command or direct indexing. As you say, the array gets initialized with zeroes either way (FALSE or ” are zeroes as well). Using different data types might cause a smaller number of bytes to be allocated and a smaller initialization time.
Dear Yair,
Thanks for this fascinating post.
Incidentally, after noticing that empty matrix multiplication also produces a matrix zeros:
I’ve discovered that this is another way to improve pre-allocation performance. For example,
vs:
That improvement is array size dependent and only happens above some matrix size, more info can be seen in a StackOverflow post I’ve recently posted.
@Nathan – Thanks for the update. I think that
a(1000,1000)=0;
is simpler to use. IMHO, the potential minor extra saving by the product trick is not worth the extra complexity. Still, I admit it’s a very neat trick.[…] I used my finding that zeros(n,m)+scalar is often faster than ones(n,m)*scalar (see my article on pre-allocation performance). Applying this to Mike’s powers-vector example, zeros(n,m)+scalar only gave me a 25% […]
[…] (or demand-driven) evaluation.For additional aspects of preallocation performance, refer to my article from last year, which discusses class objects only briefly, but expands on the other data types.London training […]
Hi, thanks for your post, very useful. I tried some stuff, and I found this :
From data(1,1) to data(2346,2346), time is increasing linearly, and when reaching 2347×2347, execution time falls : 100 time faster. I’m wondering why is 2346 a “limit” … Weird !
Thanks again for your posts 🙂
Read this related post: http://undocumentedmatlab.com/blog/allocation-performance-take-2
[…] doing some research, I’ve found this article in “Undocumented Matlab”, in which Mr. Yair Altman had already come to the conclusion […]
[…] both these approaches use the same hacky technique to pre-allocate as listed in Undocumented MATLAB and also listed in the other answer by @rayryeng. On top of it, it uses a raw version of […]
Hi Yair,
I’m wondering if Mathworks have changed zeros() in the more recent revisions. I’m currently seeing zeros() as being significantly faster than the end-referencing technique, for example:
@Marshall – indeed: the behavior changed in R2015b when the new LXE (Matlab’s new execution engine) replaced the previous engine.
I can replicate this version difference since 2015b. The following observation may be interesting.
If I open Windows Task Manager or Linux System Monitor to watch for memory usage, whenever the allocation (Matlab versions and zeros /end-referencing combinations) is slower, it involves increased memory usage, while the faster one does not. This suggests the faster one is using pre-allocated memory when Matlab starts. Any idea to take advantage of this? Thanks.
I just upgraded from R2015a to R2015b and a MATLAB script I have that used to consume about 2.1 GB now consumes 4.6 GB of memory. That’s a big problem for me because I used to be able to run three copies of MATLAB in parallel with different parameter settings and now I can only run one on my 8 GB, 4 processor machine. You mentioned a new LXE above, and I’m wondering if that is the culprit and if anything can be done about it. If it’s faster than the old version, that’s great, but not if it uses way more memory.
@Darrell, thank you for bringing this to our attention!
Would it be possible for us to take a look at your code?
Would you be open to connecting with us offline to discuss your use case? Feel free to email Anoush dot Najarian at mathworks dot com.
[…] Pre-allocation is usually recommended as a fix: that is, size the entire array in advance, before assigning any elements. But what if you don’t know the size of the array in advance? Although there are several approaches to dealing with this situation, with varying complexity, the subject of this post is to describe just how much execution time may be eliminated by only a modest change to the above code. […]
[…] Preallocation performance […]