Preallocation performance

May 16, 2012 30 Comments

Array preallocation is a standard and quite well-known technique for improving Matlab loop run-time performance. Today’s article will show that there is more than meets the eye for even such a simple coding technique.
A note of caution: in the examples that follow, don’t take any speedup as an expected actual value – the actual value may well be different on your system. Your mileage may vary. I only mean to display the relative differences between different alternatives.

The underlying problem

Memory management has a direct influence on performance. I have already shown some examples of this in past articles here.
Preallocation solves a basic problem in simple program loops, where an array is iteratively enlarged with new data (dynamic array growth). Unlike other programming languages (such as C, C++, C# or Java) that use static typing, Matlab uses dynamic typing. This means that it is natural and easy to modify array size dynamically during program execution. For example:

fibonacci = [0, 1];
for idx = 3 : 100
   fibonacci(idx) = fibonacci(idx-1) + fibonacci(idx-2);
end

While this may be simple to program, it is not wise with regards to performance. The reason is that whenever an array is resized (typically enlarged), Matlab allocates an entirely new contiguous block of memory for the array, copying the old values from the previous block to the new, then releasing the old block for potential reuse. This operation takes time to execute. In some cases, this reallocation might require accessing virtual memory and page swaps, which would take an even longer time to execute. If the operation is done in a loop, then performance could quickly drop off a cliff.
The cost of such naïve array growth is theoretically quadratic. This means that multiplying the number of elements by N multiplies the execution time by about N². The reason for this is that Matlab needs to reallocate N times more than before, and each time takes N times longer due to the larger allocation size (the average block size multiplies by N), and N times more data elements to copy from the old to the new memory blocks.
A very interesting discussion of this phenomenon and various solutions can be found in a newsgroup thread from 2005. Three main solutions were presented: preallocation, selective dynamic growth (allocating headroom) and using cell arrays. The best solution among these in terms of ease of use and performance is preallocation.

The basics of pre-allocation

The basic idea of preallocation is to create a data array in the final expected size before actually starting the processing loop. This saves any reallocations within the loop, since all the data array elements are already available and can be accessed. This solution is useful when the final size is known in advance, as the following snippet illustrates:

% Regular dynamic array growth:
tic
fibonacci = [0,1];
for idx = 3 : 40000
   fibonacci(idx) = fibonacci(idx-1) + fibonacci(idx-2);
end
toc
   => Elapsed time is 0.019954 seconds.
% Now use preallocation – 5 times faster than dynamic array growth:
tic
fibonacci = zeros(40000,1);
fibonacci(1)=0; fibonacci(2)=1;
for idx = 3 : 40000,
   fibonacci(idx) = fibonacci(idx-1) + fibonacci(idx-2);
end
toc
   => Elapsed time is 0.004132 seconds.

On pre-R2011a releases the effect of preallocation is even more pronounced: I got a 35-times speedup on the same machine using Matlab 7.1 (R14 SP3). R2011a (Matlab 7.12) had a dramatic performance boost for such cases in the internal accelerator, so newer releases are much faster in dynamic allocations, but preallocation is still 5 times faster even on R2011a.

Non-deterministic pre-allocation

Because the effect of preallocation is so dramatic on all Matlab releases, it makes sense to utilize it even in cases where the data array’s final size is not known in advance. We can do this by estimating an upper bound to the array’s size, preallocate this large size, and when we’re done remove any excess elements:

% The final array size is unknown – assume 1Kx3K upper bound (~23MB)
data = zeros(1000,3000);  % estimated maximal size
numRows = 0;
numCols = 0;
while (someCondition)
   colIdx = someValue1;   numCols = max(numCols,colIdx);
   rowIdx = someValue2;   numRows = max(numRows,rowIdx);
   data(rowIdx,colIdx) = someOtherValue;
end
% Now remove any excess elements
data(:,numCols+1:end) = [];   % remove excess columns
data(numRows+1:end,:) = [];   % remove excess rows

Variants for pre-allocation

It turns out that MathWorks’ official suggestion for preallocation, namely using the zeros function, is not the most efficient:

% MathWorks suggested variant
clear data1, tic, data1 = zeros(1000,3000); toc
   => Elapsed time is 0.016907 seconds.
% A much faster alternative - 500 times faster!
clear data1, tic, data1(1000,3000) = 0; toc
   => Elapsed time is 0.000034 seconds.

The reason for the second variant being so much faster is because it only allocates the memory, without worrying about the internal values (they get a default of 0, false or ”, in case you wondered). On the other hand, zeros has to place a value in each of the allocated locations, which takes precious time.
In most cases the differences are immaterial since the preallocation code would only run once in the program, and an extra 17ms isn’t such a big deal. But in some cases we may have a need to periodically refresh our data, where the extra run-time could quickly accumulate.
Update (October 27, 2015): As Marshall notes below, this behavior changed in R2015b when the new LXE (Matlab’s new execution engine) replaced the previous engine. In R2015b, the zeros function is faster than the alternative of just setting the last array element to 0. Similar changes may also have occurred to the following post content, so if you are using R2015b onward, be sure to test carefully on your specific system.

Pre-allocating non-default values

When we need to preallocate a specific value into every data array element, we cannot use Variant #2. The reason is that Variant #2 only sets the very last data element, and all other array elements get assigned the default value (0, ‘’ or false, depending on the array’s data type). In this case, we can use one of the following alternatives (with their associated timings for a 1000×3000 data array):

scalar = pi;  % for example...
data = scalar(ones(1000,3000));           % Variant A: 87.680 msecs
data(1:1000,1:3000) = scalar;             % Variant B: 28.646 msecs
data = repmat(scalar,1000,3000);          % Variant C: 17.250 msecs
data = scalar + zeros(1000,3000);         % Variant D: 17.168 msecs
data(1000,3000) = 0; data = data+scalar;  % Variant E: 16.334 msecs

As can be seen, Variants C-E are about twice as fast as Variant B, and 5 times faster than Variant A.

Pre-allocating non-double data

7.4.5 Preallocating non-double data
When preallocating an array of a type that is not double, we should be careful to create it using the desired type, to prevent memory and/or performance inefficiencies. For example, if we need to process a large array of small integers (int8), it would be inefficient to preallocate an array of doubles and type-convert to/from int8 within every loop iteration. Similarly, it would be inefficient to preallocate the array as a double type and then convert it to int8. Instead, we should create the array as an int8 array in the first place:

% Bad idea: allocates 8MB double array, then converts to 1MB int8 array
data = int8(zeros(1000,1000));   % 1M elements
   => Elapsed time is 0.008170 seconds.
% Better: directly allocate the array as a 1MB int8 array – x80 faster
data = zeros(1000,1000,'int8');
   => Elapsed time is 0.000095 seconds.

Pre-allocating cell arrays

To preallocate a cell-array we can use the cell function (explicit preallocation), or the maximal cell index (implicit preallocation). Explicit preallocation is faster than implicit preallocation, but functionally equivalent (Note: this is contrary to the experience with allocation of numeric arrays and other arrays):

% Variant #1: Explicit preallocation of a 1Kx3K cell array
data = cell(1000,3000);
   => Elapsed time is 0.004637 seconds.
% Variant #2: Implicit preallocation – x3 slower than explicit
clear('data'), data{1000,3000} = [];
   => Elapsed time is 0.012873 seconds.

Pre-allocating arrays of structs

To preallocate an array of structs or class objects, we can use the repmat function to replicate copies of a single data element (explicit preallocation), or just use the maximal data index (implicit preallocation). In this case, unlike the case of cell arrays, implicit preallocation is much faster than explicit preallocation, since the single element does not actually need to be copied multiple times (ref):

% Variant #1: Explicit preallocation of a 100x300 struct array
element = struct('field1',magic(2), 'field2',{[]});
data = repmat(element, 100, 300);
   => Elapsed time is 0.002804 seconds.
% Variant #2: Implicit preallocation – x7 faster than explicit
element = struct('field1',magic(2), 'field2',{[]});
clear('data'), data(100,300) = element;
   => Elapsed time is 0.000429 seconds.

When preallocating structs, we can also use a third variant, using the built-in struct feature of replicating the struct when the struct function is passed a cell array. For example, struct('field1',cell(100,1), 'field2',5) will create 100 structs, each of them having the empty field field1 and another field called field2 with value 5. Unfortunately, this variant is slower than both of the previous variants.

Pre-allocating class objects

When preallocating in general, ensure that you are using the maximal expected array size. There is no point in preallocating an empty array or an array having a smaller size than the expected maximum, since dynamic memory reallocation will automatically kick-in within the processing-loop. For this reason, do not use the empty() method of class objects to preallocate, but rather repmat as explained above.
When using repmat to replicate class objects, always be careful to note whether you are replicating the object itself (this happens if your class does NOT derive from handle) or its reference handle (which happens if you derive the class from handle). If you are replicating objects, then you can safely edit any of their properties independently of each other; but if you replicate references, you are merely using multiple copies of the same reference, so that modifying referenced object #1 will also automatically affect all the other referenced objects. This may or may not be suitable for your particular program requirements, so be careful to check carefully. If you actually need to use independent object copies, you will need to call the class constructor multiple times, once for each new independent object.

Next week: what if we can’t avoid dynamic array resizing? – apparently, all is not lost. Stay tuned…

Do you have any similar allocation-related tricks you’re using? or unexpected differences such as the ones shown above? If so, then please do post a comment.

30 Responses

Richard Kennaway May 16, 2012 at 05:53 Reply

It turns out that MathWorks’ official suggestion for preallocation, namely using the zeros function, is not the most efficient:

Doesn’t work for me (Matlab 2011a on Mac OS 10.6.8):

>> clear data1, tic, data1 = zeros(1000,3000); toc
Elapsed time is 0.005574 seconds.
>> clear data1, tic, data1(1000,3000) = 0; toc
Elapsed time is 0.005813 seconds.
>> clear data1, tic, data1 = zeros(1000,3000); toc
Elapsed time is 0.005598 seconds.
>> clear data1, tic, data1(1000,3000) = 0; toc
Elapsed time is 0.005608 seconds.

For “Pre-allocating non-default values”, my timings mostly agree, except that C and D are the other way round:

A 0.076208 seconds.
B 0.010146 seconds.
C 0.024495 seconds.
D 0.014459 seconds.
E 0.012212 seconds.

The other examples all worked the same for me as in the post.

Cris May 16, 2012 at 07:25 Reply

This is strange! Can you explain why the 2nd variant is only 100 faster the first time around? Only after a ‘clear all’ you get a fast execution again.

>> clear all
>> clear data1, tic, data1 = zeros(1000,3000); toc
Elapsed time is 0.020362 seconds.
>> clear data1, tic, data1 = zeros(1000,3000); toc
Elapsed time is 0.020228 seconds.
>> clear data1, tic, data1 = zeros(1000,3000); toc
Elapsed time is 0.018324 seconds.
>> clear data2, tic, data2(1000,3000) = 0; toc
Elapsed time is 0.000206 seconds.
>> clear data2, tic, data2(1000,3000) = 0; toc
Elapsed time is 0.022243 seconds.
>> clear data2, tic, data2(1000,3000) = 0; toc
Elapsed time is 0.019689 seconds.
>> clear data2, tic, data2(1000,3000) = 0; toc
Elapsed time is 0.020465 seconds.

Yair Altman May 16, 2012 at 09:39 Reply

@Chris & Richard: perhaps this is platform or situation dependent. On my Windows XP running both R2010b and R2012a, data1=zeros(1000,3000) is consistently hundreds of times slower than data1(1000,3000)=0.

Perhaps you are running this on a “warm” Matlab session that is running on fragmented memory? Try running on a new Matlab session that does not clash for system memory with other heavy processes (e.g. a multi-tab browser).

pedro May 24, 2012 at 02:07

greetings!
“@Chris & Richard: perhaps this is platform or situation dependent. On my Windows XP running both R2010b and R2012a, data1=zeros(1000,3000) is consistently hundreds of times faster than data1(1000,3000)=0.”
now I’m confused: did you mean slower, by any chance?
cheers,
pedro
Yair Altman May 24, 2012 at 03:10

@Pedro – you’re right, it was a typo (now corrected) – thanks 🙂

Daniel May 17, 2012 at 05:14 Reply

I can recreate Yair’s differences between data1 = zeros(1000,3000); and data2(1000,3000) = 0; on 32-bit Windows XP 32 with R2010a, but not on 64-bit Linux with R2011a. So it looks like it might be OS or 32/64 bit dependent.

But I wanted to explore a little more so I decided to up the size of the matrix so it would take more time to compute in Linux

>> clear data1, tic, data1(1e4,1e4) = 0; toc
Elapsed time is 0.000044 seconds.
>> clear data1, tic, data1(1e3,1e3) = 0; toc
Elapsed time is 0.005977 seconds.
>> clear data1, tic, data1(1e4,1e4) = 0; toc
Elapsed time is 0.000040 seconds.
>> clear data1, tic, data1(1e3,1e3) = 0; toc
Elapsed time is 0.006443 seconds.
>> clear data1, tic, data1(1e4,1e4) = 0; toc
Elapsed time is 0.000030 seconds.
>> clear data1, tic, data1(1e3,1e3) = 0; toc
Elapsed time is 0.006265 seconds.

It appears it is faster to preallocate a 1e4 x 1e4 (800MB) matrix than a 1e3 x 1e3 (8MB) matrix.

Marc May 16, 2012 at 14:03 Reply

Of course it’s good to profile first, then use this technique if warranted. Otherwise, write clear, simple code that’s easy to reason about. I usually shut off this particular Code Analyzer warning because it’s almost never been an important bottleneck for me. YMMV

Héctor May 18, 2012 at 06:56 Reply

I run on windows 7, Matlab R2010a. My Matlab has been doing things for 2 days and was release from heavy duty 5 minutes ago. This are my results:

>> clear data1, tic, data1 = zeros(1000,3000); toc
Elapsed time is 0.007353 seconds.
>> clear data1, tic, data1(1000,3000) = 0; toc
Elapsed time is 0.000027 seconds.
>> clear data1, tic, data1 = zeros(1000,3000); toc
Elapsed time is 0.007176 seconds.
>> clear data1, tic, data1(1000,3000) = 0; toc
Elapsed time is 0.000023 seconds.

Alexander May 20, 2012 at 23:44 Reply
Dear Yair,

in your nice post you used the code
scalar = pi; % for example... data = scalar(ones(1000,3000)); % Variant A: 87.680 msecs sum(a)
scalar = pi; % for example... data = scalar(ones(1000,3000)); % Variant A: 87.680 msecs sum(a)
I’ve never seen this way of using a scalar, mapping it to an array. Is there any passage in the documentation you could cite or explanation why and how precisely this works?

Alexander

Yair Altman May 21, 2012 at 00:20 Reply

@Alexander – using indexing to repeat array elements is such a standard technique for me that I didn’t even think twice about the fact that it may not be well known… This technique is used numerous times in the Matlab code corpus, although now that you asked I couldn’t find a direct doc reference for it (there probably is some reference, I just couldn’t find it).

Here is an unofficial article that explains the use.

Alexander May 21, 2012 at 02:59

Dear Yair,

thanks for the link. Indeed I’ve never seen this way of indexing befor and I’m sure I’ll use it in many places.

The main step to understand seems to be to think of scalars being implemented as arrays with a singleton dimension. Then everything becomes clear. Thinks could be so easy …

Thanks

Internal Matlab memory optimizations | Undocumented Matlab May 30, 2012 at 05:09 Reply

[…] Since I discussed related topics in the past two weeks (preallocation performance, array resizing performance), these internal optimizations seemed a natural topic for today […]
preallocation is not an option | Matlabtips.com June 25, 2012 at 22:29 Reply

[…] from UndocumentedMatlab recently introduced a very nice trick to boost this even further. It turns out that most of the time spent in the function zeros is […]
Troy December 7, 2012 at 08:19 Reply
For “Variants for pre-allocation”:

Using zeros is about as fast for me, as long as you specify the class, e.g.:
>> clear z; tic; z=zeros(10000,'double'); toc Elapsed time is 0.000034 seconds.
>> clear z; tic; z=zeros(10000,'double'); toc Elapsed time is 0.000034 seconds.
is comparable on my machine to setting z(10000,10000) = 0
(Elapsed time is 0.000030 seconds.)
Mike December 17, 2012 at 06:46 Reply

Thanks for putting this information up. To comment on the timings, I think they are too small to be measured correctly this way. You should run these commands in a loop for a couple thousand times to get timings into seconds, otherwise they will be influenced by context switching of your operating system.

I doubt it matters if you use the zeroes command or direct indexing. As you say, the array gets initialized with zeroes either way (FALSE or ” are zeroes as well). Using different data types might cause a smaller number of bytes to be allocated and a smaller initialization time.
natan January 7, 2013 at 13:35 Reply
Dear Yair,
Thanks for this fascinating post.
Incidentally, after noticing that empty matrix multiplication also produces a matrix zeros:
zeros(3,0)*zeros(0,3) ans = 0 0 0 0 0 0 0 0 0
zeros(3,0)*zeros(0,3) ans = 0 0 0 0 0 0 0 0 0
I’ve discovered that this is another way to improve pre-allocation performance. For example,
f=@() zeros(1000) timeit(f) ans = 0.0033
f=@() zeros(1000) timeit(f) ans = 0.0033
vs:
g=@() zeros(1000,0)*zeros(0,1000) timeit(g) ans = 9.2048e-06
g=@() zeros(1000,0)*zeros(0,1000) timeit(g) ans = 9.2048e-06
That improvement is array size dependent and only happens above some matrix size, more info can be seen in a StackOverflow post I’ve recently posted.

Yair Altman January 7, 2013 at 14:07 Reply

@Nathan – Thanks for the update. I think that a(1000,1000)=0; is simpler to use. IMHO, the potential minor extra saving by the product trick is not worth the extra complexity. Still, I admit it’s a very neat trick.

Allocation performance take 2 | Undocumented Matlab August 14, 2013 at 11:01 Reply

[…] I used my finding that zeros(n,m)+scalar is often faster than ones(n,m)*scalar (see my article on pre-allocation performance). Applying this to Mike’s powers-vector example, zeros(n,m)+scalar only gave me a 25% […]
Class object creation performance | Undocumented Matlab December 13, 2013 at 03:12 Reply

[…] (or demand-driven) evaluation.For additional aspects of preallocation performance, refer to my article from last year, which discusses class objects only briefly, but expands on the other data types.London training […]
hackndo May 4, 2015 at 08:15 Reply
Hi, thanks for your post, very useful. I tried some stuff, and I found this :
>> clear data, tic, data(2346,2346) = 0; toc Elapsed time is 0.007001 seconds. >> clear data, tic, data(2347,2347) = 0; toc Elapsed time is 0.000065 seconds.
>> clear data, tic, data(2346,2346) = 0; toc Elapsed time is 0.007001 seconds. >> clear data, tic, data(2347,2347) = 0; toc Elapsed time is 0.000065 seconds.
From data(1,1) to data(2346,2346), time is increasing linearly, and when reaching 2347×2347, execution time falls : 100 time faster. I’m wondering why is 2346 a “limit” … Weird !

Thanks again for your posts 🙂

Yair Altman May 9, 2015 at 21:57 Reply

Read this related post: http://undocumentedmatlab.com/blog/allocation-performance-take-2

Faster way to initialize arrays via empty matrix multiplication? (Matlab) - DexPage July 18, 2015 at 09:52 Reply

[…] doing some research, I’ve found this article in “Undocumented Matlab”, in which Mr. Yair Altman had already come to the conclusion […]
Pivot to binary matrix from categorial array - DexPage July 25, 2015 at 10:25 Reply

[…] both these approaches use the same hacky technique to pre-allocate as listed in Undocumented MATLAB and also listed in the other answer by @rayryeng. On top of it, it uses a raw version of […]
Marshall October 27, 2015 at 08:52 Reply
Hi Yair,

I’m wondering if Mathworks have changed zeros() in the more recent revisions. I’m currently seeing zeros() as being significantly faster than the end-referencing technique, for example:
>> clear x; tic; x(1e4,1e4)=0; toc; Elapsed time is 0.189022 seconds. >> clear x; tic; x=zeros(1e4,1e4); toc; Elapsed time is 0.000471 seconds.
>> clear x; tic; x(1e4,1e4)=0; toc; Elapsed time is 0.189022 seconds. >> clear x; tic; x=zeros(1e4,1e4); toc; Elapsed time is 0.000471 seconds.

Yair Altman October 27, 2015 at 09:45 Reply

@Marshall – indeed: the behavior changed in R2015b when the new LXE (Matlab’s new execution engine) replaced the previous engine.
Xiangrui Li November 1, 2016 at 16:44 Reply

I can replicate this version difference since 2015b. The following observation may be interesting.
If I open Windows Task Manager or Linux System Monitor to watch for memory usage, whenever the allocation (Matlab versions and zeros /end-referencing combinations) is slower, it involves increased memory usage, while the faster one does not. This suggests the faster one is using pre-allocated memory when Matlab starts. Any idea to take advantage of this? Thanks.

Darrell October 28, 2015 at 10:52 Reply

I just upgraded from R2015a to R2015b and a MATLAB script I have that used to consume about 2.1 GB now consumes 4.6 GB of memory. That’s a big problem for me because I used to be able to run three copies of MATLAB in parallel with different parameter settings and now I can only run one on my 8 GB, 4 processor machine. You mentioned a new LXE above, and I’m wondering if that is the culprit and if anything can be done about it. If it’s faster than the old version, that’s great, but not if it uses way more memory.

Anoush Najarian November 5, 2015 at 21:35 Reply

@Darrell, thank you for bringing this to our attention!

Would it be possible for us to take a look at your code?

Would you be open to connecting with us offline to discuss your use case? Feel free to email Anoush dot Najarian at mathworks dot com.

Array resizing in MATLAB | Possibly Wrong February 8, 2017 at 17:40 Reply

[…] Pre-allocation is usually recommended as a fix: that is, size the entire array in advance, before assigning any elements. But what if you don’t know the size of the array in advance? Although there are several approaches to dealing with this situation, with varying complexity, the subject of this post is to describe just how much execution time may be eliminated by only a modest change to the above code. […]
变量出现在每个循环迭代的大小改变-什么？ – CodingBlog May 21, 2017 at 18:00 Reply

[…] Preallocation performance […]

HTML tags such as <b> or <i> are accepted.
Wrap code fragments inside <pre lang="matlab"> tags, like this:

<pre lang="matlab">
a = magic(3);
disp(sum(a))
</pre>

I reserve the right to edit/delete comments (read the site policies).
Not all comments will be answered. You can always email me (altmany at gmail) for private consulting.

Click here to cancel reply.