Preallocation performance

Array preallocation is a standard and quite well-known technique for improving Matlab loop run-time performance. Today’s article will show that there is more than meets the eye for even such a simple coding technique.

A note of caution: in the examples that follow, don’t take any speedup as an expected actual value – the actual value may well be different on your system. Your mileage may vary. I only mean to display the relative differences between different alternatives.

The underlying problem

Memory management has a direct influence on performance. I have already shown some examples of this in past articles here.

Preallocation solves a basic problem in simple program loops, where an array is iteratively enlarged with new data (dynamic array growth). Unlike other programming languages (such as C, C++, C# or Java) that use static typing, Matlab uses dynamic typing. This means that it is natural and easy to modify array size dynamically during program execution. For example:

fibonacci = [0, 1];
for idx = 3 : 100
   fibonacci(idx) = fibonacci(idx-1) + fibonacci(idx-2);
end

While this may be simple to program, it is not wise with regards to performance. The reason is that whenever an array is resized (typically enlarged), Matlab allocates an entirely new contiguous block of memory for the array, copying the old values from the previous block to the new, then releasing the old block for potential reuse. This operation takes time to execute. In some cases, this reallocation might require accessing virtual memory and page swaps, which would take an even longer time to execute. If the operation is done in a loop, then performance could quickly drop off a cliff.

The cost of such naïve array growth is theoretically quadratic. This means that multiplying the number of elements by N multiplies the execution time by about N2. The reason for this is that Matlab needs to reallocate N times more than before, and each time takes N times longer due to the larger allocation size (the average block size multiplies by N), and N times more data elements to copy from the old to the new memory blocks.

A very interesting discussion of this phenomenon and various solutions can be found in a newsgroup thread from 2005. Three main solutions were presented: preallocation, selective dynamic growth (allocating headroom) and using cell arrays. The best solution among these in terms of ease of use and performance is preallocation.

The basics of pre-allocation

The basic idea of preallocation is to create a data array in the final expected size before actually starting the processing loop. This saves any reallocations within the loop, since all the data array elements are already available and can be accessed. This solution is useful when the final size is known in advance, as the following snippet illustrates:

% Regular dynamic array growth:
tic
fibonacci = [0,1];
for idx = 3 : 40000
   fibonacci(idx) = fibonacci(idx-1) + fibonacci(idx-2);
end
toc
   => Elapsed time is 0.019954 seconds.
 
% Now use preallocation – 5 times faster than dynamic array growth:
tic
fibonacci = zeros(40000,1);
fibonacci(1)=0; fibonacci(2)=1;
for idx = 3 : 40000, 
   fibonacci(idx) = fibonacci(idx-1) + fibonacci(idx-2);
end
toc
   => Elapsed time is 0.004132 seconds.

On pre-R2011a releases the effect of preallocation is even more pronounced: I got a 35-times speedup on the same machine using Matlab 7.1 (R14 SP3). R2011a (Matlab 7.12) had a dramatic performance boost for such cases in the internal accelerator, so newer releases are much faster in dynamic allocations, but preallocation is still 5 times faster even on R2011a.

Non-deterministic pre-allocation

Because the effect of preallocation is so dramatic on all Matlab releases, it makes sense to utilize it even in cases where the data array’s final size is not known in advance. We can do this by estimating an upper bound to the array’s size, preallocate this large size, and when we’re done remove any excess elements:

% The final array size is unknown – assume 1Kx3K upper bound (~23MB)
data = zeros(1000,3000);  % estimated maximal size
numRows = 0;
numCols = 0;
while (someCondition)
   colIdx = someValue1;   numCols = max(numCols,colIdx);
   rowIdx = someValue2;   numRows = max(numRows,rowIdx);
   data(rowIdx,colIdx) = someOtherValue;
end
 
% Now remove any excess elements
data(:,numCols+1:end) = [];   % remove excess columns
data(numRows+1:end,:) = [];   % remove excess rows

Variants for pre-allocation

It turns out that MathWorks’ official suggestion for preallocation, namely using the zeros function, is not the most efficient:

% MathWorks suggested variant
clear data1, tic, data1 = zeros(1000,3000); toc
   => Elapsed time is 0.016907 seconds.
 
% A much faster alternative - 500 times faster!
clear data1, tic, data1(1000,3000) = 0; toc
   => Elapsed time is 0.000034 seconds.

The reason for the second variant being so much faster is because it only allocates the memory, without worrying about the internal values (they get a default of 0, false or ”, in case you wondered). On the other hand, zeros has to place a value in each of the allocated locations, which takes precious time.

In most cases the differences are immaterial since the preallocation code would only run once in the program, and an extra 17ms isn’t such a big deal. But in some cases we may have a need to periodically refresh our data, where the extra run-time could quickly accumulate.

Pre-allocating non-default values

When we need to preallocate a specific value into every data array element, we cannot use Variant #2. The reason is that Variant #2 only sets the very last data element, and all other array elements get assigned the default value (0, ‘’ or false, depending on the array’s data type). In this case, we can use one of the following alternatives (with their associated timings for a 1000×3000 data array):

scalar = pi;  % for example...
 
data = scalar(ones(1000,3000));           % Variant A: 87.680 msecs
data(1:1000,1:3000) = scalar;             % Variant B: 28.646 msecs
data = repmat(scalar,1000,3000);          % Variant C: 17.250 msecs
data = scalar + zeros(1000,3000);         % Variant D: 17.168 msecs
data(1000,3000) = 0; data = data+scalar;  % Variant E: 16.334 msecs

As can be seen, Variants C-E are about twice as fast as Variant B, and 5 times faster than Variant A.

Pre-allocating non-double data

7.4.5 Preallocating non-double data
When preallocating an array of a type that is not double, we should be careful to create it using the desired type, to prevent memory and/or performance inefficiencies. For example, if we need to process a large array of small integers (int8), it would be inefficient to preallocate an array of doubles and type-convert to/from int8 within every loop iteration. Similarly, it would be inefficient to preallocate the array as a double type and then convert it to int8. Instead, we should create the array as an int8 array in the first place:

% Bad idea: allocates 8MB double array, then converts to 1MB int8 array
data = int8(zeros(1000,1000));   % 1M elements
   => Elapsed time is 0.008170 seconds.
 
% Better: directly allocate the array as a 1MB int8 array – x80 faster
data = zeros(1000,1000,'int8');
   => Elapsed time is 0.000095 seconds.

Pre-allocating cell arrays

To preallocate a cell-array we can use the cell function (explicit preallocation), or the maximal cell index (implicit preallocation). Explicit preallocation is faster than implicit preallocation, but functionally equivalent (Note: this is contrary to the experience with allocation of numeric arrays and other arrays):

% Variant #1: Explicit preallocation of a 1Kx3K cell array
data = cell(1000,3000);
   => Elapsed time is 0.004637 seconds. 
 
% Variant #2: Implicit preallocation – x3 slower than explicit
clear('data'), data{1000,3000} = [];
   => Elapsed time is 0.012873 seconds.

Pre-allocating arrays of structs

To preallocate an array of structs or class objects, we can use the repmat function to replicate copies of a single data element (explicit preallocation), or just use the maximal data index (implicit preallocation). In this case, unlike the case of cell arrays, implicit preallocation is much faster than explicit preallocation, since the single element does not actually need to be copied multiple times (ref):

% Variant #1: Explicit preallocation of a 100x300 struct array
element = struct('field1',magic(2), 'field2',{[]});
data = repmat(element, 100, 300);
   => Elapsed time is 0.002804 seconds. 
 
% Variant #2: Implicit preallocation – x7 faster than explicit 
element = struct('field1',magic(2), 'field2',{[]});
clear('data'), data(100,300) = element;
   => Elapsed time is 0.000429 seconds.

When preallocating structs, we can also use a third variant, using the built-in struct feature of replicating the struct when the struct function is passed a cell array. For example, struct('field1',cell(100,1), 'field2',5) will create 100 structs, each of them having the empty field field1 and another field called field2 with value 5. Unfortunately, this variant is slower than both of the previous variants.

Pre-allocating class objects

When preallocating in general, ensure that you are using the maximal expected array size. There is no point in preallocating an empty array or an array having a smaller size than the expected maximum, since dynamic memory reallocation will automatically kick-in within the processing-loop. For this reason, do not use the empty() method of class objects to preallocate, but rather repmat as explained above.

When using repmat to replicate class objects, always be careful to note whether you are replicating the object itself (this happens if your class does NOT derive from handle) or its reference handle (which happens if you derive the class from handle). If you are replicating objects, then you can safely edit any of their properties independently of each other; but if you replicate references, you are merely using multiple copies of the same reference, so that modifying referenced object #1 will also automatically affect all the other referenced objects. This may or may not be suitable for your particular program requirements, so be careful to check carefully. If you actually need to use independent object copies, you will need to call the class constructor multiple times, once for each new independent object.

Next week: what if we can’t avoid dynamic array resizing? – apparently, all is not lost. Stay tuned…


Do you have any similar allocation-related tricks you’re using? or unexpected differences such as the ones shown above? If so, then please do post a comment.

Related posts:

  1. Array resizing performance Several alternatives are explored for dynamic array growth performance in Matlab loops. ...
  2. Matrix processing performance Matrix operations performance is affected by internal subscriptions in a counter-intuitive way....
  3. Allocation performance take 2 The clear function has some non-trivial effects on Matlab performance. ...
  4. Performance: scatter vs. line In many circumstances, the line function can generate visually-identical plots as the scatter function, much faster...
  5. Improving save performance There are many different ways of improving Matlab's standard save function performance. ...
  6. Class object creation performance Performance aspects of Matlab class object creation are discussed, with specific suggestions. ...

Categories: Low risk of breaking in future versions, Memory, Stock Matlab function, Undocumented feature

Tags: , , , ,

Bookmark and SharePrint Print

19 Responses to Preallocation performance

  1. Richard Kennaway says:

    It turns out that MathWorks’ official suggestion for preallocation, namely using the zeros function, is not the most efficient:

    Doesn’t work for me (Matlab 2011a on Mac OS 10.6.8):

    >> clear data1, tic, data1 = zeros(1000,3000); toc
    Elapsed time is 0.005574 seconds.
    >> clear data1, tic, data1(1000,3000) = 0; toc
    Elapsed time is 0.005813 seconds.
    >> clear data1, tic, data1 = zeros(1000,3000); toc
    Elapsed time is 0.005598 seconds.
    >> clear data1, tic, data1(1000,3000) = 0; toc
    Elapsed time is 0.005608 seconds.

    For “Pre-allocating non-default values”, my timings mostly agree, except that C and D are the other way round:

    A 0.076208 seconds.
    B 0.010146 seconds.
    C 0.024495 seconds.
    D 0.014459 seconds.
    E 0.012212 seconds.

    The other examples all worked the same for me as in the post.

  2. Cris says:

    This is strange! Can you explain why the 2nd variant is only 100 faster the first time around? Only after a ‘clear all’ you get a fast execution again.

    >> clear all
    >> clear data1, tic, data1 = zeros(1000,3000); toc
    Elapsed time is 0.020362 seconds.
    >> clear data1, tic, data1 = zeros(1000,3000); toc
    Elapsed time is 0.020228 seconds.
    >> clear data1, tic, data1 = zeros(1000,3000); toc
    Elapsed time is 0.018324 seconds.
    >> clear data2, tic, data2(1000,3000) = 0; toc
    Elapsed time is 0.000206 seconds.
    >> clear data2, tic, data2(1000,3000) = 0; toc
    Elapsed time is 0.022243 seconds.
    >> clear data2, tic, data2(1000,3000) = 0; toc
    Elapsed time is 0.019689 seconds.
    >> clear data2, tic, data2(1000,3000) = 0; toc
    Elapsed time is 0.020465 seconds.
    • @Chris & Richard: perhaps this is platform or situation dependent. On my Windows XP running both R2010b and R2012a, data1=zeros(1000,3000) is consistently hundreds of times slower than data1(1000,3000)=0.

      Perhaps you are running this on a “warm” Matlab session that is running on fragmented memory? Try running on a new Matlab session that does not clash for system memory with other heavy processes (e.g. a multi-tab browser).

    • pedro says:

      greetings!
      “@Chris & Richard: perhaps this is platform or situation dependent. On my Windows XP running both R2010b and R2012a, data1=zeros(1000,3000) is consistently hundreds of times faster than data1(1000,3000)=0.”
      now I’m confused: did you mean slower, by any chance?
      cheers,
      pedro

    • @Pedro – you’re right, it was a typo (now corrected) – thanks :-)

    • Daniel says:

      I can recreate Yair’s differences between data1 = zeros(1000,3000); and data2(1000,3000) = 0; on 32-bit Windows XP 32 with R2010a, but not on 64-bit Linux with R2011a. So it looks like it might be OS or 32/64 bit dependent.

      But I wanted to explore a little more so I decided to up the size of the matrix so it would take more time to compute in Linux

      >> clear data1, tic, data1(1e4,1e4) = 0; toc
      Elapsed time is 0.000044 seconds.
      >> clear data1, tic, data1(1e3,1e3) = 0; toc
      Elapsed time is 0.005977 seconds.
      >> clear data1, tic, data1(1e4,1e4) = 0; toc
      Elapsed time is 0.000040 seconds.
      >> clear data1, tic, data1(1e3,1e3) = 0; toc
      Elapsed time is 0.006443 seconds.
      >> clear data1, tic, data1(1e4,1e4) = 0; toc
      Elapsed time is 0.000030 seconds.
      >> clear data1, tic, data1(1e3,1e3) = 0; toc
      Elapsed time is 0.006265 seconds.

      It appears it is faster to preallocate a 1e4 x 1e4 (800MB) matrix than a 1e3 x 1e3 (8MB) matrix.

  3. Marc says:

    Of course it’s good to profile first, then use this technique if warranted. Otherwise, write clear, simple code that’s easy to reason about. I usually shut off this particular Code Analyzer warning because it’s almost never been an important bottleneck for me. YMMV

  4. Héctor says:

    I run on windows 7, Matlab R2010a. My Matlab has been doing things for 2 days and was release from heavy duty 5 minutes ago. This are my results:

    >> clear data1, tic, data1 = zeros(1000,3000); toc
    Elapsed time is 0.007353 seconds.
    >> clear data1, tic, data1(1000,3000) = 0; toc
    Elapsed time is 0.000027 seconds.
    >> clear data1, tic, data1 = zeros(1000,3000); toc
    Elapsed time is 0.007176 seconds.
    >> clear data1, tic, data1(1000,3000) = 0; toc
    Elapsed time is 0.000023 seconds.
  5. Alexander says:

    Dear Yair,

    in your nice post you used the code

    scalar = pi;  % for example...
     
    data = scalar(ones(1000,3000));           % Variant A: 87.680 msecs
    sum(a)

    I’ve never seen this way of using a scalar, mapping it to an array. Is there any passage in the documentation you could cite or explanation why and how precisely this works?

    Alexander

    • @Alexander – using indexing to repeat array elements is such a standard technique for me that I didn’t even think twice about the fact that it may not be well known… This technique is used numerous times in the Matlab code corpus, although now that you asked I couldn’t find a direct doc reference for it (there probably is some reference, I just couldn’t find it).

      Here is an unofficial article that explains the use.

    • Alexander says:

      Dear Yair,

      thanks for the link. Indeed I’ve never seen this way of indexing befor and I’m sure I’ll use it in many places.

      The main step to understand seems to be to think of scalars being implemented as arrays with a singleton dimension. Then everything becomes clear. Thinks could be so easy …

      Thanks

  6. Pingback: Internal Matlab memory optimizations | Undocumented Matlab

  7. Pingback: preallocation is not an option | Matlabtips.com

  8. Troy says:

    For “Variants for pre-allocation”:

    Using zeros is about as fast for me, as long as you specify the class, e.g.:

    >> clear z; tic; z=zeros(10000,'double'); toc
    Elapsed time is 0.000034 seconds.

    is comparable on my machine to setting z(10000,10000) = 0
    (Elapsed time is 0.000030 seconds.)

  9. Mike says:

    Thanks for putting this information up. To comment on the timings, I think they are too small to be measured correctly this way. You should run these commands in a loop for a couple thousand times to get timings into seconds, otherwise they will be influenced by context switching of your operating system.

    I doubt it matters if you use the zeroes command or direct indexing. As you say, the array gets initialized with zeroes either way (FALSE or ” are zeroes as well). Using different data types might cause a smaller number of bytes to be allocated and a smaller initialization time.

  10. natan says:

    Dear Yair,
    Thanks for this fascinating post.
    Incidentally, after noticing that empty matrix multiplication also produces a matrix zeros:

    zeros(3,0)*zeros(0,3)
    ans =
     0     0     0
     0     0     0
     0     0     0

    I’ve discovered that this is another way to improve pre-allocation performance. For example,

    f=@() zeros(1000)
        timeit(f)
        ans =
            0.0033

    vs:

    g=@() zeros(1000,0)*zeros(0,1000)
    timeit(g)
    ans =
            9.2048e-06

    That improvement is array size dependent and only happens above some matrix size, more info can be seen in a StackOverflow post I’ve recently posted.

    • @Nathan – Thanks for the update. I think that a(1000,1000)=0; is simpler to use. IMHO, the potential minor extra saving by the product trick is not worth the extra complexity. Still, I admit it’s a very neat trick.

  11. Pingback: Allocation performance take 2 | Undocumented Matlab

  12. Pingback: Class object creation performance | Undocumented Matlab

Leave a Reply

Your email address will not be published. Required fields are marked *

*

<pre lang="matlab">
a = magic(3);
sum(a)
</pre>