Memory – Undocumented Matlab

Blocked wait with timeout for asynchronous events

Yair Altman — Sun, 13 May 2018 20:22:08 +0000

Readers of this website may have noticed that I have recently added an IQML section to the website’s top menu bar. IQML is a software connector that connects Matlab to DTN’s IQFeed, a financial data-feed of live and historic market data. IQFeed, like most other data-feed providers, sends its data in asynchronous messages, which need to be processed one at a time by the receiving client program (Matlab in this case). I wanted IQML to provide users with two complementary modes of operation:

Streaming (asynchronous, non-blocking) – incoming server data is processed by internal callback functions in the background, and is made available for the user to query at any later time.
Blocking (synchronously waiting for data) – in this case, the main Matlab processing flows waits until the data arrives, or until the specified timeout period has passed – whichever comes first.

Implementing streaming mode is relatively simple in general – all we need to do is ensure that the underlying connector object passes the incoming server messages to the relevant Matlab function for processing, and ensure that the user has some external way to access this processed data in Matlab memory (in practice making the connector object pass incoming data messages as Matlab callback events may be non-trivial, but that’s a separate matter – read here for details).
In today’s article I’ll explain how we can implement a blocking mode in Matlab. It may sound difficult but it turns out to be relatively simple.
I had several requirements/criteria for my blocked-wait implementation:

Compatibility – It had to work on all Matlab platforms, and all Matlab releases in the past decade (which rules out using Microsoft Dot-NET objects)
Ease-of-use – It had to work out-of-the-box, with no additional installation/configuration (which ruled out using Perl/Python objects), and had to use a simple interface function
Timeout – It had to implement a timed-wait, and had to be able to tell whether the program proceeded due to a timeout, or because the expected event has arrived
Performance – It had to have minimal performance overhead

The basic idea

The basic idea is to use Matlab’s builtin waitfor, as I explained in another post back in 2012. If our underlying connector object has some settable boolean property (e.g., Done) then we can set this property in our event callback, and then waitfor(object,'Done'). The timeout mechanism is implemented using a dedicated timer object, as follows:

% Wait for data updates to complete (isDone = false if timeout, true if event has arrived)
function isDone = waitForDone(object, timeout)
    % Initialize: timeout flag = false
    object.setDone(false);
    % Create and start the separate timeout timer thread
    hTimer = timer('TimerFcn',@(h,e)object.setDone(true), 'StartDelay',timeout);
    start(hTimer);
    % Wait for the object property to change or for timeout, whichever comes first
    waitfor(object,'Done');
    % waitfor is over - either because of timeout or because the data changed
    % To determine which, check whether the timer callback was activated
    isDone = isvalid(hTimer) && hTimer.TasksExecuted == 0;
    % Delete the timer object
    try stop(hTimer);   catch, end
    try delete(hTimer); catch, end
    % Return the flag indicating whether or not timeout was reached
end  % waitForDone

where the event callback is responsible for invoking object.setDone(true) when the server data arrives. The usage would then be similar to this:

requestDataFromServer();
if isBlockingMode
   % Blocking mode - wait for data or timeout (whichever comes first)
   isDone = waitForDone(object, 10.0);  % wait up to 10 secs for data to arrive
   if ~isDone  % indicates a timeout
      fprintf(2, 'No server data has arrived within the specified timeout period!\n')
   end
else
   % Non-blocking (streaming) mode - continue with regular processing
end

Using a stand-alone generic signaling object

But what can we do if we don’t have such a Done property in our underlying object, or if we do not have programmatic access to it?
We could create a new non-visible figure and then waitfor one of its properties (e.g. Resize). The property would be initialized to 'off', and within both the event and timer callbacks we would set it to 'on', and then waitfor(hFigure,'Resize','on'). However, figures, even if non-visible, are quite heavy objects in terms of both memory, UI resources, and performance.
It would be preferable to use a much lighter-weight object, as long as it abides by the other criteria above. Luckily, there are numerous such objects in Java, which is bundled in every Matlab since 2000, on every Matlab platform. As long as we choose a small Java object that has existed forever, we should be fine. For example, we could use a simple javax.swing.JButton and its boolean property Enabled:

hSemaphore = handle(javax.swing.JButton);  % waitfor() expects a handle() object, not a "naked" Java reference
% Wait for data updates to complete (isDone = false if timeout, true if event has arrived)
function isDone = waitForDone(hSemaphore, timeout)
    % Initialize: timeout flag = false
    hSemaphore.setEnabled(false);
    % Create and start the separate timeout timer thread
    hTimer = timer('TimerFcn',@(h,e)hSemaphore.setEnabled(true), 'StartDelay',timeout);
    start(hTimer);
    % Wait for the object property to change or for timeout, whichever comes first
    waitfor(hSemaphore,'Enabled');
    % waitfor is over - either because of timeout or because the data changed
    % To determine which, check whether the timer callback was activated
    isDone = isvalid(hTimer) && hTimer.TasksExecuted == 0;
    % Delete the timer object
    try stop(hTimer);   catch, end
    try delete(hTimer); catch, end
    % Return the flag indicating whether or not timeout was reached
end  % waitForDone

In this implementation, we would need to pass the hSemaphore object handle to the event callback, so that it would be able to invoke hSemaphore.setEnabled(true) when the server data has arrived.
Under the hood, note that Enabled is not a true “property” of javax.swing.JButton, but rather exposes two simple public getter/setter methods (setEnabled() and isEnabled()), which Matlab interprets as a “property”. For all intents and purposes, in our Matlab code we can treat Enabled as a property of javax.swing.JButton, including its use by Matlab’s builtin waitfor function.
There is a light overhead to this: on my laptop the waitfor function returns ~20 mSecs after the invocation of hSemaphore.setEnabled(true) in the timer or event callback – in many cases this overhead is negligible compared to the networking latencies for the blocked data request. When event reactivity is of utmost importance, users can always use streaming (non-blocking) mode, and process the incoming data events immediately in a callback.
Of course, it would have been best if MathWorks would add a timeout option and return value to Matlab’s builtin waitfor function, similar to my waitForDone function – this would significantly simplify the code above. But until this happens, you can use waitForDone pretty-much as-is. I have used similar combinations of blocking and streaming modes with multiple other connectors that I implemented over the years (Interactive Brokers, CQG, Bloomberg and Reuters for example), and the bottom line is that it actually works well in practice.
Let me know if you’d like me to assist with your own Matlab project or data connector, either developing it from scratch or improving your existing code. I will be visiting Boston and New York in early June and would be happy to meet you in person to discuss your specific needs.

The post Blocked wait with timeout for asynchronous events appeared first on Undocumented Matlab.

Quirks with parfor vs. for

Yair Altman — Thu, 05 Jan 2017 17:15:48 +0000

A few months ago, I discussed several tips regarding Matlab’s parfor command, which is used by the Parallel Computing Toolbox (PCT) for parallelizing loops. Today I wish to extend that post with some unexplained oddities when using parfor, compared to a standard for loop.

Data serialization quirks

Dimitri Shvorob may not appear at first glance to be a prolific contributor on Matlab Central, but from the little he has posted over the years I regard him to be a Matlab power-user. So when Dimitri reports something, I take it seriously. Such was the case several months ago, when he contacted me regarding very odd behavior that he saw in his code: the for loop worked well, but the parfor version returned different (incorrect) results. Eventually, Dimitry traced the problem to something originally reported by Dan Austin on his Fluffy Nuke It blog.
The core issue is that if we have a class object that is used within a for loop, Matlab can access the object directly in memory. But with a parfor loop, the object needs to be serialized in order to be sent over to the parallel workers, and deserialized within each worker. If this serialization/deserialization process involves internal class methods, the workers might see a different version of the class object than the one seen in the serial for loop. This could happen, for example, if the serialization/deserialization method croaks on an error, or depends on some dynamic (or random) conditions to create data.
In other words, when we use data objects in a parfor loop, the data object is not necessarily sent “as-is”: additional processing may be involved under the hood that modify the data in a way that may be invisible to the user (or the loop code), resulting in different processing results of the parallel (parfor) vs. serial (for) loops.
For additional aspects of Matlab serialization/deserialization, see my article from 2 years ago (and its interesting feedback comments).

Data precision quirks

The following section was contributed by guest blogger Lior Perlmuter-Shoshany, head algorithmician at a private equity fund.
In my work, I had to work with matrixes in the order of 10⁹ cells. To reduce the memory footprint (and hopefully also improve performance), I decided to work with data of type single instead of Matlab’s default double. Furthermore, in order to speed up the calculation I use parfor rather than for in the main calculation. In the end of the run I am running a mini for-loop to see the best results.
What I discovered to my surprise is that the results from the parfor and for loop variants is not the same!

The following simplified code snippet illustrate the problem by calculating a simple standard-deviation (std) over the same data, in both single– and double-precision. Note that the loops are ran with only a single iteration, to illustrate the fact that the problem is with the parallelization mechanism (probably the serialization/deserialization parts once again), not with the distribution of iterations among the workers.

clear
rng('shuffle','twister');
% Prepare the data in both double and single precision
arr_double = rand(1,100000000);
arr_single = single(arr_double);
% No loop - direct computation
std_single0 = std(arr_single);
std_double0 = std(arr_double);
% Loop #1 - serial for loop
std_single = 0;
std_double = 0;
for i=1
    std_single(i) = std(arr_single);
    std_double(i) = std(arr_double);
end
% Loop #2 - parallel parfor loop
par_std_single = 0;
par_std_double = 0;
parfor i=1
    par_std_single(i) = std(arr_single);
    par_std_double(i) = std(arr_double);
end
% Compare results of for loop vs. non-looped computation
isForSingleOk = isequal(std_single, std_single0)
isForDoubleOk = isequal(std_double, std_double0)
% Compare results of single-precision data (for vs. parfor)
isParforSingleOk = isequal(std_single, par_std_single)
parforSingleAccuracy = std_single / par_std_single
% Compare results of double-precision data (for vs. parfor)
isParforDoubleOk = isequal(std_double, par_std_double)
parforDoubleAccuracy = std_double / par_std_double

Output example :

isForSingleOk =
    1                   % <= true (of course!)
isForDoubleOk =
    1                   % <= true (of course!)
isParforSingleOk =
    0                   % <= false (odd!)
parforSingleAccuracy =
    0.73895227413361    % <= single-precision results are radically different in parfor vs. for
isParforDoubleOk =
    0                   % <= false (odd!)
parforDoubleAccuracy =
    1.00000000000021    % <= double-precision results are almost [but not exactly] the same in parfor vs. for

From my testing, the larger the data array, the bigger the difference is between the results of single-precision data when running in for vs. parfor.
In other words, my experience has been that if you have a huge data matrix, it's better to parallelize it in double-precision if you wish to get [nearly] accurate results. But even so, I find it deeply disconcerting that the results are not exactly identical (at least on R2015a-R2016b on which I tested) even for the native double-precision .
Hmmm... bug?

Upcoming travels - Zürich & Geneva

I will shortly be traveling to clients in Zürich and Geneva, Switzerland. If you are in the area and wish to meet me to discuss how I could bring value to your work with some advanced Matlab consulting or training, then please email me (altmany at gmail):

Zürich: January 15-17
Geneva: January 18-21

Happy new year everybody!

The post Quirks with parfor vs. for appeared first on Undocumented Matlab.

General-use object copy

Yair Altman — Wed, 06 May 2015 13:30:23 +0000

When using Matlab objects, either a Matlab class (MCOS) or any other (e.g., Java, COM, C# etc.), it is often useful to create a copy of the original object, complete with all internal property values. This enables modification of the new copy without affecting the original object. This is not important for MCOS value-class objects, since value objects use the COW (Copy-on-Write/Update, a.k.a. Lazy Copy) and this is handled automatically by the Matlab interpreter when it detects that a change is made to the copy reference. However, it is very important for handle objects, where modifying any property of the copied object also modifies the original object.
Most OOP languages include some sort of a copy constructor, which enables programmers to duplicate a handle/reference object, internal properties included, such that it becomes entirely separate from the original object. Unfortunately, Matlab did not include such a copy constructor until R2011a (matlab.mixin.Copyable.copy()).
On Matlab R2010b and older, as well as on newer releases, we do not have a readily-available solution for handle object copy. Until now, that is.

There are several ways by which we can create such a copy function. We might call the main constructor to create a default object and then override its properties by iterating over the original object’s properties. This might work in some cases, but not if there is no default constructor for the object, or if there are side-effects to object property modifications. If we wanted to implement a deep (rather than shallow) copy, we’d need to recursively iterate over all the properties of the internal objects as well.
A simpler solution might be to save the object to a temporary file (tempname, then load from that file (which creates a copy), and finally delete the temp file. This is nice and clean, but the extra I/O could be relatively slow compared to in-memory processing.
Which leads us to today’s chosen solution, where we use Matlab’s builtin functions getByteStreamFromArray and getArrayFromByteStream, which I discussed last year as a way to easily serialize and deserialize Matlab data of any type. Specifically, getArrayFromByteStream has the side-effect of creating a duplicate of the serialized data, which is perfect for our needs here (note that these pair of function are only available on R2010b or newer; on R2010a or older we can still serialize via a temp file):

% Copy function - replacement for matlab.mixin.Copyable.copy() to create object copies
function newObj = copy(obj)
    try
        % R2010b or newer - directly in memory (faster)
        objByteArray = getByteStreamFromArray(obj);
        newObj = getArrayFromByteStream(objByteArray);
    catch
        % R2010a or earlier - serialize via temp file (slower)
        fname = [tempname '.mat'];
        save(fname, 'obj');
        newObj = load(fname);
        newObj = newObj.obj;
        delete(fname);
    end
end

This function can be placed anywhere on the Matlab path and will work on all recent Matlab releases (including R2010b and older), any type of Matlab data (including value or handle objects, UDD objects, structs, arrays etc.), as well as external objects (Java, C#, COM). In short, it works on anything that can be assigned to a Matlab variable:

obj1 = ... % anything really!
obj2 = obj1.copy();  % alternative #1
obj2 = copy(obj1);   % alternative #2

Alternative #1 may look “nicer” to a computer scientist, but alternative #2 is preferable because it also handles the case of non-object data (e.g., [] or ‘abc’ or magic(5) or a struct or cell array), whereas alternative #1 would error in such cases.
In any case, using either alternatives, we no longer need to worry about inheriting our MCOS class from matlab.mixin.Copyable, or backward compatibility with R2010b and older (I may possibly be bashed for this statement, but in my eyes future compatibility is less important than backward compatibility). This is not such a wild edge-case. In fact, I came across the idea for this post last week, when I developed an MCOS project for a consulting client that uses both R2010a and R2012a, and the same code needed to run on both Matlab releases.
Using the serialization functions also solves the case of creating copies for Java/C#/COM objects, which currently have no other solution, except if these objects happen to contain their own copy method.
In summary, using Matlab’s undocumented builtin serialization functions enables easy implementation of a very efficient (in-memory) copy constructor, which is expected to work across all Matlab types and many releases, without requiring any changes to existing code – just placing the above copy function on the Matlab path. This is expected to continue working properly until Matlab decides to remove the serialization functions (which should hopefully never happen, as they are so useful).
Sometimes, the best solutions lie not in sophisticated new features (e.g., matlab.mixin.Copyable), but by using plain ol’ existing building blocks. There’s a good lesson to be learned here I think.
p.s. – I do realize that matlab.mixin.Copyable provides the nice feature of enabling users to control the copy process, including implementing deep or shallow or selective copy. If that’s your need and you have R2011a or newer then good for you, go ahead and inherit Copyable. Today’s post was meant for the regular Joe who doesn’t need this fancy feature, but does need to support R2010b, and/or a simple way to clone Java/C#/COM objects.

The post General-use object copy appeared first on Undocumented Matlab.

New book: Accelerating MATLAB Performance

Yair Altman — Tue, 16 Dec 2014 21:22:03 +0000

I am pleased to announce that after three years of research and hard work, following my first book on Matlab-Java programming, my new book “Accelerating MATLAB Performance” is finally published.

The Matlab programming environment is often perceived as a platform suitable for prototyping and modeling but not for “serious” applications. One of the main complaints is that Matlab is just too slow.
Accelerating MATLAB Performance (CRC Press, ISBN 9781482211290, 785 pages) aims to correct this perception, by describing multiple ways to greatly improve Matlab program speed.
The book:

Demonstrates how to profile MATLAB code for performance and resource usage, enabling users to focus on the program’s actual hotspots
Considers tradeoffs in performance tuning, horizontal vs. vertical scalability, latency vs. throughput, and perceived vs. actual performance
Explains generic speedup techniques used throughout the software industry and their adaptation for Matlab, plus methods specific to Matlab
Analyzes the effects of various data types and processing functions
Covers vectorization, parallelization (implicit and explicit), distributed computing, optimization, memory management, chunking, and caching
Explains Matlab’s memory model and shows how to profile memory usage and optimize code to reduce memory allocations and data fetches
Describes the use of GPU, MEX, FPGA, and other forms of compiled code
Details acceleration techniques for GUI, graphics, I/O, Simulink, object-oriented Matlab, Matlab startup, and deployed applications
Discusses a wide variety of MathWorks and third-party functions, utilities, libraries, and toolboxes that can help to improve performance

Ideal for novices and professionals alike, the book leaves no stone unturned. It covers all aspects of Matlab, taking a comprehensive approach to boosting Matlab performance. It is packed with thousands of helpful tips, code examples, and online references. Supported by this active website, the book will help readers rapidly attain significant reductions in development costs and program run times.
Additional information about the book, including detailed Table-of-Contents, book structure, reviews, resources and errata list, can be found in a dedicated webpage that I’ve prepared for this book and plan to maintain.

Click here to get your book copy now!
Use promo code MZK07 for a 25% discount and free worldwide shipping on crcpress.com
Instead of focusing on just a single performance aspect, I’ve attempted to cover all bases at least to some degree. The basic idea is that there are numerous different ways to speed up Matlab code: Some users might like vectorization, others may prefer parallelization, still others may choose caching, or smart algorithms, or better memory-management, or compiled C code, or improved I/O, or faster graphics. All of these alternatives are perfectly fine, and the book attempts to cover every major alternative. I hope that you will find some speedup techniques to your liking among the alternatives, and at least a few new insights that you can employ to improve your program’s speed.
I am the first to admit that this book is far from perfect. There are several topics that I would have loved to explore in greater detail, and there are probably many speedup tips that I forgot to mention or have not yet discovered. Still, with over 700 pages of speedup tips, I thought this book might be useful enough as-is, flawed as it may be. After all, it will never be perfect, but I worked very hard to make it close enough, and I really hope that you’ll agree.
If your work relies on Matlab code performance in any way, you might benefit by reading this book. If your organization has several people who might benefit, consider inviting me for dedicated onsite training on Matlab performance and other advanced Matlab topics.
As always, your comments and feedback would be greatly welcome – please post them directly on the book’s webpage.
Happy Holidays everybody!

The post New book: Accelerating MATLAB Performance appeared first on Undocumented Matlab.

Allocation performance take 2

Yair Altman — Wed, 14 Aug 2013 18:00:05 +0000

Last week, Mike Croucher posted a very interesting article on the fact that cumprod can be used to generate a vector of powers much more quickly than the built-in .^ operator. Trying to improve on Mike’s results, I used my finding that zeros(n,m)+scalar is often faster than ones(n,m)*scalar (see my article on pre-allocation performance). Applying this to Mike’s powers-vector example, zeros(n,m)+scalar only gave me a 25% performance boost (i.e., 1-1/1.25 or 20% faster), rather than the x5 speedup that I received in my original article.
Naturally, the difference could be due to different conditions: a different running platform, OS, Matlab release, and allocation size. But the difference felt intriguing enough to warrant a small investigation. I came up with some interesting new findings, that I cannot fully explain:

The performance of allocating zeros, ones (x100)

function t=perfTest
    % Run tests multiple time, for multiple allocation sizes
    n=100; tidx = 1; iters = 100;
    while n < 1e8
        t(tidx,1) = n;
        clear y; tic, for idx=1:iters, clear y; y=ones(n,1);  end, t(tidx,2)=toc;  % clear; ones()
        clear y; tic, for idx=1:iters, clear y; y=zeros(n,1); end, t(tidx,3)=toc;  % clear; zeros()
        clear y; tic, for idx=1:iters,          y=ones(n,1);  end, t(tidx,4)=toc;  % only ones()
        clear y; tic, for idx=1:iters,          y=zeros(n,1); end, t(tidx,5)=toc;  % only zeros()
        n = n * 2;
        tidx = tidx + 1;
    end
    % Normalize result on a per-element basis
    t2 = bsxfun(@rdivide, t(:,2:end), t(:,1));
    % Display the results
    h  = loglog(t(:,1), t(:,2:end));  % overall durations
    %h = loglog(t(:,1), t2);  % normalized durations
    set(h, 'LineSmoothing','on');  % see https://undocumentedmatlab.com/articles/plot-linesmoothing-property/
    set(h(2), 'LineStyle','--', 'Marker','+', 'MarkerSize',5, 'Color',[0,.5,0]);
    set(h(3), 'LineStyle',':',  'Marker','o', 'MarkerSize',5);
    set(h(4), 'LineStyle','-.', 'Marker','*', 'MarkerSize',5);
    legend(h, 'clear; ones', 'clear; zeros', 'ones', 'zeros', 'Location','NorthWest');
    xlabel('# allocated elements');
    ylabel('duration [secs]');
    box off
end

The full results were (R2013a, Win 7 64b, 8MB):

The same data normalized per-element (x100)

       n  clear,ones  clear,zeros    only ones   only zeros
========  ==========  ===========    =========   ==========
     100    0.000442     0.000384     0.000129     0.000124
     200    0.000390     0.000378     0.000150     0.000121
     400    0.000404     0.000424     0.000161     0.000151
     800    0.000422     0.000438     0.000165     0.000176
    1600    0.000583     0.000516     0.000211     0.000206
    3200    0.000656     0.000606     0.000325     0.000296
    6400    0.000863     0.000724     0.000587     0.000396
   12800    0.001289     0.000976     0.000975     0.000659
   25600    0.002184     0.001574     0.001874     0.001360
   51200    0.004189     0.002776     0.003649     0.002320
  102400    0.010900     0.005870     0.010778     0.005487
  204800    0.051658     0.000966     0.049570     0.000466
  409600    0.095736     0.000901     0.095183     0.000463
  819200    0.213949     0.000984     0.219887     0.000817
 1638400    0.421103     0.001023     0.429692     0.000610
 3276800    0.886328     0.000936     0.877006     0.000609
 6553600    1.749774     0.000972     1.740359     0.000526
13107200    3.499982     0.001108     3.550072     0.000649
26214400    7.094449     0.001144     7.006229     0.000712
52428800   14.039551     0.001853    14.396687     0.000822

(Note: all numbers should be divided by the number of loop iterations iters=100)
As can be seen from the data and resulting plots (log-log scale), the more elements we allocate, the longer this takes. It is not surprising that in all cases the allocation duration is roughly linear, since when twice as many elements need to be allocated, this roughly takes twice as long. It is also not surprising to see that the allocation has some small overhead, which is apparent when allocating a small number of elements.
A potentially surprising result, namely that allocating 200-400 elements is in some cases a bit faster than allocating only 100 elements, can actually be attributed to measurement inaccuracies and JIT warm-up time.
Another potentially surprising result, that zeros is consistently faster than ones can perhaps be explained by zeros being able to use more efficient low-level functions (bzero) for clearing memory, than ones which needs to memset a value.
A somewhat more surprising effect is that of the clear command: As can be seen in the code, calling clear within the timed loops has no functional use, because in all cases the variable y is being overridden with new values. However, we clearly see that the overhead of calling clear is an extra 3&#181S or so per call. Calling clear is important in cases where we deal with very large memory constructs: clearing them from memory enables additional memory allocations (of the same or different variables) without requiring virtual memory paging, which would be disastrous for performance. But if we have a very large loop which calls clear numerous times and does not serve such a purpose, then it is better to remove this call: although the overhead is small, it accumulates and might be an important factor in very large loops.
Another aspect that is surprising is the fact that zeros (with or without clear) is much faster when allocating 200K+ elements, compared to 100K elements. This is indicative of an internal switch to a more optimized allocation algorithm, which apparently has constant speed rather than linear with allocation size. At the very same point, there is a corresponding performance degradation in the allocation of ones. I suspect that 100K is the point at which Matlab’s internal parallelization (multi-threading) kicks in. This occurs at varying points for different functions, but it is normally some multiple of 20K elements (20K, 40K, 100K or 200K – a detailed list was posted by Mike Croucher again). Apparently, it kicks-in at 100K for zeros, but for some reason not for ones.
The performance degradation at 100K elements has been around in Matlab for ages – I see it as far back as R12 (Matlab 6.0), for both zeros and ones. The reason for it is unknown to me, if anyone could illuminate me, I’d be happy to hear. The new thing is the implementation of a faster internal mechanism (presumably multi-threading) in R2008b (Matlab 7.7) for zeros, at the very same point (100K elements), although for some unknown reason this was not done for ones as well.
Another aspect that is strange here is that the speedup for zeros at 200K elements is ~12 – much higher than the expected optimal speedup of 4 on my quad-core system. The higher speedup may perhaps be explained by hyper-threading or SIMD at the CPU level.
In any case, going back to the original reason I started this investigation, the reason for getting such wide disparity in speedups between using zeros and ones for 10K elements (as in Mike Croucher’s post), and for 3M elements (as in my pre-allocation performance article) now becomes clear: In the case of 10K elements, multi-threading is not active, and zeros is indeed only 20-30% faster than ones; In the case of 3M elements, the superior multi-threading of zeros over ones enables much larger speedups, increasing with allocated size.
Some take-away lessons:

Using zeros is always preferable to ones, especially for more than 100K elements on Matlab 7.7 (R2008b) and newer.
zeros(n,m)+scalar is consistently faster than ones(n,m)*scalar, especially for more than 100K elements, and for the same reason.
In some cases, it may be worth to use a built-in function with more elements than actually necessary, just to benefit from its internal multi-threading.
Never take performance assumptions for granted. Always test on your specific system using a representative data-set.

p.s. – readers who are interested in the historical evolution of the zeros function are referred to Loren Shure’s latest post, only a few days ago (a fortunate coincidence indeed). Unfortunately, Loren’s post does not illuminate the mysteries above.

The post Allocation performance take 2 appeared first on Undocumented Matlab.

File deletion memory leaks, performance

Yair Altman — Wed, 05 Sep 2012 18:00:32 +0000

Last week I wrote about Matlab’s built-in pause function, that not only leaks memory but also appears to be less accurate than the equivalent Java function. Today I write about a very similar case. Apparently, using Matlab’s delete function not only leaks memory but is also slower than the equivalent Java function.

Memory leak

The memory leak in delete was (to the best of my knowledge) originally reported in the CSSM newsgroup and on this blog a few weeks ago. The reporter mentioned that after deleting 760K files using delete, he got a Java Heap Space out-of-memory error. The reported solution was to use the Java equivalent, java.io.File(filename).delete(), which does not leak anything.
I was able to recreate the report on my WinXP R2012a system, and discovered what appears to be a memory leak of ~150 bytes per file. This appears to be a very small number, but multiply by 760K (=111MB) and you can understand the problem. Of course, you can always increase the size of the Java heap used by Matlab (here’s how), but this should only be used as a last resort and certainly not when the solution is so simple.

For those interested, here’s the short test harness that I’ve used to test the memory leak:

function perfTest()
    rt = java.lang.Runtime.getRuntime;
    rt.gc();
    java.lang.Thread.sleep(1000);  % wait 1 sec to let the GC time to finish
    orig = rt.freeMemory;  % in bytes
    testSize = 50000;
    for idx = 1 : testSize
        % Create a temp file
        tn = [tempname '.tmp'];
        fid = fopen(tn,'wt');
        fclose(fid);
        % Delete the temp file
        delete(tn);
        %java.io.File(tn).delete();
    end
    rt.gc();
    java.lang.Thread.sleep(1000);  % wait 1 sec to let the GC time to finish
    free = rt.freeMemory;
    totalLeak = orig - free;
    leakPerCall = totalLeak / testSize
end

I placed it in a function to remove command-prompt-generated fluctuations, but it must still be run several times to smooth the data. The main reason for the changes across runs is the fact that the Java heap is constantly growing and shrinking in a seesaw manner, and explicitly calling the garbage collector as I have done does not guarantee that it actually gets performed immediately or fully. By running a large-enough loop, and rerunning the test several times, the results become consistent due to the law of large numbers.
Running the test above with the delete line commented and the java.io.File line uncommented, shows no discernible memory leak.
To monitor Matlab’s Java heap space size in runtime, see my article from several months ago, or use Elmar Tarajan’s memory-monitor utility from the File Exchange.
Note: there are numerous online resources about Java’s garbage collector. Here’s one interesting article that I have recently come across.

Performance

When running the test function using java.io.File, we notice a significant speedup compared to running using delete. The reason is that (at least on my system, YMMV) delete takes 1.5-2 milliseconds to run while java.io.File only takes 0.4-0.5 ms. Again, this doesn’t seem like much, but multiply by thousands of files and it starts to be appreciable. For our 50K test harness, the difference translates into ~50 seconds, or 40% of the overall time.
Since we’re dealing with file I/O, it is important to run the testing multiple times and within a function (not the Matlab Command Prompt), to get rid of spurious measurement artifacts.
Have you encountered any other Matlab function, where the equivalent in Java is better? If so, please add a comment below.

The post File deletion memory leaks, performance appeared first on Undocumented Matlab.

The Java import directive

Yair Altman — Wed, 13 Jun 2012 18:00:37 +0000

A recent blog post on a site I came across showed me that some users who are using Java in Matlab take the unnecessary precaution of always using the fully-qualified class-name (FQCN) of the Java classes, and are not familiar with the import directive in Matlab. Today I’ll show how to use import to simplify Java usage in Matlab.
Basically, the import function enables Matlab users to declare that a specific class name belongs to a particular Java namespace, without having to specifically state the full namespace in each use. In this regard, Matlab’s import closely mimics Java’s import, and not surprisingly also has similar syntax:

% Alternative 1 - using explicit namespaces
jFrame = javax.swing.JFrame;
jDim = java.awt.Dimension(50,120);
jPanel.add(jButton, java.awt.GridBagConstraints(0, 0, 1, 1, 1.0, 1.0, ...
                    java.awt.GridBagConstraints.NORTHWEST, ...
                    java.awt.GridBagConstraints.NONE, ...
                    java.awt.Insets(6, 12, 6, 6), 1, 1));
% Alternative 2 - using import
import javax.swing.*
import java.awt.*
jFrame = JFrame;
jDim = Dimension(50,120);
jPanel.add(jButton, GridBagConstraints(0, 0, 1, 1, 1.0, 1.0, ...
                    GridBagConstraints.NORTHWEST, ...
                    GridBagConstraints.NONE, ...
                    Insets(6, 12, 6, 6), 1, 1));

Note how much cleaner Alternative #2 looks compared to Alternative #1. However, as with Java’s import, there is a tradeoff here: by removing the namespaces from the code, it could become confusing as to which namespace a particular object belongs. For example, by specifying java.awt.Insets, we immediately know that it’s an AWT insets object, rather than, say, a book’s insets. There is no clear-cut answer to this dilemma, and in fact there are many Java developers who prefer one way or the other. As in Java, the choice is yours to make also in Matlab.
Perhaps a good compromise, one which I often use, is to stay away from the import something.* format and directly specify the imported classes. In the example above, I would have written:

% Alternative 3 - using explicit import
import javax.swing.JFrame;
import java.awt.Dimension;
import java.awt.GridBagConstraints;
import java.awt.Insets;
jFrame = JFrame;
jDim = Dimension(50,120);
jPanel.add(jButton, GridBagConstraints(0, 0, 1, 1, 1.0, 1.0, ...
                    GridBagConstraints.NORTHWEST, ...
                    GridBagConstraints.NONE, ...
                    Insets(6, 12, 6, 6), 1, 1));

This alternative has the benefit that it is immediately clear that Insets belongs to the AWT package, without having to explicitly use the java.awt prefix everywhere. Obviously, if the list of imported classes becomes too large, we could always revert to the import java.awt.* format.
Interestingly, we can also use the functional form of import:

% Alternative #1
import java.lang.String;
% Alternative #2
import('java.lang.String');
% Alternative #3
classname = 'java.lang.String';
import(classname);

Using the third alternative format, that of dynamic import, enables us to decide in run-time(!) whether to use a class C from package PA or PB. This is a cool feature but must be used with care, since it could lead to very difficult-to-diagnose errors. For example, if the code later invokes a method that exists only in PA.C but not in PB.C. The correct way to do this would probably be to define a class hierarchy where PA.C and PB.C both inherit from the same superclass. But in some cases this is simply not feasible (for example, when you have 2 JARs from different vendors, which use the same classname) and dynamic importing can help.
It is possible to specify multiple input parameters to import in the same directive. However, note that Matlab 7.5 R2007b and older releases crash (at least on WinXP) when one of the imported parameters is any MathWorks-derived (com.mathworks...) package/class. This bug was fixed in Matlab 7.6 R2008a, but to support earlier releases simply separate such imports into different lines:

% This crashes Matlab 7.5 R2007b and earlier;  OK on Matlab 7.6 R2008a and later
import javax.swing.* com.mathworks.mwswing.*
% This is ok in all Matlab releases
import javax.swing.*
import com.mathworks.mwswing.*

import does NOT load the Java class into memory – it just declares its namespace for the JVM. This mechanism is sometimes called lazy loading (compare to the lazy copying mechanism that I described a couple of weeks ago). To force-load a class into memory, either use it directly (for example, by declaring an object of it, or by using one of its methods), or use a classloader to load it. The issue of JVM classloaders in Matlab is non-trivial (there are several non-identical alternatives), and will be covered in a future article.
A few additional notes:

Although not strictly mandatory, it is good practice to place all the import directives at the top of the function, for visibility and code maintainability reasons
There is no need to end the import declaration with a semicolon (;). It’s really a matter of style consistency. I usually omit it because I find that it is a bit intrusive when placed after a *
import by itself, without any input arguments (class/package names) returns the current list of imported classes/packages
Imported classes and packages can be un-imported using the clear import directive from the Command Window
It has been reported that in some cases using import in deployed (compiled) application fails – the solution is to use the FQCN in such cases

Note: This topic is covered and extended in Chapter 1 of my Matlab-Java programming book

The post The Java import directive appeared first on Undocumented Matlab.

Internal Matlab memory optimizations

Yair Altman — Wed, 30 May 2012 12:09:16 +0000

Yesterday I attended a seminar on developing trading strategies using Matlab. This is of interest to me because of my IB-Matlab product, and since many of my clients are traders in the financial sector. In the seminar, the issue of memory and performance naturally arose. It seemed to me that there was some confusion with regards to Matlab’s built-in memory optimizations. Since I discussed related topics in the past two weeks (preallocation performance, array resizing performance), these internal optimizations seemed a natural topic for today’s article.
The specific mechanisms I’ll describe today are Copy on Write (aka COW or Lazy Copying) and in-place data manipulations. Both mechanisms were already documented (for example, on Loren’s blog or on this blog). But apparently, they are still not well known. Understanting them could help Matlab users modify their code to improve performance and reduce memory consumption. So although this article is not entirely “undocumented”, I’ll give myself some slack today.

Copy on Write (COW, Lazy Copy)

Matlab implements an automatic copy-on-write (sometimes called copy-on-update or lazy copying) mechanism, which transparently allocates a temporary copy of the data only when it sees that the input data is modified. This improves run-time performance by delaying actual memory block allocation until absolutely necessary. COW has two variants: during regular variable copy operations, and when passing data as input parameters into a function:

1. Regular variable copies

When a variable is copied, as long as the data is not modified, both variables actually use the same shared memory block. The data is only copied onto a newly-allocated memory block when one of the variables is modified. The modified variable is assigned the newly-allocated block of memory, which is initialized with the values in the shared memory block before being updated:

data1 = magic(5000);  % 5Kx5K elements = 191 MB
data2 = data1;        % data1 & data2 share memory; no allocation done
data2(1,1) = 0;       % data2 allocated, copied and only then modified

If we profile our code using any of Matlab’s memory-profiling options, we will see that the copy operation data2=data1 takes negligible time to run and allocates no memory. On the other hand, the simple update operation data2(1,1)=0, which we could otherwise have assumed to take minimal time and memory, actually takes a relatively long time and allocates 191 MB of memory.

Copy-on-write effect monitored using the Profiler's -memory option

Copy-on-write effect monitored using Windows Process Explorer

We first see a memory spike (used during the computation of the magic square data), closely followed by a leveling off at 190.7MB above the baseline (this is due to allocation of data1). Copying data2=data1 has no discernible effect on either CPU or memory. Only when we set data2(1,1)=0 does the CPU return, in order to allocate the extra 190MB for data2. When we exit the test function, data1 and data2 are both deallocated, returning the Matlab process memory to its baseline level.
There are several lessons that we can draw from this simple example:
Firstly, creating copies of data does not necessarily or immediately impact memory and performance. Rather, it is the update of these copies which may be problematic. If we can modify our code to use more read-only data and less updated data copies, then we would improve performance. The Profiler report will show us exactly where in our code we have memory and CPU hotspots – these are the places we should consider optimizing.
Secondly, when we see such odd behavior in our Profiler reports (i.e., memory and/or CPU spikes that occur on seemingly innocent code lines), we should be aware of the copy-on-write mechanism, which could be the cause for the behavior.

2. Function input parameters

The copy-on-write mechanism behaves similarly for input parameters in functions: whenever a function is invoked (called) with input data, the memory allocated for this data is used up until the point that one of its copies is modified. At that point, the copies diverge: a new memory block is allocated, populated with data from the shared memory block, and assigned to the modified variable. Only then is the update done on the new memory block.

data1 = magic(5000);      % 5Kx5K elements = 191 MB
data2 = perfTest(data1);
function outData = perfTest(inData)
   outData = inData;   % inData & outData share memory; no allocation
   outData2(1,1) = 0;  % outData allocated, copied and then modified
end

Copy-on-write effect monitored using the Profiler's -memory option

One lesson that can be drawn from this is that whenever possible we should attempt to use functions that do not modify their input data. This is particularly true if the modified input data is very large. Read-only functions will be faster than functions that do even the simplest of data updates.
Another lesson is that perhaps counter intuitively, it does not make a difference from a performance standpoint to pass read-only data to functions as input parameters. We might think that passing large data objects around as function parameters will involve multiple memory allocations and deallocations of the data. In fact, it is only the data’s reference (or more precisely, its mxArray structure) which is being passed around and placed on the function’s call stack. Since this reference/structure is quite small in size, there are no real performance penalties. In fact, this only benefits code clarity and maintainability.
The only case where we may wish to use other means of passing data to functions is when a large data object needs to be updated. In such cases, the updated copy will be allocated to a new memory block with an associated performance cost.

In-place data manipulation

Matlab’s interpreter, at least in recent releases, has a very sophisticated algorithm for using in-place data manipulation (report). Modifying data in-place means that the original data block is modified, rather than creating a new block with the modified data, thus saving any memory allocations and deallocations.
For example, let us manipulate a simple 4Kx4K (122MB) numeric array:

>> m = magic(4000);   % 4Kx4K = 122MB
>> memory
Maximum possible array:            1022 MB (1.072e+09 bytes)
Memory available for all arrays:   1218 MB (1.278e+09 bytes)
Memory used by MATLAB:              709 MB (7.434e+08 bytes)
Physical Memory (RAM):             3002 MB (3.148e+09 bytes)
% In-place array data manipulation: no memory allocated
>> m = m * 0.5;
>> memory
Maximum possible array:            1022 MB (1.072e+09 bytes)
Memory available for all arrays:   1214 MB (1.273e+09 bytes)
Memory used by MATLAB:              709 MB (7.434e+08 bytes)
Physical Memory (RAM):             3002 MB (3.148e+09 bytes)
% New variable allocated, taking an extra 122MB of memory
>> m2 = m * 0.5;
>> memory
Maximum possible array:            1022 MB (1.072e+09 bytes)
Memory available for all arrays:   1092 MB (1.145e+09 bytes)
Memory used by MATLAB:              831 MB (8.714e+08 bytes)
Physical Memory (RAM):             3002 MB (3.148e+09 bytes)

The extra memory allocation of the not-in-place manipulation naturally translates into a performance loss:

% In-place data manipulation, no memory allocation
>> tic, m = m * 0.5; toc
Elapsed time is 0.056464 seconds.
% Regular data manipulation (122MB allocation) – 50% slower
>> clear m2; tic, m2 = m * 0.5; toc;
Elapsed time is 0.084770 seconds.

The difference may not seem large, but placed in a loop it could become significant indeed, and might be much more important if virtual memory swapping comes into play, or when Matlab’s memory space is exhausted (out-of-memory error).
Similarly, when returning data from a function, we should try to update the original data variable whenever possible, avoiding the need for allocation of a new variable:

% In-place data manipulation, no memory allocation
>> d=0:1e-7:1; tic, d = sin(d); toc
Elapsed time is 0.083397 seconds.
% Regular data manipulation (76MB allocation) – 50% slower
>> clear d2, d=0:1e-7:1; tic, d2 = sin(d); toc
Elapsed time is 0.121415 seconds.

Within the function itself we should ensure that we return the modified input variable, and not assign the output to a new variable, so that in-place optimization can also be applied within the function. The in-place optimization mechanism is smart enough to override Matlab’s default copy-on-write mechanism, which automatically allocates a new copy of the data when it sees that the input data is modified:

% Suggested practice: use in-place optimization within functions
function x = function1(x)
   x = someOperationOn(x);   % temporary variable x is NOT allocated
end
% Standard practice: prevents future use of in-place optimizations
function y = function2(x)
   y = someOperationOn(x);   % new temporary variable y is allocated
end

In order to benefit from in-place optimizations of function results, we must both use the same variable in the caller workspace (x = function1(x)) and also ensure that the called function is optimizable (e.g., function x = function1(x)) – if any of these two requirements is not met then in-place function-call optimization is not performed.
Also, for the in-place optimization to be active, we need to call the in-place function from within another function, not from a script or the Matlab Command Window.
A related performance trick is to use masks on the original data rather than temporary data copies. For example, suppose we wish to get the result of a function that acts on only a portion of some large data. If we create a temporary variable that holds the data subset and then process it, it would create an unnecessary copy of the original data:

% Original data
data = 0 : 1e-7 : 1;     % 10^7 elements, 76MB allocated
% Unnecessary copy of data into data2 (extra 8MB allocated)
data2 = data(data>0.1);  % 10^6 elements, 7.6MB allocated
results = sin(data2);    % another 10^6 elements, 7.6MB allocated
% Use of data masks obviates the need for temporary variable data2:
results = sin(data(data>0.1));  % no need for the data2 allocation

A note of caution: we should not invest undue efforts to use in-place data manipulation if the overall benefits would be negligible. It would only help if we have a real memory limitation issue and the data matrix is very large.
Matlab in-place optimization is a topic of continuous development. Code which is not in-place optimized today (for example, in-place manipulation on class object properties) may possibly be optimized in next year’s release. For this reason, it is important to write the code in a way that would facilitate the future optimization (for example, obj.x=2*obj.x rather than y=2*obj.x).
Some in-place optimizations were added to the JIT Accelerator as early as Matlab 6.5 R13, but Matlab 7.3 R2006b saw a major boost. As Matlab’s JIT Accelerator improves from release to release, we should expect in-place data manipulations to be automatically applied in an increasingly larger number of code cases.
In some older Matlab releases, and in some complex data manipulations where the JIT Accelerator cannot implement in-place processing, a temporary storage is allocated that is assigned to the original variable when the computation is done. To implement in-place data manipulations in such cases we could develop an external function (e.g., using Mex) that directly works on the original data block. Note that the officially supported mex update method is to always create deep-copies of the data using mxDuplicateArray() and then modify the new array rather than the original; modifying the original data directly is both discouraged and not officially supported. Doing it incorrectly can easily crash Matlab. If you do directly overwrite the original input data, at least ensure that you unshare any variables that share the same data memory block, thus mimicking the copy-on-write mechanism.
Using Matlab’s internal in-place data manipulation is very useful, especially since it is done automatically without need for any major code changes on our part. But sometimes we need certainty of actually processing the original data variable without having to guess or check whether the automated in-place mechanism will be activated or not. This can be achieved using several alternatives:

Using global or persistent variable
Using a parent-scope variable within a nested function
Modifying a reference (handle class) object’s internal properties

The post Internal Matlab memory optimizations appeared first on Undocumented Matlab.

Array resizing performance

Yair Altman — Wed, 23 May 2012 20:43:03 +0000

As I have explained last week, the best way to avoid the performance penalties associated with dynamic array resizing (typically, growth) in Matlab is to pre-allocate the array to its expected final size. I have shown different alternatives for such preallocation, but in all cases the performance is much better than using a naïve dynamic resize.
Unfortunately, such simple preallocation is not always possible. Apparently, all is not lost. There are still a few things we can do to mitigate the performance pain. As in last week, there is much more here than meets the eye at first sight.
The interesting newsgroup thread from 2005 about this issue that I mentioned last week contains two main solutions to this problem. The effects of these solutions is negligible for small data sizes and/or loop iterations (i.e., number of memory reallocations), but could be dramatic for large data arrays and/or a large number of memory reallocations. The difference could well mean the difference between a usable and an unusable (“hang”) program:

Factor growth: dynamic allocation by chunks

The idea is to dynamically grow the array by a certain percentage factor each time. When the array first needs to grow by a single element, we would in fact grow it by a larger chunk (say 40% of the current array size, for example by using the repmat function, or by concatenating a specified number of zeros, or by setting some way-forward index to 0), so that it would take the program some time before it needs to reallocate memory.
This method has a theoretical cost of N·log(N), which is nearly linear in N for most practical purposes. It is similar to preallocation in the sense that we are preparing a chunk of memory for future array use in advance. You might say that this is on-the-fly preallocation.

Using cell arrays

The idea here is to use cell arrays to store and grow the data, then use cell2mat to convert the resulting cell array to a regular numeric array. Cell elements are implemented as references to distinct memory blocks, so concatenating an object to a cell array merely concatenates its reference; when a cell array is reallocated, only its internal references (not the referenced data) are moved. Note that this relies on the internal implementation of cell arrays in Matlab, and may possibly change in some future release.
Like factor growth, using cell arrays is faster than quadratic behavior (although not quite as fast enough as we would have liked, of course). Different situations may favor using either the cell arrays method or the factor growth mechanism.

The growdata utility

John D’Errico has posted a well-researched utility called growdata that optimizes dynamic array growth for maximal performance. It is based in part on ideas mentioned in the aforementioned 2005 newsgroup thread, where growdata is also discussed in detail.
As an interesting side-note, John D’Errico also recently posted an extremely fast implementation of the Fibonacci function. The source code may seem complex, but the resulting performance gain is well worth the extra complexity. I believe that readers who will read this utility’s source code and understand its underlying logic will gain insight into several performance tricks that could be very useful in general.

Effects of incremental JIT improvements

The introduction of JIT Acceleration in Matlab 6.5 (R13) caused a dramatic boost in performance (there is an internal distinction between the Accelerator and JIT: JIT is apparently only part of the Matlab Accelerator, but this distinction appears to have no practical impact on the discussion here).
Over the years, MathWorks has consistently improved the efficiency of its computational engine and the JIT Accelerator in particular. JIT was consistently improved since that release, giving a small improvement with each new Matlab release. In Matlab 7.11 (R2010b), the short Fibonacci snippet used in last week’s article showed executed about 30% faster compared to Matlab 7.1 R14SP3. The behavior was still quadratic in nature, and so in these releases, using any of the above-mentioned solutions could prove very beneficial.
In Matlab 7.12 (R2011a), a major improvement was done in the Matlab engine (JIT?). The execution run-times improved significantly, and in addition have become linear in nature. This means that multiplying the array size by N only degrades performance by N, not N² – an impressive achievement:

% This is ran on Matlab 7.14 (R2012a):
clear f, tic, f=[0,1]; for idx=3:10000, f(idx)=f(idx-1)+f(idx-2); end, toc
   => Elapsed time is 0.004924 seconds.  % baseline loop size, & exec time
clear f, tic, f=[0,1]; for idx=3:20000, f(idx)=f(idx-1)+f(idx-2); end, toc
   => Elapsed time is 0.009971 seconds.  % x2 loop size, x2 execution time
clear f, tic, f=[0,1]; for idx=3:40000, f(idx)=f(idx-1)+f(idx-2); end, toc
   => Elapsed time is 0.019282 seconds.  % x4 loop size, x4 execution time

In fact, it turns out that using either the cell arrays method or the factor growth mechanism is much slower in R2011a than using the naïve dynamic growth!
This teaches us a very important lesson: It is not wise to program against a specific implementation of the engine, at least not in the long run. While this may yield performance benefits on some Matlab releases, the situation may well be reversed on some future release. This might force us to retest, reprofile and potentially rewrite significant portions of code for each new release. Obviously this is not a maintainable solution. In practice, most code that is written on some old Matlab release would likely we carried over with minimal changes to the newer releases. If this code has release-specific tuning, we could be shooting ourselves in the leg in the long run.
MathWorks strongly advises (and again, and once again), and I concur, to program in a natural manner, rather than in a way that is tailored to a particular Matlab release (unless of course we can be certain that we shall only be using that release and none other). This will improve development time, maintainability and in the long run also performance.
(and of course you could say that a corollary lesson is to hurry up and get the latest Matlab release…)

Variants for array growth

If we are left with using a naïve dynamic resize, there are several equivalent alternatives for doing this, having significantly different performances:

% This is ran on Matlab 7.12 (R2011a):
% Variant #1: direct assignment into a specific out-of-bounds index
data=[]; tic, for idx=1:100000; data(idx)=1; end, toc
   => Elapsed time is 0.075440 seconds.
% Variant #2: direct assignment into an index just outside the bounds
data=[]; tic, for idx=1:100000; data(end+1)=1; end, toc
   => Elapsed time is 0.241466 seconds.    % 3 times slower
% Variant #3: concatenating a new value to the array
data=[]; tic, for idx=1:100000; data=[data,1]; end, toc
   => Elapsed time is 22.897688 seconds.   % 300 times slower!!!

As can be seen, it is much faster to directly index an out-of-bounds element as a means to force Matlab to enlarge a data array, rather than using the end+1 notation, which needs to recalculate the value of end each time.
In any case, try to avoid using the concatenation variant, which is significantly slower than either of the other two alternatives (300 times slower in the above example!). In this respect, there is no discernible difference between using the [] operator or the cat() function for the concatenation.
Apparently, the JIT performance boost gained in Matlab R2011a does not work for concatenation. Future JIT improvements may possibly also improve the performance of concatenations, but in the meantime it is better to use direct indexing instead.
The effect of the JIT performance boost is easily seen when we run the same variants on pre-R2011a Matlab releases. The corresponding values are 30.9, 34.8 and 34.3 seconds. Using direct indexing is still the fastest approach, but concatenation is now only 10% slower, not 300 times slower.
When we need to append a non-scalar element (for example, a 2D matrix) to the end of an array, we might think that we have no choice but to use the slow concatenation method. This assumption is incorrect: we can still use the much faster direct-indexing method, as shown below (notice the non-linear growth in execution time for the concatenation variant):

% This is ran on Matlab 7.12 (R2011a):
matrix = magic(3);
% Variant #1: direct assignment – fast and linear cost
data=[]; tic, for idx=1:10000; data(:,(idx*3-2):(idx*3))=matrix; end, toc
   => Elapsed time is 0.969262 seconds.
data=[]; tic, for idx=1:100000; data(:,(idx*3-2):(idx*3))=matrix; end, toc
   => Elapsed time is 9.558555 seconds.
% Variant #2: concatenation – much slower, quadratic cost
data=[]; tic, for idx=1:10000; data=[data,matrix]; end, toc
   => Elapsed time is 2.666223 seconds.
data=[]; tic, for idx=1:100000; data=[data,matrix]; end, toc
   => Elapsed time is 356.567582 seconds.

As the size of the array enlargement element (in this case, a 3×3 matrix) increases, the computer needs to allocate more memory space more frequently, thereby increasing execution time and the importance of preallocation. Even if the system has an internal memory-management mechanism that enables it to expand into adjacent (contiguous) empty memory space, as the size of the enlargement grows the empty space will run out sooner and a new larger memory block will need to be allocated more frequently than in the case of small incremental enlargements of a single 8-byte double.

Other alternatives

If preallocation is not possible, JIT is not very helpful, vectorization is out of the question, and rewriting the problem so that it doesn’t need dynamic array growth is impossible – if all these are not an option, then consider using one of the following alternatives for array growth (read again the interesting newsgroup thread from 2005 about this issue):

Dynamically grow the array by a certain percentage factor each time the array runs out of space (on-the-fly preallocation)
Use John D’Errico’s growdata utility
Use cell arrays to store and grow the data, then use cell2mat to convert the resulting cell array to a regular numeric array
Reuse an existing data array that has the necessary storage space
Wrap the data in a referential object (a class object that inherits from handle), then append the reference handle rather than the original data (ref). Note that if your class object does not inherit from handle, it is not a referential object but rather a value object, and as such it will be appended in its entirety to the array data, losing any performance benefits. Of course, it may not always be possible to wrap our class objects as a handle.
References have a much small memory footprint than the objects that they reference. The objects themselves will remain somewhere in memory and will not need to be moved whenever the data array is enlarged and reallocated – only the small-footprint reference will be moved, which is much faster. This is also the reason that cell concatenation is faster than array concatenations for large objects.

The post Array resizing performance appeared first on Undocumented Matlab.

Preallocation performance

Yair Altman — Wed, 16 May 2012 12:14:46 +0000

Array preallocation is a standard and quite well-known technique for improving Matlab loop run-time performance. Today’s article will show that there is more than meets the eye for even such a simple coding technique.
A note of caution: in the examples that follow, don’t take any speedup as an expected actual value – the actual value may well be different on your system. Your mileage may vary. I only mean to display the relative differences between different alternatives.

The underlying problem

Memory management has a direct influence on performance. I have already shown some examples of this in past articles here.
Preallocation solves a basic problem in simple program loops, where an array is iteratively enlarged with new data (dynamic array growth). Unlike other programming languages (such as C, C++, C# or Java) that use static typing, Matlab uses dynamic typing. This means that it is natural and easy to modify array size dynamically during program execution. For example:

fibonacci = [0, 1];
for idx = 3 : 100
   fibonacci(idx) = fibonacci(idx-1) + fibonacci(idx-2);
end

While this may be simple to program, it is not wise with regards to performance. The reason is that whenever an array is resized (typically enlarged), Matlab allocates an entirely new contiguous block of memory for the array, copying the old values from the previous block to the new, then releasing the old block for potential reuse. This operation takes time to execute. In some cases, this reallocation might require accessing virtual memory and page swaps, which would take an even longer time to execute. If the operation is done in a loop, then performance could quickly drop off a cliff.
The cost of such naïve array growth is theoretically quadratic. This means that multiplying the number of elements by N multiplies the execution time by about N². The reason for this is that Matlab needs to reallocate N times more than before, and each time takes N times longer due to the larger allocation size (the average block size multiplies by N), and N times more data elements to copy from the old to the new memory blocks.
A very interesting discussion of this phenomenon and various solutions can be found in a newsgroup thread from 2005. Three main solutions were presented: preallocation, selective dynamic growth (allocating headroom) and using cell arrays. The best solution among these in terms of ease of use and performance is preallocation.

The basics of pre-allocation

The basic idea of preallocation is to create a data array in the final expected size before actually starting the processing loop. This saves any reallocations within the loop, since all the data array elements are already available and can be accessed. This solution is useful when the final size is known in advance, as the following snippet illustrates:

% Regular dynamic array growth:
tic
fibonacci = [0,1];
for idx = 3 : 40000
   fibonacci(idx) = fibonacci(idx-1) + fibonacci(idx-2);
end
toc
   => Elapsed time is 0.019954 seconds.
% Now use preallocation – 5 times faster than dynamic array growth:
tic
fibonacci = zeros(40000,1);
fibonacci(1)=0; fibonacci(2)=1;
for idx = 3 : 40000,
   fibonacci(idx) = fibonacci(idx-1) + fibonacci(idx-2);
end
toc
   => Elapsed time is 0.004132 seconds.

On pre-R2011a releases the effect of preallocation is even more pronounced: I got a 35-times speedup on the same machine using Matlab 7.1 (R14 SP3). R2011a (Matlab 7.12) had a dramatic performance boost for such cases in the internal accelerator, so newer releases are much faster in dynamic allocations, but preallocation is still 5 times faster even on R2011a.

Non-deterministic pre-allocation

Because the effect of preallocation is so dramatic on all Matlab releases, it makes sense to utilize it even in cases where the data array’s final size is not known in advance. We can do this by estimating an upper bound to the array’s size, preallocate this large size, and when we’re done remove any excess elements:

% The final array size is unknown – assume 1Kx3K upper bound (~23MB)
data = zeros(1000,3000);  % estimated maximal size
numRows = 0;
numCols = 0;
while (someCondition)
   colIdx = someValue1;   numCols = max(numCols,colIdx);
   rowIdx = someValue2;   numRows = max(numRows,rowIdx);
   data(rowIdx,colIdx) = someOtherValue;
end
% Now remove any excess elements
data(:,numCols+1:end) = [];   % remove excess columns
data(numRows+1:end,:) = [];   % remove excess rows

Variants for pre-allocation

It turns out that MathWorks’ official suggestion for preallocation, namely using the zeros function, is not the most efficient:

% MathWorks suggested variant
clear data1, tic, data1 = zeros(1000,3000); toc
   => Elapsed time is 0.016907 seconds.
% A much faster alternative - 500 times faster!
clear data1, tic, data1(1000,3000) = 0; toc
   => Elapsed time is 0.000034 seconds.

The reason for the second variant being so much faster is because it only allocates the memory, without worrying about the internal values (they get a default of 0, false or ”, in case you wondered). On the other hand, zeros has to place a value in each of the allocated locations, which takes precious time.
In most cases the differences are immaterial since the preallocation code would only run once in the program, and an extra 17ms isn’t such a big deal. But in some cases we may have a need to periodically refresh our data, where the extra run-time could quickly accumulate.
Update (October 27, 2015): As Marshall notes below, this behavior changed in R2015b when the new LXE (Matlab’s new execution engine) replaced the previous engine. In R2015b, the zeros function is faster than the alternative of just setting the last array element to 0. Similar changes may also have occurred to the following post content, so if you are using R2015b onward, be sure to test carefully on your specific system.

Pre-allocating non-default values

When we need to preallocate a specific value into every data array element, we cannot use Variant #2. The reason is that Variant #2 only sets the very last data element, and all other array elements get assigned the default value (0, ‘’ or false, depending on the array’s data type). In this case, we can use one of the following alternatives (with their associated timings for a 1000×3000 data array):

scalar = pi;  % for example...
data = scalar(ones(1000,3000));           % Variant A: 87.680 msecs
data(1:1000,1:3000) = scalar;             % Variant B: 28.646 msecs
data = repmat(scalar,1000,3000);          % Variant C: 17.250 msecs
data = scalar + zeros(1000,3000);         % Variant D: 17.168 msecs
data(1000,3000) = 0; data = data+scalar;  % Variant E: 16.334 msecs

As can be seen, Variants C-E are about twice as fast as Variant B, and 5 times faster than Variant A.

Pre-allocating non-double data

7.4.5 Preallocating non-double data
When preallocating an array of a type that is not double, we should be careful to create it using the desired type, to prevent memory and/or performance inefficiencies. For example, if we need to process a large array of small integers (int8), it would be inefficient to preallocate an array of doubles and type-convert to/from int8 within every loop iteration. Similarly, it would be inefficient to preallocate the array as a double type and then convert it to int8. Instead, we should create the array as an int8 array in the first place:

% Bad idea: allocates 8MB double array, then converts to 1MB int8 array
data = int8(zeros(1000,1000));   % 1M elements
   => Elapsed time is 0.008170 seconds.
% Better: directly allocate the array as a 1MB int8 array – x80 faster
data = zeros(1000,1000,'int8');
   => Elapsed time is 0.000095 seconds.

Pre-allocating cell arrays

To preallocate a cell-array we can use the cell function (explicit preallocation), or the maximal cell index (implicit preallocation). Explicit preallocation is faster than implicit preallocation, but functionally equivalent (Note: this is contrary to the experience with allocation of numeric arrays and other arrays):

% Variant #1: Explicit preallocation of a 1Kx3K cell array
data = cell(1000,3000);
   => Elapsed time is 0.004637 seconds.
% Variant #2: Implicit preallocation – x3 slower than explicit
clear('data'), data{1000,3000} = [];
   => Elapsed time is 0.012873 seconds.

Pre-allocating arrays of structs

To preallocate an array of structs or class objects, we can use the repmat function to replicate copies of a single data element (explicit preallocation), or just use the maximal data index (implicit preallocation). In this case, unlike the case of cell arrays, implicit preallocation is much faster than explicit preallocation, since the single element does not actually need to be copied multiple times (ref):

% Variant #1: Explicit preallocation of a 100x300 struct array
element = struct('field1',magic(2), 'field2',{[]});
data = repmat(element, 100, 300);
   => Elapsed time is 0.002804 seconds.
% Variant #2: Implicit preallocation – x7 faster than explicit
element = struct('field1',magic(2), 'field2',{[]});
clear('data'), data(100,300) = element;
   => Elapsed time is 0.000429 seconds.

When preallocating structs, we can also use a third variant, using the built-in struct feature of replicating the struct when the struct function is passed a cell array. For example, struct('field1',cell(100,1), 'field2',5) will create 100 structs, each of them having the empty field field1 and another field called field2 with value 5. Unfortunately, this variant is slower than both of the previous variants.

Pre-allocating class objects

When preallocating in general, ensure that you are using the maximal expected array size. There is no point in preallocating an empty array or an array having a smaller size than the expected maximum, since dynamic memory reallocation will automatically kick-in within the processing-loop. For this reason, do not use the empty() method of class objects to preallocate, but rather repmat as explained above.
When using repmat to replicate class objects, always be careful to note whether you are replicating the object itself (this happens if your class does NOT derive from handle) or its reference handle (which happens if you derive the class from handle). If you are replicating objects, then you can safely edit any of their properties independently of each other; but if you replicate references, you are merely using multiple copies of the same reference, so that modifying referenced object #1 will also automatically affect all the other referenced objects. This may or may not be suitable for your particular program requirements, so be careful to check carefully. If you actually need to use independent object copies, you will need to call the class constructor multiple times, once for each new independent object.

Next week: what if we can’t avoid dynamic array resizing? – apparently, all is not lost. Stay tuned…

Do you have any similar allocation-related tricks you’re using? or unexpected differences such as the ones shown above? If so, then please do post a comment.

The post Preallocation performance appeared first on Undocumented Matlab.

Memory – Undocumented Matlab

Blocked wait with timeout for asynchronous events

The basic idea

Using a stand-alone generic signaling object

Related posts:

Quirks with parfor vs. for

Data serialization quirks

Data precision quirks

Upcoming travels - Zürich & Geneva

Related posts:

General-use object copy

Related posts:

New book: Accelerating MATLAB Performance

Related posts:

Allocation performance take 2

Related posts:

File deletion memory leaks, performance

Memory leak

Performance

Related posts:

The Java import directive

Related posts:

Internal Matlab memory optimizations

Copy on Write (COW, Lazy Copy)

1. Regular variable copies

2. Function input parameters

In-place data manipulation

Related posts:

Array resizing performance

Factor growth: dynamic allocation by chunks

Using cell arrays

The growdata utility

Effects of incremental JIT improvements

Variants for array growth

Other alternatives

Related posts:

Preallocation performance

The underlying problem

The basics of pre-allocation

Non-deterministic pre-allocation

Variants for pre-allocation

Pre-allocating non-default values

Pre-allocating non-double data

Pre-allocating cell arrays

Pre-allocating arrays of structs

Pre-allocating class objects

Related posts: