A few days ago, MathWorks uploaded a video recording of my recent keynote presentation at the Matlab Expo 2016 in Munich, Germany. During the presentation, I skimmed over a few tips for improving performance of parallel-processing (parfor) loops. In today’s post I plan to expand on these tips, as well as provide a few others that for lack of space and time I did not mention in the presentation.
The overall effect can be dramatic: The performance (speed) difference between a sub-optimal and optimized parfor‘ed code can be up to a full order of magnitude, depending on the specific situation. Naturally, to use any of today’s tips, you need to have MathWorks’ Parallel Computing Toolbox (PCT).
Before diving into the technical details, let me say that MathWorks has extensive documentation on PCT. In today’s post I will try not to reiterate the official tips, but rather those that I have not found mentioned elsewhere, and/or are not well-known (my apologies in advance if I missed an official mention of one or more of the following). Furthermore, I limit myself only to parfor in this post: much can be said about spmd, GPU and other parallel constructs, but not today.
parpool(numCores)
The first tip is to not [always] use the default number of workers created by parpool (or matlabpool in R2013a or earlier). By default, Matlab creates as many workers as logical CPU cores. On Intel CPUs, the OS reports two logical cores per each physical core due to hyper-threading, for a total of 4 workers on a dual-core machine. However, in many situations, hyperthreading does not improve the performance of a program and may even degrade it (I deliberately wish to avoid the heated debate over this: you can find endless discussions about it online and decide for yourself). Coupled with the non-negligible overhead of starting, coordinating and communicating with twice as many Matlab instances (workers are headless [=GUI-less] Matlab processes after all), we reach a conclusion that it may actually be better in many cases to use only as many workers as physical (not logical) cores.
I know the documentation and configuration panel seem to imply that parpool uses the number of physical cores by default, but in my tests I have seen otherwise (namely, logical cores). Maybe this is system-dependent, and maybe there is a switch somewhere that controls this, I don’t know. I just know that in many cases I found it beneficial to reduce the number of workers to the actual number of physical cores:
p = parpool; % use as many workers as logical CPUs (4 on my poor laptop...) p = parpool(2); % use only 2 parallel workers |
Of course, this can vary greatly across programs and platforms, so you should test carefully on your specific setup. I suspect that for the majority of Matlab programs it would turn out that using the number of physical cores is better.
It would of course be better to dynamically retrieve the number of physical cores, rather than hard-coding a constant value (number of workers) into our program. We can get this value in Matlab using the undocumented feature(‘numcores’) function:
numCores = feature('numcores'); p = parpool(numCores); |
Running feature(‘numcores’) without assigning its output displays some general debugging information:
>> feature('numcores')
MATLAB detected: 2 physical cores.
MATLAB detected: 4 logical cores.
MATLAB was assigned: 4 logical cores by the OS.
MATLAB is using: 2 logical cores.
MATLAB is not using all logical cores because hyper-threading is enabled.
ans =
2
Naturally, this specific tip is equally valid for both parfor loops and spmd blocks, since both of them use the pool of workers started by parpool.
Running separate code in parfor loops
The conventional wisdom is that parfor loops (and loops in general) can only run a single code segment over all its iterations. Of course, we can always use conditional constructs (such as if or switch) based on the data. But what if we wanted some workers to run a different code path than the other workers? In spmd blocks we could use a conditional based on the labindex value, but unfortunately labindex is always set to the same value 1 within parfor loops. So how can we let worker A run a different code path than worker B?
An obvious answer is to create a parfor loop having as many elements as there are separate code paths, and use a switch-case mechanism to run the separate paths, as follows:
% Naive implementation example - do NOT use! parfor idx = 1 : 3 switch idx case 1, result{1} = fun1(data1, data2); case 2, result{2} = fun2(data3, data4, data5); case 3, result{3} = fun3(data6); end end |
There are several problems with this naive implementation. First, it unnecessarily broadcasts all the input data to all workers (more about this issue below). Secondly, it appears clunky and too verbose. A very nice extension of this mechanism, posted by StackOverflow heavyweight Jonas, uses indexed arrays of function handles and input args, thereby solving both problems:
funcList = {@fun1, @fun2, @fun3}; dataList = {data1, data2, data3}; %# or pass file names parfor idx = 1 : length(funcList) result{idx} = funcList{idx}(dataList{idx}); end |
Reduce the amount of broadcast data
It is often easy, too-easy, to convert for loops into parfor loops. In many cases, all we need to do is to add the “par” prefix to the for keyword and we’re done (assuming we have no incompatibly-used variables that should be converted into sliced variables etc.). This transformation was intentionally made simple by MathWorks (which is great!). On the other hand, it also hides a lot under the hood. One of the things that is often overlooked in such simple loop transformations is that a large part of the data used within the loop needs to be copied (broadcast) to each of the workers separately. This means that each of the data items needs to be serialized (i.e., copied in memory), packaged, communicated to and accepted by each of the workers. This can mean a lot of memory, networking bandwidth and time-consuming. It can even mean thrashing to hard-disk in case the number of workers times the amount of transferred data exceeds the available RAM. For example, if we have 10GB available RAM and try to communicate 3GB to 4 workers, we will not have enough RAM and the OS will start swapping to hard-disk. This will kill performance and Matlab will appear “hung” and will need to be hard-killed.
You might think that it would be very difficult to reach the RAM limit, but in fact it can be far too easy when you consider the multiplication by the number of workers, and the fact that each worker uses 1+GB of memory just for its MATLAB process, even before the data, and all this in addition to the parent (client) Matlab process. That’s a lot of GBs flying around…
Moreover, it’s enough for one small part of a Matlab struct or array to be used within the parfor loop for the entire Matlab construct to be broadcast to all workers. For example, a very common use-case is to store program data, both raw and processed, within a simple Matlab struct. Let’s say that we have data.raw
and data.processed
and within the loop we only need data.processed
– the entire data
variable (which might include many GBs due to the raw data) is broadcast, although the loop’s code only needs data.processed
. In such cases, it makes sense to separate the broadcast data into standalone variables, and only use them within the loop:
data.raw = ... data.processed = ... % Inefficient variant: parfor idx = 1 : N % do something with data.processed end % This is better: processedData = data.processed; parfor idx = 1 : N % do something with processedData end |
Moreover, if you can convert a broadcast variable into a sliced one, this would be even better: in this case each worker will only be communicated its small share (“slice”) of the entire data, rather than a full copy of the entire data.
All this would of course be much simpler if Matlab’s computational engine was multi-threaded, since then PCT could be implemented using lightweight threads rather than heavyweight processes. The memory and communication overheads would then be drastically reduced and performance would improve significantly. Unfortunately, Matlab’s computational engine is [still] single-threaded, preventing this. Hopefully Matlab’s new engine (which debuted in R2015b) will enable true multithreading at some future release. PCT will still need to retain an option of using headless worker processes to run on multiple machines (i.e., distributed/grid/cloud computing), but single-machine parallelization should employ multithreading instead.
Additional speedup tips can be found in my book “Accelerating MATLAB Performance“.
Do you have some other important parfor tips that you found useful? If so, please post them in a comment below.
Yair, I’m pretty sure that if you find a case where parpool is using the number of logical (rather than physical) cores as its default number of workers, that’s a bug. As you mentioned in your section on hyperthreading, there are good reasons why it should be the number of physical cores.
I’ve always felt that the documentation could be clearer on that area: perhaps even providing a little tutorial on what hyperthreading is, the difference between logical and physical cores, and why it’s very likely that the number of physical cores is the relevant number.
I know I’ve answered quite a few questions on StackOverflow that were based on misunderstandings related to these topics. I’ve also worked with customers who were confused when their machine appeared to working at full tilt, but Task Manager showed each core at half-potential, because they had hyperthreading switched on.
I think the example above, where you want to run multiple functions, might be better fulfilled by parfeval.
My experience is that only having workers corresponding to the physical cores does not cause the processor (several i7 and a pair of Xeon e5-2630v2) to clock up to full speed. Doubling to the amount of logical cores, however, causes the processor to increase its clock speed to the highest. It may very well be a bad performance metric, but I haven’t looked in to the actual performance.
Yair, can you elaborate a bit on less intuitive notion that the best performance is achieved with lower than maximum clock frequency?
I have used Intel’s performance counter monitor on Linux to see what’s going on:
https://software.intel.com/en-us/articles/intel-performance-counter-monitor
I meant to say “Doubling to the amount workers to the number of logical cores”. 🙂
@Jes, perhaps a possible explanation is that by using logical rather than physical cores, the cores utilize hyperthreading and perhaps in your specific case this has a greater advantage than the drawbacks of the extra overheads of parallelization management, OS context-switching, and extra memory. Using extra parallel processes could also help when the processes have a significant I/O portion, since during the I/O wait the CPU is idle and can therefore be used by the non-I/O portions of other parallel processes. In short, this is highly system-, program- and data-dependent. In the general case I found that limiting the number of processes to the number of physical cores is better than the default limit to the number of logical cores, but in specific cases it may well be different. This is easy to test on your specific system/program.
If you have heterogenous workloads, i.e. some of the parfor tasks are quick and some are slow, and they are clustered together, e.g. the slow ones are at the start, try putting your work into an array, then randomizing the array:
If you have *very* heterogeneous workloads, such that some of the parfor elements will take a long time, and some will be ‘0 work’, it may be more efficient to compute which elements have ‘0 work’ beforehand and remove these elements from the work. Otherwise the very heterogeneous workloads won’t balance well across the cores, and you may well be left with a few cores running long running tasks, while the others have completed.