Quirks with parfor vs. for

January 5, 2017 7 Comments

A few months ago, I discussed several tips regarding Matlab’s parfor command, which is used by the Parallel Computing Toolbox (PCT) for parallelizing loops. Today I wish to extend that post with some unexplained oddities when using parfor, compared to a standard for loop.

Data serialization quirks

Dimitri Shvorob may not appear at first glance to be a prolific contributor on Matlab Central, but from the little he has posted over the years I regard him to be a Matlab power-user. So when Dimitri reports something, I take it seriously. Such was the case several months ago, when he contacted me regarding very odd behavior that he saw in his code: the for loop worked well, but the parfor version returned different (incorrect) results. Eventually, Dimitry traced the problem to something originally reported by Dan Austin on his Fluffy Nuke It blog.
The core issue is that if we have a class object that is used within a for loop, Matlab can access the object directly in memory. But with a parfor loop, the object needs to be serialized in order to be sent over to the parallel workers, and deserialized within each worker. If this serialization/deserialization process involves internal class methods, the workers might see a different version of the class object than the one seen in the serial for loop. This could happen, for example, if the serialization/deserialization method croaks on an error, or depends on some dynamic (or random) conditions to create data.
In other words, when we use data objects in a parfor loop, the data object is not necessarily sent “as-is”: additional processing may be involved under the hood that modify the data in a way that may be invisible to the user (or the loop code), resulting in different processing results of the parallel (parfor) vs. serial (for) loops.
For additional aspects of Matlab serialization/deserialization, see my article from 2 years ago (and its interesting feedback comments).

Data precision quirks

The following section was contributed by guest blogger Lior Perlmuter-Shoshany, head algorithmician at a private equity fund.
In my work, I had to work with matrixes in the order of 10⁹ cells. To reduce the memory footprint (and hopefully also improve performance), I decided to work with data of type single instead of Matlab’s default double. Furthermore, in order to speed up the calculation I use parfor rather than for in the main calculation. In the end of the run I am running a mini for-loop to see the best results.
What I discovered to my surprise is that the results from the parfor and for loop variants is not the same!

The following simplified code snippet illustrate the problem by calculating a simple standard-deviation (std) over the same data, in both single– and double-precision. Note that the loops are ran with only a single iteration, to illustrate the fact that the problem is with the parallelization mechanism (probably the serialization/deserialization parts once again), not with the distribution of iterations among the workers.

clear
rng('shuffle','twister');
% Prepare the data in both double and single precision
arr_double = rand(1,100000000);
arr_single = single(arr_double);
% No loop - direct computation
std_single0 = std(arr_single);
std_double0 = std(arr_double);
% Loop #1 - serial for loop
std_single = 0;
std_double = 0;
for i=1
    std_single(i) = std(arr_single);
    std_double(i) = std(arr_double);
end
% Loop #2 - parallel parfor loop
par_std_single = 0;
par_std_double = 0;
parfor i=1
    par_std_single(i) = std(arr_single);
    par_std_double(i) = std(arr_double);
end
% Compare results of for loop vs. non-looped computation
isForSingleOk = isequal(std_single, std_single0)
isForDoubleOk = isequal(std_double, std_double0)
% Compare results of single-precision data (for vs. parfor)
isParforSingleOk = isequal(std_single, par_std_single)
parforSingleAccuracy = std_single / par_std_single
% Compare results of double-precision data (for vs. parfor)
isParforDoubleOk = isequal(std_double, par_std_double)
parforDoubleAccuracy = std_double / par_std_double

Output example :

isForSingleOk =
    1                   % <= true (of course!)
isForDoubleOk =
    1                   % <= true (of course!)
isParforSingleOk =
    0                   % <= false (odd!)
parforSingleAccuracy =
    0.73895227413361    % <= single-precision results are radically different in parfor vs. for
isParforDoubleOk =
    0                   % <= false (odd!)
parforDoubleAccuracy =
    1.00000000000021    % <= double-precision results are almost [but not exactly] the same in parfor vs. for

From my testing, the larger the data array, the bigger the difference is between the results of single-precision data when running in for vs. parfor.
In other words, my experience has been that if you have a huge data matrix, it’s better to parallelize it in double-precision if you wish to get [nearly] accurate results. But even so, I find it deeply disconcerting that the results are not exactly identical (at least on R2015a-R2016b on which I tested) even for the native double-precision .
Hmmm… bug?

Upcoming travels – Zürich & Geneva

I will shortly be traveling to clients in Zürich and Geneva, Switzerland. If you are in the area and wish to meet me to discuss how I could bring value to your work with some advanced Matlab consulting or training, then please email me (altmany at gmail):

Zürich: January 15-17
Geneva: January 18-21

Happy new year everybody!

7 Responses

zed January 6, 2017 at 17:07 Reply
I had to deal with the parfor over object arrays ‘challenge’; when processing by invoking a method that operates on internal properties it doesn’t bother returning the properties. I think this is a bad bug, and a terrible oversight when they added both the new object concept and the parallel toolbox. I add an extra property called “returnVal” to the class to touch, then it does reliably return all of the properties.
parfor ii=1:10; q(ii).process; q(ii).returnVal=ii; end
parfor ii=1:10; q(ii).process; q(ii).returnVal=ii; end
I end up debugging this over and over though.

–zed.
James January 6, 2017 at 17:10 Reply

Try again with -singleCompThread

Yair Altman January 7, 2017 at 17:58 Reply

@James – the entire point of using parfor is not to use a single computation thread but rather run in parallel over multiple workers. The code in the article was simply illustrating this point using a single loop iteration, but the problem itself is also evident in multi-workers processing multi-iteration loops. In all cases one would reasonably expect Matlab to return numerically accurate results.

Gabriele January 9, 2017 at 15:16 Reply

Thanks for the very useful post!

Perhaps I’m wrong, but it seems to be something related to intrinsic parallelization combined with single accuracy. In fact, as a basis, the main console is multi-thread. On the other hand workers are started, as a basis, using a single thread. Changing from multi-thread to single-thread computations in the console, I can reproduce the issue you have identified with parfor.

This example reproduces the issue from within the console:

clear
rng('shuffle','twister');
 
NTO = maxNumCompThreads; %number of threads in console - it is assumed that NTO>1
 
arr_double = rand(1,10000000*5);
arr_single = single(arr_double);
 
std_double_MT = std(arr_double);
std_single_MT = std(arr_single);
 
%go to single thread:
maxNumCompThreads(1);
 
std_double_ST = std(arr_double);
std_single_ST = std(arr_single);
 
%go back to original number of threads:
maxNumCompThreads(NTO);
 
%now use a parfor:
std_double_PF = NaN;
std_single_PF = NaN;
parfor i=1
    std_double_PF(i) = std(arr_double);
    std_single_PF(i) = std(arr_single);
end;
 
fprintf('nnResults from console with intrinsic parallelization (multi-thread):');
fprintf('nstd_double_MT=%f',std_double_MT);
fprintf('nstd_single_MT=%f',std_single_MT);
 
fprintf('nnResults from console with single thread:');
fprintf('nstd_double_ST=%f',std_double_ST);
fprintf('nstd_single_ST=%f',std_single_ST);
 
fprintf('nnResults from parfor:');
fprintf('nstd_double_PF=%f',std_double_PF);
fprintf('nstd_single_PF=%f',std_single_PF);
 
fprintf('nnAnalytical result:');
fprintf('n       std_an=%f',sqrt(1/12));

———————————————————

Outcome in my case (I have NTO=4):

Results from console with intrinsic parallelization (multi-thread):
std_double_MT=0.288674
std_single_MT=0.286997
 
Results from console with single thread:
std_double_ST=0.288674
std_single_ST=0.309665
 
Results from parfor:
std_double_PF=0.288674
std_single_PF=0.309665
 
Analytical result:
       std_an=0.288675

Gabriele January 11, 2017 at 11:12 Reply

Please, let me come back on this interesting matter.

After some checks, I think it could be a matter associated with loss of significance in case of calculations (like std, mean, sum, etc.) which involve summing up relatively small numbers (here we are randomly generating from U(0,1)) to (partial) summations which are large with respect to the available precision.

In case of multi-thread computations, since the summations are (likely) carried out in separate pieces, the effect is seen “later” in terms of dimension of the array under analysis. Instead, in single-thread calculations this is seen “before”.

I have prepared a script + a function which might be useful to show this effect (I used R2016a – change the first switch to change the test function):

— Script —

clear
rng('shuffle','twister');
 
switch 1, %change the function here
    case 1, %STD
        f=@(x) std(x); %function to be evaluated
        fname='std(x(1:i))';
    case 2, %MEAN
        f=@(x) mean(x); %function to be evaluated
        fname='mean(x(1:i))';
    case 3, %SUM
        f=@(x) sum(x); %function to be evaluated
        fname='sum(x(1:i))';        
end
 
NTO = maxNumCompThreads; %number of threads in console - it is assumed that NTO>1
 
Nd = 5*10^7; %Number of data - At least abt 4*10^7 is needed to see the problem in my case. Be very careful with RAM limitations (try smaller values first, and then increase progressively)! 
arr_double = rand(1,Nd);
arr_single = single(arr_double);
ind = round(linspace(100,Nd,100));
 
f_double_MT = f_test_progressive(f,ind,arr_double);
f_single_MT = f_test_progressive(f,ind,arr_single);
 
%go to single thread:
maxNumCompThreads(1);
 
f_double_ST = f_test_progressive(f,ind,arr_double);
f_single_ST = f_test_progressive(f,ind,arr_single);
 
%go back to original number of threads:
maxNumCompThreads(NTO);
 
%now use a parfor:
[f_double_PF, f_single_PF] = deal(cell(1));
parfor i = 1
    f_double_PF{i} = f_test_progressive(f,ind,arr_double);
    f_single_PF{i} = f_test_progressive(f,ind,arr_single);    
end
f_double_PF = f_double_PF{1};
f_single_PF = f_single_PF{1};
 
figure;
hold all;
plot(ind,f_double_MT,'-o', 'Displayname','Double-MT');
plot(ind,f_single_MT,'--o','Displayname','Single-MT');
plot(ind,f_double_ST,'-s', 'Displayname','Double-ST');
plot(ind,f_single_ST,'--s','Displayname','Single-ST');
plot(ind,f_double_PF,'-d', 'Displayname','Double-PF');
plot(ind,f_single_PF,'--d','Displayname','Single-PF');
xlabel('Progressive index - i');
ylabel(fname);
legend('Location','Best');

— End script —

— Support function —

function out = f_test_progressive(fh,ind,x)
   ind = unique(ind);
   Ntest = numel(ind);
 
   switch class(x),
      case 'double',  out = nan(1,Ntest);
      case 'single',  out = single(nan(1,Ntest));
      otherwise,      error('f_test_progressive:ErrorDataType','Data type not allowed');                
   end
 
   for jt=1:Ntest
      out(jt) = fh(x(1:ind(jt)));
   end
end

— End support function —

Basically a function specified by handle “fh” (first switch) is calculated on the first ind(jt) elements of the array, considering single/double precision, and single-/multi-thread, as well as parfor (by default single-thread).

In my case, with single-thread, I see the issue occurring particularly with single-precision, single-thread calculations. It is interesting to see how the effect is different depending on the specific function which is evaluated. “std”, in my case, is particularly problematic, while “mean” and “sum” in single-precision/single-thread basically agree with double precision results up to a dimension of about 3.3e7 (in my case).

I hope it can be useful.
Gabriele

Xiangrui Li January 7, 2017 at 17:51 Reply

It seems a serious bug. I tested mean instead of std, and the result is similar. Maybe the error for std is due to the error for mean?

I played with different sizes for arr_double: [2e7 5e7 8e7 1e8 2e8],
parforSingleAccuracy-1 gives following:
[0.0001 0.4901 1.3841 1.9798 0]

parforDoubleAccuracy-1 gives following:
[0.1350 0.1619 0.1648 0.4738 0] * 1e-12

The last (for size 2e8) is zero because it exceeds memory, and matlab gives a warning and switches to local computation.
Bobby Cheng January 26, 2017 at 16:16 Reply
The observation about the difference in computed results between for and parfor loop are down to the limitation of floating point arithmetic. There are two main issues here.

1) The different result from for and parfor loops for double precision inputs.
2) The loss of precision in single precision std.

One basic fact about floating point arithmetic is that it is not associated. That is, order of operations matters. For example, (1e-20 + 1) – 1 returns 0 while 1e-20 + (1 – 1) would return 1e-20. In the case of for vs parfor, the different number of threads used by MATLAB in the two loop settings change the order of floating point operations. This cause the results to be different. There is nothing wrong with parfor. For double precision, both computed results are numerically correct. As a good practice with floating point arithmetic, we should always test the computed results to be accurate within a tolerance rather than testing for equality.

As to the complete loss of precision in std for single precision data, we can trace it back to the computation of sum. The root of the problem is that the current algorithm used by sum simply does not have enough precision to cope with the amount of single precision data. This makes the results seemingly sensitive to the number of computational threads. I will explain more below. But first I would admit that MATLAB needs to do better and we plan to improve the algorithm to better cope with this amount of data in a future release.

To understand what is happening, consider the case of a vector of all ones. If you simply summing up the numbers, like this
x = ones(1, 1e8, 'single'); s = 0; for k = 1:numel(x) s = s + x(k); end >> s s = single 16777216
x = ones(1, 1e8, 'single'); s = 0; for k = 1:numel(x) s = s + x(k); end >> s s = single 16777216
The number 16777216 is not a random integer. It is 2^24. This simple implementation of sum starts to lose precision after 2^24 (more than 16 million) iterations. This is because the distance between single(2^24) and the next larger floating point number is 2. Adding 1 to single(2^24) would result in single(2^24) due to round off error. This is what you would get with sum if you use only one computational thread, like what happens when you use parfor.
>> maxNumCompThreads(1); >> sum(x) ans = single 16777216
>> maxNumCompThreads(1); >> sum(x) ans = single 16777216
When you run sum in a for loop, the computation will be multithreaded and you will see a different set of round off error. Look at the output of sum when running with different number of threads.
for k = [1 2 4 8] maxNumCompThreads(k); sum(x) end ans = single 16777216 ans = single 33554432 ans = single 67108864 ans = single 100000000
for k = [1 2 4 8] maxNumCompThreads(k); sum(x) end ans = single 16777216 ans = single 33554432 ans = single 67108864 ans = single 100000000
Only 8 threads gets the right answer because each thread gets 12500000 elements, which is less than 2^24 elements, so no error in the partial sum, and no error when combining the partial sums from each thread.

This problem is not unique to single precision. The limitation is present for double precision as well, however, since there is more than twice the precision, it takes substantially more numbers to see the issue.

Regards,
Bobby Cheng, MathWorks.

HTML tags such as <b> or <i> are accepted.
Wrap code fragments inside <pre lang="matlab"> tags, like this:

<pre lang="matlab">
a = magic(3);
disp(sum(a))
</pre>

I reserve the right to edit/delete comments (read the site policies).
Not all comments will be answered. You can always email me (altmany at gmail) for private consulting.

Click here to cancel reply.

Data serialization quirks

Data precision quirks

Upcoming travels – Zürich & Geneva

Related posts:

7 Responses

Leave a Reply