I was recently asked by a consulting client to help speed up a Matlab process. Quite often there are various ways to improve the run-time, and in this particular case it turned out that the best option was to convert the core Matlab processing loop into a multi-threaded Mex function, while keeping the rest (vast majority of program code) in easy-to-maintain Matlab. This resulted in a 160x speedup (25 secs => 0.16 secs). Some of this speedup is attributed to C-code being faster in general than Matlab, another part is due to the multi-threading, and another due to in-place data manipulations that avoid costly memory access and re-allocations.
In today’s post I will share some of the insights relating to this MEX conversion, which could be adapted for many other similar use-cases. Additional Matlab speed-up techniques can be found in other performance-related posts on this website, as well in my book Accelerating MATLAB Performance.
There are quite a few online resources about creating Mex files, so I will not focus on this aspect. I’ll assume that the reader is already familiar with the concept of using Mex functions, which are simply dynamically-linked libraries that have a predefined entry-function syntax and predefined platform-specific extension. Instead, I’ll focus on how to create and debug a multi-threaded Mex function, so that it runs in parallel on all CPU cores.
The benefit of multi-threading is that threads are very light-weight objects, that have minimal performance and memory overheads. This contrasts to multi-tasking, which is what the Parallel Computing Toolbox currently does: launches duplicate copies of the entire Matlab engine process (“headless workers”) and then manages and coordinates the tasks to split up the processing work. Multi-tasking should be avoided wherever we can employ light-weight multi-threading instead. Unfortunately, Matlab does not currently have the ability to explicitly multi-thread Matlab code. But we can still use explicit multi-threading by invoking code in other languages, as I’ve already shown for Java, C# (and .NET in general), and C/C++. Today’s article will expand on the latter post (the one about C/C++ multi-threading), by showing a general framework for making a multi-threaded C-based Mex function.
There are several alternative implementation of threads. On non-Windows machines, POSIX threads (“pthreads”) are a de-facto standard; on Windows, which pthreads can still be used, they generally use native Windows threads under the hood, and we can use these native threads directly.
I have uploaded a file called max_in_place to the Matlab File Exchange. This function serves as an example showing a generic multi-threaded Mex function. A compiled version of this Mex file for Win64 is also included in the File Exchange entry, and you can run it directly on a Win64 machine.
The usage in Matlab is as follows (note how matrix1
is updated in-place):
>> matrix1 = rand(4) matrix1 = 0.89092 0.14929 0.81428 0.19664 0.95929 0.25751 0.24352 0.25108 0.54722 0.84072 0.92926 0.61604 0.13862 0.25428 0.34998 0.47329 >> matrix2 = rand(4) matrix2 = 0.35166 0.91719 0.38045 0.53081 0.83083 0.28584 0.56782 0.77917 0.58526 0.7572 0.075854 0.93401 0.54972 0.75373 0.05395 0.12991 >> max_in_place(matrix1, matrix2) >> matrix1 matrix1 = 0.89092 0.91719 0.81428 0.53081 0.95929 0.28584 0.56782 0.77917 0.58526 0.84072 0.92926 0.93401 0.54972 0.75373 0.34998 0.47329 |
Source code and discussion
The pseudo-code for the MEX function is as follows:
mexFunction(): validate the input/output args quick bail-out if empty inputs get the number of threads N from Matlab's maxNumCompThreads function if N == 1 run main_loop directly else split input matrix #1 into N index blocks assign start/end index for each thread create and launch N new threads that run main_loop wait for all N threads to complete free the allocated memory resources end |
Here’s the core source-code of this function, which was adapted from original work by Dirk-Jan Kroon:
/*==================================================================== * * max_in_place.c updates a data matrix in-place with the max value * of the matrix and a 2nd matrix of the same size * * The calling syntax is: * * max_in_place(matrix1, matrix2) * * matrix1 will be updated with the maximal values from corresponding * indices of the 2 matrices * * Both inputs must be double 2D real non-sparse matrices of same size * * Yair Altman 2018-07-18 * http://undocumentedmatlab.com/blog/multi-threaded-mex * * Adapted from original work by Dirk-Jan Kroon * http://mathworks.com/matlabcentral/profile/authors/1097878-dirk-jan-kroon * *==================================================================*/ #include <math.h> #include "mex.h" /* undef needed for LCC compiler */ #undef EXTERN_C #ifdef _WIN32 #include <windows.h> #include <process.h> #else #include <pthread.h> #endif /* Input Arguments */ #define hMatrix1 prhs[0] #define hMatrix2 prhs[1] /* Macros */ #if !defined(MAX) #define MIN(A, B) ((A) < (B) ? (A) : (B)) #endif /* Main processing loop function */ void main_loop(const mxArray *prhs[], int startIdx, int endIdx) { /* Assign pointers to the various parameters */ double *p1 = mxGetPr(hMatrix1); double *p2 = mxGetPr(hMatrix2); /* Loop through all matrix coordinates */ for (int idx=startIdx; idx<=endIdx; idx++) { /* Update hMatrix1 with the maximal value of hMatrix1,hMatrix2 */ if (p1[idx] < p2[idx]) { p1[idx] = p2[idx]; } } } /* Computation function in threads */ #ifdef _WIN32 unsigned __stdcall thread_func(void *ThreadArgs_) { #else void thread_func(void *ThreadArgs_) { #endif double **ThreadArgs = ThreadArgs_; /* void* => double** */ const mxArray** prhs = (const mxArray**) ThreadArgs[0]; int ThreadID = (int) ThreadArgs[1][0]; int startIdx = (int) ThreadArgs[2][0]; int endIdx = (int) ThreadArgs[3][0]; /*mexPrintf("Starting thread #%d: idx=%d:%d\n", ThreadID, startIdx, endIdx); */ /* Run the main processing function */ main_loop(prhs, startIdx, endIdx); /* Explicit end thread, helps to ensure proper recovery of resources allocated for the thread */ #ifdef _WIN32 _endthreadex( 0 ); return 0; #else pthread_exit(NULL); #endif } /* validateInputs function here... */ /* Main entry function */ void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) { /* Validate the inputs */ validateInputs(nlhs, plhs, nrhs, prhs); /* Quick bail-out in the trivial case of empty inputs */ if (mxIsEmpty(hMatrix1)) return; /* Get the number of threads from the Matlab engine (maxNumCompThreads) */ mxArray *matlabCallOut[1] = {0}; mxArray *matlabCallIn[1] = {0}; mexCallMATLAB(1, matlabCallOut, 0, matlabCallIn, "maxNumCompThreads"); double *Nthreadsd = mxGetPr(matlabCallOut[0]); int Nthreads = (int) Nthreadsd[0]; /* Get the number of elements to process */ size_t n1 = mxGetNumberOfElements(hMatrix1); if (Nthreads == 1) { /* Process the inputs directly (not via a thread) */ main_loop(prhs, 0, n1-1); } else { /* multi-threaded */ /* Allocate memory for handles of worker threads */ #ifdef _WIN32 HANDLE *ThreadList = (HANDLE*) malloc(Nthreads*sizeof(HANDLE)); #else pthread_t *ThreadList = (pthread_t*)malloc(Nthreads*sizeof(pthread_t)); #endif /* Allocate memory for the thread arguments (attributes) */ double **ThreadID, **ThreadStartIdx, **ThreadEndIdx, ***ThreadArgs; double *ThreadID1, *ThreadStartIdx1, *ThreadEndIdx1, **ThreadArgs1; ThreadID = (double **) malloc( Nthreads* sizeof(double *) ); ThreadStartIdx = (double **) malloc( Nthreads* sizeof(double *) ); ThreadEndIdx = (double **) malloc( Nthreads* sizeof(double *) ); ThreadArgs = (double ***)malloc( Nthreads* sizeof(double **) ); /* Launch the requested number of threads */ int i; int threadBlockSize = ceil( ((double)n1) / Nthreads ); for (i=0; i<nthreads; i++) { /* Create thread ID */ ThreadID1 = (double *)malloc( 1* sizeof(double) ); ThreadID1[0] = i; ThreadID[i] = ThreadID1; /* Compute start/end indexes for this thread */ ThreadStartIdx1 = (double *) malloc( sizeof(double) ); ThreadStartIdx1[0] = i * threadBlockSize; ThreadStartIdx[i] = ThreadStartIdx1; ThreadEndIdx1 = (double *) malloc( sizeof(double) ); ThreadEndIdx1[0] = MIN((i+1)*threadBlockSize, n1) - 1; ThreadEndIdx[i] = ThreadEndIdx1; /* Prepare thread input args */ ThreadArgs1 = (double **) malloc( 4* sizeof(double*) ); ThreadArgs1[0] = (double *) prhs; ThreadArgs1[1] = ThreadID[i]; ThreadArgs1[2] = ThreadStartIdx[i]; ThreadArgs1[3] = ThreadEndIdx[i]; ThreadArgs[i] = ThreadArgs1; /* Launch the thread with its associated args */ #ifdef _WIN32 ThreadList[i] = (HANDLE)_beginthreadex(NULL, 0, &thread_func, ThreadArgs[i], 0, NULL); #else pthread_create ((pthread_t*)&ThreadList[i], NULL, (void *) &thread_func, ThreadArgs[i]); #endif } /* Wait for all the treads to finish working */ #ifdef _WIN32 for (i=0; i<nthreads; i++) { WaitForSingleObject(ThreadList[i], INFINITE); } for (i=0; i<nthreads; i++) { CloseHandle( ThreadList[i] ); } #else for (i=0; i<nthreads; i++) { pthread_join(ThreadList[i],NULL); } #endif /* Free the memory resources allocated for the threads */ for (i=0; i<nthreads; i++) { free(ThreadArgs[i]); free(ThreadID[i]); free(ThreadStartIdx[i]); free(ThreadEndIdx[i]); } free(ThreadArgs); free(ThreadID ); free(ThreadStartIdx); free(ThreadEndIdx); free(ThreadList); } return; } |
This file also includes a validateInputs function. I did not include it in the code snippet above for brevity; you can read it directly in the FEX entry (max_in_place.c). This function checks that there are exactly 0 output and 2 input args, that the input args are real non-sparse matrices and that they have the same number of elements.
Note that the threads run a generic thread_func function, which in turn runs the main_loop function with the thread’s specified startIndex
, endIndex
values. When this function completes, the thread ends itself explicitly, to ensure resource cleanup.
Also note how the thread code is using pthreads on non-Windows (!defined(_WIN32)
) machines, and native Windows threads otherwise. This means that the same MEX source-code could be used on all Matlab platforms.
The important thing to note about this framework is that we no longer need to worry about the thread plumbing. If we wish to adapt this code for any other processing, we just need to modify the main_loop function with the new processing logic. In addition, you may wish to modify the validateInputs function based on your new setup of input/output args.
A few caveats:
- On Windows machines with R2017b or older, we simply compile using
mex max_in_place.c
; on non-Windows we might need to add the–lpthread
flag to link the pthreads library, depending on your specific compiler. - On R2018a or newer on all platforms, due to MEX’s new interleaved-complex memory format, we would need to compile with the
-R2017b
flag if we wish to use mexGetPr, as in the sample code above (in R2018a’s new data model, the corresponding function is mxGetDoubles). Note that updating data in-place becomes more difficult with the new MEX API, so if you want to preserve the performance boost that in-place data manipulation provides, it may be better to stick with the legacy data memory model. - The sample code above splits the data between the threads based on the first input matrix’s size. Instead, you may consider sending to the MEX function the loop indexes as extra input args, and then splitting those up between the threads.
- In this specific implementation of max_in_place, I have updated the data locations directly. This is generally discouraged and risky, because it conflicts with Matlab’s standard Copy-on-Write mechanism. For example, if we assign the input to any other Matlab variable(s) before calling max_in_place, then that other variable(s) will also get their data updated. If we do not want this side-effect, we should mxUnshareArray the input matrix1, and return the resulting matrix as an output of the MEX function (
plhs[0]
).
Speed-up tips
The core logic in the specific case that I was asked to optimize was something similar to this:
main_process: initialize output matrix loop z over all slices in a 3D data matrix temp_data = data(:,:,z); temp_data = process(temp_data); output = max(output, temp_data); end z loop |
The initial speed-up attempt was naturally focused on the process
and max
functions. Converting them to a MEX function improved the speed by a factor of ~8, but the resulting run-time (4-5 secs) was still too slow for real-time use. The reason that we did not see a larger speed-up was, I believe, due to several reasons:
temp_data
was small enough such that the overheads associated with creating and then disposing separate threads were significant compared to the processing time of each thread.temp_data
was small enough such that each thread processed a relatively small portion of the memory, in contrast to single-threaded processing that accesses memory in larger blocks, more efficiently.- In each iteration of the z loop, the overheads associated with calling the MEX function, handling input variables and validation, creating/disposing threads, and allocating/deallocating memory for
temp_data
, were repeatedly paid.
So, while the profiling result showed that 98% of the time was spent in the MEX function (which would seem to indicate that not much additional speedup can be achieved), in fact the MEX function was under-performing because of the inefficiencies involved in repeatedly creating threads to process small data chunks. It turned out that running in single-thread mode was actually somewhat faster than multi-threaded mode.
I then moved the entire z loop (entire main_process
) into the MEX function, where the threads were split to process separate adjacent blocks of z slices (i.e. different blocks of the z loop). This way, the MEX function was called, the inputs validated, and threads created/disposed only once for the entire process, making this overhead negligible compared to each thread’s processing time. Moreover, each thread now processed the entire temp_data
belonging to its z slice, so memory access was more efficient, reducing the memory I/O wait time and improving the overall processing time. Additional benefits were due to the fact that some variables could be reused within each thread across loop iterations, minimizing memory allocations and deallocations. The overall effect was to reduce the total run-time down to ~0.16 secs, a 160x speedup compared to the original (25 secs). As my client said: “You sped up [the application] from practically useless to clinically useful.”
The lesson: try to minimize MEX invocation and thread creation/disposal overheads, and let the threads process as large adjacent memory blocks as possible.
Debugging MEX files
When debugging multi-threaded MEX functions, I find that it’s often easier to run the function in single-threaded mode to debug the core logic, and once this is ready we can switch back multi-threading. This can easily be done by setting the number of threads outside the MEX function using Matlab’s builtin maxNumCompThreads function:
Nthreads = maxNumCompThreads(1); % temporarily use only 1 thread for debugging max_in_place(matrix1, matrix2); maxNumCompThreads(Nthreads); % restore previous value %maxNumCompThreads('automatic'); % alternative |
Once debugging is done and the MEX function works properly, we should remove the maxNumCompThreads calls, so that the MEX function will use the regular number of Matlab computational threads, which should be the same as the number of cores: feature(‘numCores’).
I typically like to use Eclipse as my IDE for non-Matlab code (Java, C/C++ etc.). Unfortunately, there’s a problem attaching Eclipse to Matlab processes (which is necessary for interactive MEX debugging) if you’re using any recent (post-2015) version of MinGW and Eclipse. This problem is due to a known Eclipse bug, as user Lithe pointed out. The workaround is to install an old version of MinGW, *in addition* to your existing MinGW version. Reportedly, only versions 4.9.1 or older of MinGW include gdb 7.8 (which is still supported by Eclipse), whereas newer versions of MinGW include a newer gdb that is not supported. Download and install such an old MinGW version in a separate folder from your more-modern compiler. Don’t update your MEX to use the older MinGW – just tell Eclipse to use the version of gdb in the old MinGW bin/ folder when you set up a debug configuration for debugging your MEX files.
Once you have a compatible gdb, and ask Eclipse to attach to a process, the processes list will finally appear (it won’t with an incompatible gdb). Use feature('getPID')
to get your Matlab process ID, which can then used to attach to the relevant process in the Eclipse Attach-to-process window. For example, if your Matlab’s PID is 4321, then the Matlab process will be called “Program – 4321” in Eclipse’s processes list.
I wish that MathWorks would update their official Answer and their MinGW support package on File Exchange to include this information, because without it debugging on Eclipse becomes impossible. Eclipse is so powerful, easy-to-use and ubiquitous that it’s a shame for most users not to be able to work with it just because the workarounds above are not readily explained.
N.B. If you don’t like Eclipse, you can also use Visual Studio Code (VS Code), as Andy Campbell recently explained in the MathWorks Developers’ blog.
Consulting
Do you have any Matlab code that could use a bit (or a lot) of speeding-up? If so, please contact me for a private consulting offer. I can’t promise to speed up your code by a similar factor of 160x, but you never know…
Yair, Can you comment on what the distinction would be for using the ANSI C memory allocation routines (malloc/calloc/realloc/free) versus the MATLAB-specific variants (mxMalloc/mxCalloc/mxRealloc/mxFree)? The ANSI versions are used in your example above, and so I’m wondering if there was a specific rationale or if it comes down to personal preference.
@Andy – In general I believe that it is better to use MEX’s mxMalloc/mxCalloc/mxRealloc/mxFree than their ANSI-C counterparts. This will ensure that all the MEX function’s memory is properly and automatically managed by Matlab’s memory manager, reducing the likelihood of segmentation violations and memory leaks. I admit that I did not follow my own advice in the code presented here. It doesn’t make much difference if you allocate and free the memory properly, but using the mx* variants is a bit safer in general.
Note that using mxFree to deallocate memory that was originally allocated using mxCreate*, or conversely using mxDestroyArray to deallocate memory that was originally allocated using mx*alloc, leads to a segmentation fault – mx*alloc should always and only be used with a corresponding mxFree, and similarly mxCreate* should always and only be used with a corresponding mxDestroyArray.
Another error occurs when trying to mxDestroyArray or mxFree within a C++ destructor, since at that point the memory manager has already freed the data.
Yair,
One multithreaded mex file I can’t say enough good things about is getmatvar which, in addition to being faster than MATLAB builtins (i.e. matfile objects), allows you to read part of a struct without loading an entire mat file!
Also worth checking out is mexthread if you have a compiler with std::thread
I’ve seen a couple mex files pop up which use the BOOST library as well, but I can’t think of one at this moment. Maybe another reader can link to an example.
Hi,
I changed the size of matrices from 4 to 10000, i.e. matrix1=rand(10000) and matrix2=rand(10000);
Then, I compared the speed in terms of the number of the threads.
For Nthreads=1 and Nthreads=4 the runtime is almost the same!!! I do not know why?!?!
The mexCallMATLAB function raises interesting possibilities for parallelization.
I spent a few hours (days…) writing a mexFunction that mimics the functionality of MATLAB’s pagefun but on the CPU. I put the call to mexCallMATLAB inside a
#pragma omp parallel
for loop, so essentially it calls the builtin functions (e.g. eig, svd, inv, etc.) rather than recreating all the LAPACK interfacing in C++. It works on a single thread but crashes if num_threads > 1. Any ideas what cardinal rule this is breaking?Note Intel’s MKL implementations of BLAS/LAPACK are already multi-threaded, so it is likely that you won’t achieve a vast improvement (the overall performance could even be harmed) in doing this, unless you ensure BLAS calls use single threading.