Multi-threaded Mex

I was recently asked by a consulting client to help speed up a Matlab process. Accelerating MATLAB Performance Quite often there are various ways to improve the run-time, and in this particular case it turned out that the best option was to convert the core Matlab processing loop into a multi-threaded Mex function, while keeping the rest (vast majority of program code) in easy-to-maintain Matlab. This resulted in a 160x speedup (25 secs => 0.16 secs). Some of this speedup is attributed to C-code being faster in general than Matlab, another part is due to the multi-threading, and another due to in-place data manipulations that avoid costly memory access and re-allocations.

In today’s post I will share some of the insights relating to this MEX conversion, which could be adapted for many other similar use-cases. Additional Matlab speed-up techniques can be found in other performance-related posts on this website, as well in my book Accelerating MATLAB Performance.

There are quite a few online resources about creating Mex files, so I will not focus on this aspect. I’ll assume that the reader is already familiar with the concept of using Mex functions, which are simply dynamically-linked libraries that have a predefined entry-function syntax and predefined platform-specific extension. Instead, I’ll focus on how to create and debug a multi-threaded Mex function, so that it runs in parallel on all CPU cores.

The benefit of multi-threading is that threads are very light-weight objects, that have minimal performance and memory overheads. This contrasts to multi-tasking, which is what the Parallel Computing Toolbox currently does: launches duplicate copies of the entire Matlab engine process (“headless workers”) and then manages and coordinates the tasks to split up the processing work. Multi-tasking should be avoided wherever we can employ light-weight multi-threading instead. Unfortunately, Matlab does not currently have the ability to explicitly multi-thread Matlab code. But we can still use explicit multi-threading by invoking code in other languages, as I’ve already shown for Java, C# (and .NET in general), and C/C++. Today’s article will expand on the latter post (the one about C/C++ multi-threading), by showing a general framework for making a multi-threaded C-based Mex function.

There are several alternative implementation of threads. On non-Windows machines, POSIX threads (“pthreads”) are a de-facto standard; on Windows, which pthreads can still be used, they generally use native Windows threads under the hood, and we can use these native threads directly.

I have uploaded a file called max_in_place to the Matlab File Exchange. This function serves as an example showing a generic multi-threaded Mex function. A compiled version of this Mex file for Win64 is also included in the File Exchange entry, and you can run it directly on a Win64 machine.

The usage in Matlab is as follows (note how matrix1 is updated in-place):

>> matrix1 = rand(4)
matrix1 =
      0.89092      0.14929      0.81428      0.19664
      0.95929      0.25751      0.24352      0.25108
      0.54722      0.84072      0.92926      0.61604
      0.13862      0.25428      0.34998      0.47329
 
>> matrix2 = rand(4)
matrix2 =
      0.35166      0.91719      0.38045      0.53081
      0.83083      0.28584      0.56782      0.77917
      0.58526       0.7572     0.075854      0.93401
      0.54972      0.75373      0.05395      0.12991
 
>> max_in_place(matrix1, matrix2)
 
>> matrix1
matrix1 =
      0.89092      0.91719      0.81428      0.53081
      0.95929      0.28584      0.56782      0.77917
      0.58526      0.84072      0.92926      0.93401
      0.54972      0.75373      0.34998      0.47329

Source code and discussion

The pseudo-code for the MEX function is as follows:

mexFunction():
   validate the input/output args
   quick bail-out if empty inputs
   get the number of threads N from Matlab's maxNumCompThreads function
   if N == 1
       run main_loop directly
   else
       split input matrix #1 into N index blocks
       assign start/end index for each thread
       create and launch N new threads that run main_loop
       wait for all N threads to complete
       free the allocated memory resources
   end

Here’s the core source-code of this function, which was adapted from original work by Dirk-Jan Kroon:

/*====================================================================
 *
 * max_in_place.c  updates a data matrix in-place with the max value
 *                 of the matrix and a 2nd matrix of the same size
 *
 * The calling syntax is:
 *
 *		max_in_place(matrix1, matrix2)
 *
 * matrix1 will be updated with the maximal values from corresponding
 * indices of the 2 matrices
 *
 * Both inputs must be double 2D real non-sparse matrices of same size
 *
 * Yair Altman 2018-07-18
 * http://UndocumentedMatlab.com/blog/multi-threaded-mex
 *
 * Adapted from original work by Dirk-Jan Kroon
 * http://mathworks.com/matlabcentral/profile/authors/1097878-dirk-jan-kroon
 *
 *==================================================================*/
 
#include <math.h>
#include "mex.h"
 
/* undef needed for LCC compiler */
#undef EXTERN_C
#ifdef _WIN32
    #include <windows.h>
    #include <process.h>
#else
    #include <pthread.h>
#endif
 
/* Input Arguments */
#define	hMatrix1	prhs[0]
#define	hMatrix2	prhs[1]
 
/* Macros */
#if !defined(MAX)
#define	MIN(A, B)	((A) < (B) ? (A) : (B))
#endif
 
/* Main processing loop function */
void main_loop(const mxArray *prhs[], int startIdx, int endIdx)
{
    /* Assign pointers to the various parameters */
    double *p1 = mxGetPr(hMatrix1);
    double *p2 = mxGetPr(hMatrix2);
 
    /* Loop through all matrix coordinates */
    for (int idx=startIdx; idx<=endIdx; idx++)
    {
        /* Update hMatrix1 with the maximal value of hMatrix1,hMatrix2 */
        if (p1[idx] < p2[idx]) {
            p1[idx] = p2[idx];
        }
    }
}
 
/* Computation function in threads */
#ifdef _WIN32
  unsigned __stdcall thread_func(void *ThreadArgs_) {
#else
  void thread_func(void *ThreadArgs_) {
#endif
    double **ThreadArgs = ThreadArgs_;  /* void* => double** */
    const mxArray** prhs = (const mxArray**) ThreadArgs[0];
 
    int ThreadID = (int) ThreadArgs[1][0];
    int startIdx = (int) ThreadArgs[2][0];
    int endIdx   = (int) ThreadArgs[3][0];
    /*mexPrintf("Starting thread #%d: idx=%d:%d\n", ThreadID, startIdx, endIdx); */
 
    /* Run the main processing function */
    main_loop(prhs, startIdx, endIdx);
 
    /* Explicit end thread, helps to ensure proper recovery of resources allocated for the thread */
    #ifdef _WIN32
        _endthreadex( 0 );
        return 0;
    #else
        pthread_exit(NULL);
    #endif
}
 
/* validateInputs function here... */
 
/* Main entry function */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
 
{
    /* Validate the inputs */
    validateInputs(nlhs, plhs, nrhs, prhs);
 
    /* Quick bail-out in the trivial case of empty inputs */
    if (mxIsEmpty(hMatrix1))  return;
 
    /* Get the number of threads from the Matlab engine (maxNumCompThreads) */
    mxArray *matlabCallOut[1] = {0};
    mxArray *matlabCallIn[1]  = {0};
    mexCallMATLAB(1, matlabCallOut, 0, matlabCallIn, "maxNumCompThreads");
    double *Nthreadsd = mxGetPr(matlabCallOut[0]);
    int Nthreads = (int) Nthreadsd[0];
 
    /* Get the number of elements to process */
    size_t n1 = mxGetNumberOfElements(hMatrix1);
 
    if (Nthreads == 1) {
 
        /* Process the inputs directly (not via a thread) */
        main_loop(prhs, 0, n1-1);
 
    } else {  /* multi-threaded */
 
        /* Allocate memory for handles of worker threads */
        #ifdef _WIN32
            HANDLE    *ThreadList = (HANDLE*)   malloc(Nthreads*sizeof(HANDLE));
        #else
            pthread_t *ThreadList = (pthread_t*)malloc(Nthreads*sizeof(pthread_t));
        #endif
 
        /* Allocate memory for the thread arguments (attributes) */
        double **ThreadID, **ThreadStartIdx, **ThreadEndIdx, ***ThreadArgs;
        double *ThreadID1, *ThreadStartIdx1, *ThreadEndIdx1, **ThreadArgs1;
 
        ThreadID       = (double **) malloc( Nthreads* sizeof(double *) );
        ThreadStartIdx = (double **) malloc( Nthreads* sizeof(double *) );
        ThreadEndIdx   = (double **) malloc( Nthreads* sizeof(double *) );
        ThreadArgs     = (double ***)malloc( Nthreads* sizeof(double **) );
 
        /* Launch the requested number of threads */
        int i;
        int threadBlockSize = ceil( ((double)n1) / Nthreads );
        for (i=0; i<Nthreads; i++)
        {
            /* Create thread ID */
            ThreadID1 = (double *)malloc( 1* sizeof(double) );
            ThreadID1[0] = i;
            ThreadID[i] = ThreadID1;
 
            /* Compute start/end indexes for this thread */
            ThreadStartIdx1 = (double *) malloc( sizeof(double) );
            ThreadStartIdx1[0] = i * threadBlockSize;
            ThreadStartIdx[i] = ThreadStartIdx1;
 
            ThreadEndIdx1 = (double *) malloc( sizeof(double) );
            ThreadEndIdx1[0] = MIN((i+1)*threadBlockSize, n1) - 1;
            ThreadEndIdx[i] = ThreadEndIdx1;
 
            /* Prepare thread input args */
            ThreadArgs1 = (double **) malloc( 4* sizeof(double*) );
            ThreadArgs1[0] = (double *) prhs;
            ThreadArgs1[1] = ThreadID[i];
            ThreadArgs1[2] = ThreadStartIdx[i];
            ThreadArgs1[3] = ThreadEndIdx[i];
 
            ThreadArgs[i] = ThreadArgs1;
 
            /* Launch the thread with its associated args */
            #ifdef _WIN32
                ThreadList[i] = (HANDLE)_beginthreadex(NULL, 0, &thread_func, ThreadArgs[i], 0, NULL);
            #else
                pthread_create ((pthread_t*)&ThreadList[i], NULL, (void *) &thread_func, ThreadArgs[i]);
            #endif
        }
 
        /* Wait for all the treads to finish working */
        #ifdef _WIN32
            for (i=0; i<Nthreads; i++) { WaitForSingleObject(ThreadList[i], INFINITE); }
            for (i=0; i<Nthreads; i++) { CloseHandle( ThreadList[i] ); }
        #else
            for (i=0; i<Nthreads; i++) { pthread_join(ThreadList[i],NULL); }
        #endif
 
        /* Free the memory resources allocated for the threads */
        for (i=0; i<Nthreads; i++)
        {
            free(ThreadArgs[i]);
            free(ThreadID[i]);
            free(ThreadStartIdx[i]);
            free(ThreadEndIdx[i]);
        }
 
        free(ThreadArgs);
        free(ThreadID );
        free(ThreadStartIdx);
        free(ThreadEndIdx);
        free(ThreadList);
    }
 
    return;
}

This file also includes a validateInputs function. I did not include it in the code snippet above for brevity; you can read it directly in the FEX entry (max_in_place.c). This function checks that there are exactly 0 output and 2 input args, that the input args are real non-sparse matrices and that they have the same number of elements.

Note that the threads run a generic thread_func function, which in turn runs the main_loop function with the thread’s specified startIndex, endIndex values. When this function completes, the thread ends itself explicitly, to ensure resource cleanup.

Also note how the thread code is using pthreads on non-Windows (!defined(_WIN32)) machines, and native Windows threads otherwise. This means that the same MEX source-code could be used on all Matlab platforms.

The important thing to note about this framework is that we no longer need to worry about the thread plumbing. If we wish to adapt this code for any other processing, we just need to modify the main_loop function with the new processing logic. In addition, you may wish to modify the validateInputs function based on your new setup of input/output args.

A few caveats:

  • On Windows machines with R2017b or older, we simply compile using mex max_in_place.c; on non-Windows we might need to add the –lpthread flag to link the pthreads library, depending on your specific compiler.
  • On R2018a or newer on all platforms, due to MEX’s new interleaved-complex memory format, we would need to compile with the -R2017b flag if we wish to use mexGetPr, as in the sample code above (in R2018a’s new data model, the corresponding function is mxGetDoubles). Note that updating data in-place becomes more difficult with the new MEX API, so if you want to preserve the performance boost that in-place data manipulation provides, it may be better to stick with the legacy data memory model.
  • The sample code above splits the data between the threads based on the first input matrix’s size. Instead, you may consider sending to the MEX function the loop indexes as extra input args, and then splitting those up between the threads.
  • In this specific implementation of max_in_place, I have updated the data locations directly. This is generally discouraged and risky, because it conflicts with Matlab’s standard Copy-on-Write mechanism. For example, if we assign the input to any other Matlab variable(s) before calling max_in_place, then that other variable(s) will also get their data updated. If we do not want this side-effect, we should mxUnshareArray the input matrix1, and return the resulting matrix as an output of the MEX function (plhs[0]).

Speed-up tips

The core logic in the specific case that I was asked to optimize was something similar to this:

main_process:
    initialize output matrix
    loop z over all slices in a 3D data matrix
        temp_data = data(:,:,z);
        temp_data = process(temp_data);
        output = max(output, temp_data);
    end z loop

The initial speed-up attempt was naturally focused on the process and max functions. Converting them to a MEX function improved the speed by a factor of ~8, but the resulting run-time (4-5 secs) was still too slow for real-time use. The reason that we did not see a larger speed-up was, I believe, due to several reasons:

  • temp_data was small enough such that the overheads associated with creating and then disposing separate threads were significant compared to the processing time of each thread.
  • temp_data was small enough such that each thread processed a relatively small portion of the memory, in contrast to single-threaded processing that accesses memory in larger blocks, more efficiently.
  • In each iteration of the z loop, the overheads associated with calling the MEX function, handling input variables and validation, creating/disposing threads, and allocating/deallocating memory for temp_data, were repeatedly paid.

So, while the profiling result showed that 98% of the time was spent in the MEX function (which would seem to indicate that not much additional speedup can be achieved), in fact the MEX function was under-performing because of the inefficiencies involved in repeatedly creating threads to process small data chunks. It turned out that running in single-thread mode was actually somewhat faster than multi-threaded mode.

I then moved the entire z loop (entire main_process) into the MEX function, where the threads were split to process separate adjacent blocks of z slices (i.e. different blocks of the z loop). This way, the MEX function was called, the inputs validated, and threads created/disposed only once for the entire process, making this overhead negligible compared to each thread’s processing time. Moreover, each thread now processed the entire temp_data belonging to its z slice, so memory access was more efficient, reducing the memory I/O wait time and improving the overall processing time. Additional benefits were due to the fact that some variables could be reused within each thread across loop iterations, minimizing memory allocations and deallocations. The overall effect was to reduce the total run-time down to ~0.16 secs, a 160x speedup compared to the original (25 secs). As my client said: “You sped up [the application] from practically useless to clinically useful.”

The lesson: try to minimize MEX invocation and thread creation/disposal overheads, and let the threads process as large adjacent memory blocks as possible.

Debugging MEX files

When debugging multi-threaded MEX functions, I find that it’s often easier to run the function in single-threaded mode to debug the core logic, and once this is ready we can switch back multi-threading. This can easily be done by setting the number of threads outside the MEX function using Matlab’s builtin maxNumCompThreads function:

Nthreads = maxNumCompThreads(1);  % temporarily use only 1 thread for debugging
max_in_place(matrix1, matrix2);
maxNumCompThreads(Nthreads);      % restore previous value
%maxNumCompThreads('automatic');  % alternative

Once debugging is done and the MEX function works properly, we should remove the maxNumCompThreads calls, so that the MEX function will use the regular number of Matlab computational threads, which should be the same as the number of cores: feature(‘numCores’).

I typically like to use Eclipse as my IDE for non-Matlab code (Java, C/C++ etc.). Unfortunately, there’s a problem attaching Eclipse to Matlab processes (which is necessary for interactive MEX debugging) if you’re using any recent (post-2015) version of MinGW and Eclipse. This problem is due to a known Eclipse bug, as user Lithe pointed out. The workaround is to install an old version of MinGW, *in addition* to your existing MinGW version. Reportedly, only versions 4.9.1 or older of MinGW include gdb 7.8 (which is still supported by Eclipse), whereas newer versions of MinGW include a newer gdb that is not supported. Download and install such an old MinGW version in a separate folder from your more-modern compiler. Don’t update your MEX to use the older MinGW – just tell Eclipse to use the version of gdb in the old MinGW bin/ folder when you set up a debug configuration for debugging your MEX files.

Once you have a compatible gdb, and ask Eclipse to attach to a process, the processes list will finally appear (it won’t with an incompatible gdb). Use feature('getPID') to get your Matlab process ID, which can then used to attach to the relevant process in the Eclipse Attach-to-process window. For example, if your Matlab’s PID is 4321, then the Matlab process will be called “Program – 4321” in Eclipse’s processes list.

I wish that MathWorks would update their official Answer and their MinGW support package on File Exchange to include this information, because without it debugging on Eclipse becomes impossible. Eclipse is so powerful, easy-to-use and ubiquitous that it’s a shame for most users not to be able to work with it just because the workarounds above are not readily explained.

N.B. If you don’t like Eclipse, you can also use Visual Studio Code (VS Code), as Andy Campbell recently explained in the MathWorks Developers’ blog.

Consulting

Do you have any Matlab code that could use a bit (or a lot) of speeding-up? If so, please contact me for a private consulting offer. I can’t promise to speed up your code by a similar factor of 160x, but you never know…

Categories: Low risk of breaking in future versions, Mex, Undocumented feature

Tags: , ,

Bookmark and SharePrint Print

3 Responses to Multi-threaded Mex

  1. Andy Stamps says:

    Yair, Can you comment on what the distinction would be for using the ANSI C memory allocation routines (malloc/calloc/realloc/free) versus the MATLAB-specific variants (mxMalloc/mxCalloc/mxRealloc/mxFree)? The ANSI versions are used in your example above, and so I’m wondering if there was a specific rationale or if it comes down to personal preference.

    • @Andy – In general I believe that it is better to use MEX’s mxMalloc/mxCalloc/mxRealloc/mxFree than their ANSI-C counterparts. This will ensure that all the MEX function’s memory is properly and automatically managed by Matlab’s memory manager, reducing the likelihood of segmentation violations and memory leaks. I admit that I did not follow my own advice in the code presented here. It doesn’t make much difference if you allocate and free the memory properly, but using the mx* variants is a bit safer in general.

      Note that using mxFree to deallocate memory that was originally allocated using mxCreate*, or conversely using mxDestroyArray to deallocate memory that was originally allocated using mx*alloc, leads to a segmentation fault – mx*alloc should always and only be used with a corresponding mxFree, and similarly mxCreate* should always and only be used with a corresponding mxDestroyArray.

      Another error occurs when trying to mxDestroyArray or mxFree within a C++ destructor, since at that point the memory manager has already freed the data.

  2. Peter Cook says:

    Yair,

    One multithreaded mex file I can’t say enough good things about is getmatvar which, in addition to being faster than MATLAB builtins (i.e. matfile objects), allows you to read part of a struct without loading an entire mat file!

    Also worth checking out is mexthread if you have a compiler with std::thread

    I’ve seen a couple mex files pop up which use the BOOST library as well, but I can’t think of one at this moment. Maybe another reader can link to an example.

Leave a Reply

Your email address will not be published. Required fields are marked *