Comments on: Allocation performance take 2

By: Roberto

Roberto — Thu, 05 Sep 2013 15:58:23 +0000

The results I get from R2012b on my MacBook Pro (10.8.4) are quite different from the plot you've been sent. The performance of zeros and ones is practically identical for me, to the point that the two lines in the plot are practically indistinguishable (the maximum difference between the two is about 0.01 sec without normalising by the number of iterations). Also, the time growth is linear (no improvement for 200K+ elements), which makes me wonder if R2013a introduced a new memory allocation algorithm.

By: Yair Altman

Yair Altman — Tue, 20 Aug 2013 21:36:15 +0000

In reply to Michelle Hirsch. @Michelle - thanks for the clarification.

By: Michelle Hirsch

Michelle Hirsch — Tue, 20 Aug 2013 14:11:38 +0000

Interesting assessment Yair, but it turns out that the reasons for the behavior changes aren’t what you thought. The performance change for zeros in R2008b resulted from a change in the underlying MATLAB memory management architecture at that time.

By: Yair Altman

Yair Altman — Thu, 15 Aug 2013 18:06:26 +0000

In reply to Amro. @Amro - thanks for the detailed comment and references. You may indeed be correct regarding Intel, since I've received the following results for a MacBook Pro (R2013a, Mountain Lion) from Malcolm Lidierth (thanks!):

By: Amro

Amro — Thu, 15 Aug 2013 00:04:06 +0000

Just yesterday, I was looking into a similar topic on Stack Overflow: http://stackoverflow.com/a/18217986/97160

One of the things I found while investigating the issue is that MATLAB appears to be using a custom memory allocator optimized for multi-threaded cases, namely the Intel TBB scalable memory allocator (libmx.dll has a dependency on tbbmalloc.dll, which is Intel’s library). I suspect that the implementation of zeros switches to this parallel memory allocator once the size is large enough.

btw there are all sorts of memory allocators out there, each claiming to be better than the others: http://en.wikipedia.org/wiki/Malloc#Implementations

—

I should point out that the “bzero” you mentioned is now deprecated [1], and even appears to use the same underlying call as “memset” [2]. Even the specialized “ZeroMemory” in the Win32 API is just a macro that expands to “memset” [3] (which is probably optimized for your platform, whether that’s implemented in kernel code or by the CRT library).

I think the difference between zeros and ones could be explained by the performance of malloc+memset vs. calloc. There’s an excellent explanation over here: http://stackoverflow.com/a/2688522/97160
_
[1]: http://c-unix-linux.blogspot.com/2009/01/bzero-and-memset.html
[2]: http://fdiv.net/2009/01/14/memset-vs-bzero-ultimate-showdown
[3]: http://stackoverflow.com/questions/3038302/why-do-zeromemory-etc-exist-when-there-are-memset-etc-already
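
To make the malloc+memset vs. calloc comparison concrete, here is a minimal C sketch of the three allocation strategies being discussed. The function names (alloc_filled, alloc_zeros_memset, alloc_zeros_calloc) are hypothetical and only for illustration; this is not MATLAB's actual implementation, which is not public. It also assumes IEEE 754 doubles, where all-bits-zero represents 0.0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a "ones"-style fill: the caller must
 * write every element, so every page of the buffer gets touched. */
double *alloc_filled(size_t n, double value)
{
    assert(n > 0);
    double *p = malloc(n * sizeof *p);
    if (p == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        p[i] = value;
    return p;
}

/* Zeros via malloc + memset: the memset explicitly writes
 * zero bytes over the whole buffer after allocation. */
double *alloc_zeros_memset(size_t n)
{
    assert(n > 0);
    double *p = malloc(n * sizeof *p);
    if (p == NULL)
        return NULL;
    memset(p, 0, n * sizeof *p); /* all-bits-zero == 0.0 (IEEE 754 assumed) */
    return p;
}

/* Zeros via calloc: the allocator is free to skip the explicit
 * clearing pass when it already knows the memory is zeroed, e.g.
 * fresh pages obtained from the operating system. */
double *alloc_zeros_calloc(size_t n)
{
    assert(n > 0);
    return calloc(n, sizeof(double));
}
```

All three return a buffer that the caller releases with free(). Whether calloc is actually faster than malloc+memset depends on the C library and on the allocation size, so timing the three on your own platform is the only way to see if they reproduce the zeros/ones gap.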

By: Joshua Leahy

Joshua Leahy — Wed, 14 Aug 2013 23:41:05 +0000

It’s likely that zeros becomes much faster than ones at such large sizes because MATLAB would switch to using mmap rather than a combination of malloc and bzero. mmap provides you with any amount of prezeroed memory in roughly constant time.

The trick is achieved by the operating system lazily allocating the memory on first use. If I’m right then you might see a penalty on the first use of larger allocations, but not on smaller ones.
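
The mmap idea described above can be sketched in C as follows. This is only an illustration for POSIX systems (Linux, OS X); the helper names map_zeros and unmap_zeros are hypothetical, and nothing here is confirmed to be what MATLAB does internally.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc in strict modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON /* older BSD/OS X spelling */
#endif

/* Hypothetical helper: request n zero-filled doubles straight from
 * the operating system. An anonymous private mapping reads as zero,
 * and physical pages are typically assigned only when first touched,
 * so the call itself does not scale with n. */
double *map_zeros(size_t n)
{
    assert(n > 0);
    void *p = mmap(NULL, n * sizeof(double),
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (double *)p;
}

/* Release a buffer obtained from map_zeros; returns 0 on success. */
int unmap_zeros(double *p, size_t n)
{
    return munmap(p, n * sizeof(double));
}
```

With this approach the cost moves from allocation time to first use: each page is zeroed by the kernel when it is first read or written, which is the penalty on first use mentioned above.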