Another couple of Matlab bugs and workarounds

November 26, 2014 32 Comments

Every now and then I come across some internal Matlab bugs. In many cases I find a workaround and move on, sometimes bothering to report the bugs to MathWorks support, but often not. In truth, it’s a bit frustrating to hear the standard response that the issue [or “unexpected behavior”, but never “bug” – apparently that’s a taboo word] “has been reported to the development team and they will consider fixing it in one of the future releases of MATLAB”.
To date I’ve reported dozens of bugs and as far as I can tell, few if any of them have actually been fixed, years after I’ve reported them. None of them appear on Matlab’s official bug parade, which is only a small subset of the full list that MathWorks keeps hidden for some unknown reason (update: see the discussion in the comments thread below, especially the input by Steve Eddins). Never mind, I don’t take it personally, I simply find a workaround and move on. I’ve already posted about this before. Today I’ll discuss two additional bugs I’ve run across once-too-often, and my workarounds:

Nothing really earth-shattering, but annoying nonetheless.

Saving non-Latin Command Window text using diary

The diary function is well-known for saving Matlab’s Command-Window (CW) text to a file. The function has existed for the past two decades at least, possibly even longer.
Unfortunately, perhaps the developer never thought that Matlab would be used outside the Americas and Western Europe, otherwise I cannot understand why to this day diary saves the text in ASCII format rather than the UTF-16 variant used by the CW. This works ok for basic Latin characters, but anyone who outputs Chinese, Japanese, Korean, Hindi, Arabic, Hebrew or other alphabets to the CW, and tries to save it using diary, will find the file unreadable.
Here is a sample illustrative script, that outputs the Arabic word salaam (peace, سلام) to the CW and then tries to save this using diary. If you try it, you will see it ok in the CW, but garbage text in the generated text file:

>> fname='diary_bug.txt'; diary(fname); disp(char([1587,1604,1575,1605])); diary off; winopen(fname)
سلام

The problem is that since diary assumes ASCII characters, any characters having a numeric value above 255 get truncated and are stored as invalid 1-byte characters, char(26) in this case.
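To see why these characters do not survive, it helps to look at their numeric codes. The following is a minimal sketch (the variable names are only illustrative) showing that every character of the sample word lies above 255, so none of them fits in a single byte, whereas a proper Unicode encoding spends two bytes on each:

```matlab
% The Arabic word salaam, same character codes as in the diary example above
word  = char([1587,1604,1575,1605]);
codes = double(word);                          % numeric value of each character

fprintf('Character codes: %s\n', mat2str(codes));
fprintf('Above 255?       %s\n', mat2str(codes > 255));

% Encode the same text explicitly, to see how many bytes it really needs
utf16 = unicode2native(word, 'UTF-16LE');      % 2 bytes per character, no BOM
utf8  = unicode2native(word, 'UTF-8');         % Arabic letters take 2 bytes each in UTF-8
fprintf('UTF-16LE bytes: %d, UTF-8 bytes: %d\n', numel(utf16), numel(utf8));
```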
Here’s my workaround:

% Output Command Window text to a text file
function saveCmdWinText(filename)
    cmdWinDoc = com.mathworks.mde.cmdwin.CmdWinDocument.getInstance;
    txt = char(cmdWinDoc.getText(0,cmdWinDoc.getLength));
    fid = fopen(filename,'W');
    fwrite(fid,txt,'uint16');  % store as 2-byte characters
    fclose(fid);
    %winopen(filename);  % in case you wish to verify...
end

This works well, saving the characters in their original 2-byte format, for those alphabets that use 2-bytes: non-basic Latins, Greek, Cyrillic, Armenian, Arabic, Hebrew, Coptic, Syriac and Tāna (I don’t think there are more than a handful of Matlab users who use Coptic, Syriac or Tāna but never mind). However, UTF-8 specifies that CJK characters need 3-4 bytes and this is apparently not supported in Matlab, whose basic char data type only has 2 bytes, so I assume that Chinese, Japanese and Korean will probably require a different solution (perhaps the internal implementation of char and the CW is different in the Chinese/Japanese versions of Matlab, I really don’t know. If this is indeed the case, then perhaps a variant of my workaround can also be used for CJK output).
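A quick way to convince yourself that the 2-byte format is lossless is a round-trip check. The sketch below does not touch the Command Window at all: it writes a sample string with the same fopen/fwrite calls as saveCmdWinText and reads it back (the temporary file name and variable names are assumptions made for the example):

```matlab
txt   = char([1587,1604,1575,1605]);     % sample non-Latin text (salaam)
fname = [tempname '.txt'];               % temporary file, only for this test

% Write exactly as saveCmdWinText does: raw 2-byte values
fid = fopen(fname,'W');
fwrite(fid, txt, 'uint16');
fclose(fid);

% Read the 2-byte values back and convert them to characters
fid = fopen(fname,'r');
readBack = fread(fid, inf, 'uint16=>char')';   % fread returns a column, so transpose
fclose(fid);
delete(fname);

disp(isequal(readBack, txt));            % the text should survive unchanged
```

Note that fwrite with 'uint16' uses the machine's native byte order and writes no byte-order mark, so an external editor may need to be told explicitly that the file is UTF-16.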

Correction #1: I have learned since posting (see Steve Eddins’ comment below) that Matlab actually uses UTF-16 rather than UTF-8, solving the CJK issue. I humbly stand corrected.
Correction #2: The saveCmdWinText code above saves the CW text in UTF-16 format. This may be problematic in some text editors that are not UTF-savvy. For such editors (or if your editor gets confused with the various BOM/endianness options), consider saving the data in UTF-8 format – again, assuming you’re not using an alphabet [such as CJK] outside the ASCII range (thanks Rob):
function saveCmdWinText_UTF8(filename)
    cmdWinDoc = com.mathworks.mde.cmdwin.CmdWinDocument.getInstance;
    txt = char(cmdWinDoc.getText(0,cmdWinDoc.getLength));
    fid = fopen(filename,'W','n','utf-8');
    fwrite(fid,txt,'char');
    fclose(fid);
    %winopen(filename);  % in case you wish to verify...
end
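The same kind of round-trip check can be done for the UTF-8 variant. This is a sketch under the assumption that the file is reopened with the same encoding argument; the file and variable names are again only illustrative:

```matlab
txt   = char([1587,1604,1575,1605]);     % sample non-Latin text (salaam)
fname = [tempname '.txt'];               % temporary file, only for this test

% Write as saveCmdWinText_UTF8 does: fopen sets the encoding, fwrite emits chars
fid = fopen(fname,'W','n','UTF-8');
fwrite(fid, txt, 'char');
fclose(fid);

% Reopen with the same encoding so that fread decodes the bytes into characters
fid = fopen(fname,'r','n','UTF-8');
readBack = fread(fid, inf, '*char')';
fclose(fid);

fileInfo = dir(fname);                   % file size on disk, in bytes
nBytes   = fileInfo.bytes;
delete(fname);

fprintf('Round trip ok: %d, file size: %d bytes\n', isequal(readBack,txt), nBytes);
```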

Also, this workaround is problematic in the sense that it’s a one-time operation that stores the entire CW text that is visible at that point. This is more limited than diary’s ability to start and stop output recording in mid-run, and to record output on-the-fly (rather than only at the end). Still, it does provide a solution in case you output non-ASCII 2-byte characters to the CW.
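One possible way to narrow this gap is to remember where the CW text ended at some point, and later save only what was added after that point. The function below is a hypothetical sketch, not a finished utility: the name saveCmdWinTextSince and its 'mark'/'save' actions are invented for this example, and it assumes that the CW text was not trimmed by the scroll-buffer limit between the two calls.

```matlab
function saveCmdWinTextSince(action, filename)
% Hypothetical helper (name and actions are assumptions for illustration):
%   saveCmdWinTextSince('mark')           remembers the current end of the CW text
%   saveCmdWinTextSince('save',filename)  saves the text added since the last mark
    persistent startPos
    if isempty(startPos), startPos = 0; end

    cmdWinDoc = com.mathworks.mde.cmdwin.CmdWinDocument.getInstance;
    switch lower(action)
        case 'mark'
            drawnow;                               % let pending output reach the CW first
            startPos = cmdWinDoc.getLength;
        case 'save'
            drawnow;
            docLen = cmdWinDoc.getLength;
            if startPos > docLen                   % CW was cleared (e.g. clc) since the mark
                startPos = 0;
            end
            txt = char(cmdWinDoc.getText(startPos, docLen-startPos));
            fid = fopen(filename,'W');
            fwrite(fid,txt,'uint16');              % 2-byte characters, as in saveCmdWinText
            fclose(fid);
        otherwise
            error('saveCmdWinTextSince:action', 'Unknown action: %s', action);
    end
end
```

Usage would then resemble diary on/off: call saveCmdWinTextSince('mark') before the interesting output starts, and saveCmdWinTextSince('save','partial_cw.txt') once it is done. Unlike diary, nothing is written to disk until the 'save' call.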

Update: I plan to post a utility to the Matlab File Exchange in the near future that will mimic diary’s ability to start/stop text recording, rather than simply dumping the entire CW contents to file. I’ll update here when this utility is ready for download.

There are various other bugs related to entering non-Latin (and specifically RTL) characters in the CW and the Matlab Editor. Solving the diary bug is certainly the least of these worries. Life goes on…
p.s. – I typically use this translator to convert from native script to UTF codes that can be used in Matlab. I’m sure there are plenty of other translators, but this one does the job well enough for me.
For people interested in learning more about the Command Window internals, take a look at my cprintf and setPrompt utilities.

cprintf usage examples

setPrompt usage examples

Printing GUIs reliably

Matlab has always tried to be far too sophisticated for its own good when printing figures. There’s plenty of internal code that tries to handle numerous circumstances in the figure contents for optimal output. Unfortunately, this code also has many bugs. Try printing even a slightly-complex GUI containing panels and/or Java controls and you’ll see components overlapping each other, not being printed, and/or being rendered incorrectly in the printed output. Not to mention the visible flicker that happens when Matlab modifies the figure in preparation for printing, and then modifies it back to the original.
All this when a simple printout of a screen-capture would be both much faster and 100% reliable.
Which is where my ScreenCapture utility comes in. Unlike Matlab’s print and getframe, ScreenCapture takes an actual screen-capture of an entire figure, or part of a figure (or even a desktop area outside any Matlab figure), and can then send the resulting image to a Matlab variable (2D RGB image), an image file, system clipboard, or the printer. We can easily modify the <Print> toolbar button and menu item to use this utility rather than the builtin print function:

hToolbar = findall(gcf,'tag','FigureToolBar');
hPrintButton = findall(hToolbar, 'tag','Standard.PrintFigure');
set(hPrintButton, 'ClickedCallback','screencapture(gcbf,[],''printer'')');
hPrintMenuItem = findall(gcf, 'type','uimenu', 'tag','printMenu');
set(hPrintMenuItem,      'Callback','screencapture(gcbf,[],''printer'')');

This prints the entire figure, including the frame, menubar and toolbar (if any). If you just wish to print the figure’s content area, then make sure to create a top-level uipanel that spans the entire content area and in which all the contents are included. Then simply pass this top-level container handle to ScreenCapture:

hTopLevelContainer = uipanel('BorderType','none', 'Parent',gcf, 'Units','norm', 'Pos',[0,0,1,1]);
...
hToolbar = findall(gcf,'tag','FigureToolBar');
hPrintButton = findall(hToolbar, 'tag','Standard.PrintFigure');
set(hPrintButton, 'ClickedCallback',@(h,e)screencapture(hTopLevelContainer,[],'printer'));
hPrintMenuItem = findall(gcf, 'type','uimenu', 'tag','printMenu');
set(hPrintMenuItem,      'Callback',@(h,e)screencapture(hTopLevelContainer,[],'printer'));

In certain cases (depending on platform/OS/Matlab-release), the result may capture a few pixels from the figure’s window frame. This can easily be corrected by specifying a small offset to ScreenCapture:

set(hPrintButton, 'ClickedCallback',@(h,e)printPanel(hTopLevelContainer));
set(hPrintMenuItem,      'Callback',@(h,e)printPanel(hTopLevelContainer));
function printPanel(hTopLevelContainer)
    pos = getpixelposition(hTopLevelContainer);
    screencapture(hTopLevelContainer, pos+[2,4,0,0], 'printer');
end
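The other output targets mentioned above (a Matlab variable, an image file, or the system clipboard) follow the same (handle, position, target) calling pattern as the 'printer' calls. The following sketch assumes that screencapture.m is on the Matlab path; the file names are arbitrary, and the exact set of supported targets should be checked against the help section of the version you have installed:

```matlab
% Assumes the ScreenCapture utility (screencapture.m) is on the Matlab path
hFig = figure;
plot(rand(5));
drawnow;                                     % make sure the figure is fully rendered

imageData = screencapture(hFig);             % capture the figure into a Matlab variable
imwrite(imageData, 'myFigure.png');          % ...which can then be saved like any image

screencapture(hFig, [], 'myFigure.jpg');     % capture directly into an image file
screencapture(hFig, [], 'clipboard');        % capture into the system clipboard
```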

32 Responses

Dan November 26, 2014 at 12:22

Yair,
As always, thank you for your posts. I heartily agree with you about the frustrations with TMW’s bug list. I find it especially obnoxious that a whole slew of new bugs always get announced in the weeks before each new update is released, just so that the new release can be marked as having resolved all those extra bugs that TMW was never willing to admit existed.

Dan
Mikhail November 26, 2014 at 23:43

Hi Yair,

I share your frustration. This standard reply “consider fixing it in a future release of MATLAB” really demotivates me from reporting any further bugs.

I’ve been fighting for years now with various internationalization problems in MATLAB (for most of them I was able to come up with a workaround). Some problems are indeed getting fixed, but new ones appear as well.

It’s a never-ending battle, unless TMW starts considering their non-Latin users.
nose holes November 27, 2014 at 01:20

Hi Yair,

Couldn’t agree with you more about the unwillingness of TMW to acknowledge bugs. However, I think the support department is to blame for this. Once you get to speak to a developer things seem to speed up. That being said, the follow-up reports are dramatic.

NH
Brad Stiritz November 29, 2014 at 13:51

Hi Yair, I thought I’d share my own story and thoughts on this topic, just to counterbalance yours a bit. Since I bought my license six years or so ago, I’ve probably reported upwards of 50 confirmed bugs. I can’t be precise, as TMW recently “upgraded” their bug tracking system to be hosted on salesforce.com, and users’ complete service histories are no longer available! What I can see at present is that in the last two years, I’ve reported over 30 bugs.

Anyway, I have to give TMW credit for following-through relatively quickly, at least in the most serious cases. For example, I found a bug in the MATLAB debugging system which was corrected in the very next release. I also found a cluster of bugs in the object-oriented system, which I was able to bring to the attention of a senior development manager. He seemed very glad to hear from me & told me he planned to follow-up with his lead testing manager to try to understand how so many related bugs could have gotten through their test suites undetected. I’m pretty sure all of those have been fixed.

I do feel there are many conscientious TMWers who are super-serious about correcting bugs in their immediate domain of responsibility. Along with that manager, here are a few of the groups that I can attest show top-notch professionalism about bugs and suggestions, in some cases sending me back workarounds or patched code within days of reporting : Database Toolbox, DataFeed Toolbox, MATLAB table object.

All this said, a few bugs I’ve reported have been seemingly forgotten about. There’s a bug in the Distribution Fitting App which I still have to manually patch each release. So I’ve definitely suffered the same sense of frustration and disappointment that you expressed. You and your readers might get a laugh out of one in particular, which is so ironic as to border on the absurd :

If you click on the toolstrip button “Request Support”, and submit your request, a dialog will pop up showing you what you’ve written and all the associated info that will be submitted when you click OK. Clicking OK presents a message to the effect of “A copy of this information will be sent to the email address xxx”.

However, this isn’t actually true : you don’t get a copy of *any* of the information you submitted, other than the title! Most importantly, your description text explaining the problem is completely missing from TMW’s auto-response as well as the reply you’ll get from a support engineer. To be fair, the engineers will typically reiterate in their own words what you reported. With complex problems that require extensive back-and-forth, however, it’s annoying & unhelpful to not see the original problem statement in the email thread, as promised.

I brought this up several years ago with a senior technical manager, who was embarrassed enough to then submit his own request asking that the omission be corrected. We’re both still waiting on that one..! Maybe next year, as the Cubs fans say here in Chicago 😉 Happy Holidays!

Yair Altman November 29, 2014 at 14:11

@Brad – thanks for your detailed comment. I feel obliged to clarify my position: I am not in the business of MathWorks-bashing, far from it. Among the large software engineering companies that I know, MathWorks is probably the most conscious of user feedback, and I know for a fact that the engineers at TMW care a lot about it. It’s a great engineering company, and it really does care about its end-users. I’m also ok with some bugs not being fixed for a long time. After all, development resources are limited and priorities have to be followed.

What frustrates me is the lack of transparency: when you report a bug, you’re not told an expected ETA for a fix, nor is it posted anywhere online for us to follow, nor is the original submitter notified on progress with their reported issue. It’s all exacerbated by the support engineers’ standard response, which (as Mikhail said above) is really a demotivator for further bug reports. I feel it’s a pity that TMW disagrees with me on this transparency matter.

Brad Stiritz November 30, 2014 at 21:18

Yair, no worries, I think it’s obvious to everyone who knows you that you’re deeply passionate about MATLAB and its capabilities & potential. Your site’s popularity is a testament to the great curiosity users have about hidden internals and constructive hacks. I think you can be safely described as a leading MATLAB expert and evangelist, and from that POV your frustration is completely understandable.

Taking this discussion to a higher level, though, you rightly point out that behind corporate technical decisions are always business decisions. I think it makes perfect sense for TMW to classify bugs (and mark some bugs as “classified”) based on the perceived potential to (inadvertently) divulge proprietary implementation details.

Remember Andy Grove, the former Intel CEO? His autobiography is titled “Only the Paranoid Survive”. TMW has a number of reasons to be paranoid. They sell a high-priced flagship product which faces free-product competition on several fronts : Python, R, Octave. TMW also seems to operate with very extended design cycles– in some cases approaching patent-drug development– and seemingly without much patent protection on all that work! (Recent awards & submissions appear to skew towards Simulink)

If TMW were to publish their complete MATLAB bug-list and fix assessments, I’m guessing top open-source developers could infer very useful and practical design information that would accelerate the free apps in their quest to close the gaps with MATLAB.

I noticed in your LI profile that you used to work at PicScout. I was a customer for awhile, so that item resonated with me. PS’s business model of course was completely predicated on the existence of [and tracking of] widespread theft of artistic property. Considering how much leeway the Copyright code gives to “transformative” appropriation (cf. link below), TMW simply has to play things close to the vest: http://en.wikipedia.org/wiki/Cariou_v._Prince
Yair Altman December 1, 2014 at 00:04

@Brad – Thanks for the detailed counter-point. I still contend that the vast majority of bugs do not fall under the category you described, i.e. those that could directly help the competition. In fact, the vast majority of bugs are specific edge-cases in existing Matlab functionality, that would not provide much insight into the engine internals nor on TMW’s development roadmap, but would significantly assist existing Matlab users. Numerical inaccuracies (such as in interp1 in R2012a for example), potential crashes, memory leaks and performance hotspots are all examples of bugs that are critical for Matlab users to know in order to avoid. After all, it’s much more difficult to avoid a problem if you don’t know that it exists. The potential business downside of disclosing such bugs is IMHO far outweighed by the enormous downside to the million-plus Matlab users who develop their code oblivious to these bugs.

Would you fly an airplane that was developed using an inaccurate interp1 function having a bug that was never publicly disclosed just because someone decided it might potentially help python’s developers?! [I’m purposely using an extreme example here to illustrate my point, naturally most bugs are not as important]
Brad Stiritz December 1, 2014 at 19:20

Yair,

Let’s keep in mind that TMW in fact provides a public database of bugs. So please correct me as needed, but could we restate your main contention as something like : “the vast majority of existing, non-competitively-sensitive MATLAB bugs are currently not in the public database, and should be” –?

This would be quite a provocative claim, indeed. Maybe not quite “J’accuse!” but still slightly dramatic for our little computer-language world.. 😉

https://en.wikipedia.org/wiki/J'accuse
Yair Altman December 2, 2014 at 04:31

hmm, I’m not so sure about the melodramatics… I do know for a fact that MathWorks keeps an entirely separate issue-tracking system – you can see numerous references to bug IDs such as g579710 (in javacomponent.m) or g479211 (open.m) or g368739 (mplayinst.m), which indicate that hundreds of thousands of separate issues are tracked in the “real” system – most of them are probably closed or not bugs (rejected/enhancements/…), but nonetheless it’s still hundreds of times larger than the “official” public database. I don’t accept a claim that all these issues were found to be business-sensitive enough to warrant hiding them from public view — a few percent possibly, but not 99.9%.

Again, my main point is the lack of transparency. A related aspect of the same lack of transparency is the fact that after submitting a bug, you are left entirely in the dark about it. In all other bug-tracking systems that I’ve used to date (and there were many), the original issuer is automatically notified when anything is done with the issue (additional info, changed status/priority etc). The fact that users cannot effectively determine whether a problem is a known Matlab issue or due to a problem on their specific platform/program, and once they report something they are often left in the dark, conveys a message of disrespect by MathWorks to the time and energy of their users – a message that is in complete contradiction to the care that I know MathWorks engineers do feel. Moreover, I’d venture a guess that many support calls could be avoided (translating into financial savings for MathWorks) if users had the full issue database to browse.

Again, let’s not get carried away with melodrama here. But it is indeed frustrating, especially for people like me who care about Matlab (who care enough to continue spending time and energy submitting bug reports and workarounds despite the frustration and the fact that it may already be a known issue).
Brad Stiritz December 3, 2014 at 14:34

Yair, thanks for the insight into this and for mentioning the separate g-system. Apologies for the exaggerated humor, that’s a bad habit of mine.

To wrap up my comments on a serious note, I’ve heard TMW’s business model described as very conservative and cautious. Their Glassdoor page shows excellent employee ratings of the CEO, Jack Little. This suggests the company is indeed making a good faith effort to follow through on their public commitment to rational processes, credibility & integrity (ref’s below)

Given the cost and trouble of maintaining two separate bug databases, I’m guessing their board likely would have had to sign off on the plan. Taking your estimate at face value, the cost-benefit analysis must have concluded that there was net downside to revealing much at all of the internal bug database. So I think we can presume TMW truly feels their dual-bug-database architecture & low-transparency stance was the “right answer and [the] best way to do things”, to quote their maxim.

You may be roughly correct in your assessment of the potential scope for increased transparency, but the weight of the evidence points to a strong internal argument against fuller disclosure. Not knowing that argument, I don’t want to rubber-stamp their choice. It would be very interesting to hear TMW’s perspective on this, as well as individual feelings about the issue within management.

References :
http://www.mathworks.com/company/aboutus/mission_values/values/rational.html
http://www.glassdoor.com/Reviews/MathWorks-Reviews-E17117.htm

John December 1, 2014 at 10:28

If you post a bug with a workaround, that bug will become far less likely to be fixed. For instance, you can write some code to make diary save Arabic text correctly. While there is a bug (in 20 year old code? Or is this just an old design – to use ASCII – being pushed past its limits as Matlab is used worldwide?), since it is not blocking anyone, it will get bumped down on the priority list. The time saved to convert and test diary to use unicode can be better spent on bugs that are actually blocking users, or on new features that paying customers want to see.

It’s all a tradeoff. Dev time is not unbounded and in the next X months, would you rather fix Y bugs that people can avoid, or Z bugs that people cannot avoid (think crashes and such).

Brad Stiritz December 1, 2014 at 18:59

@John

>If you post a bug with a workaround, that bug will become far less likely to be fixed.

I don’t see how you can be sure of this without reference to a controlled study? I’ve submitted bugs with workarounds twice. Both workarounds were verified by tech support. One was fixed immediately for the next release. The other is still lingering after several years.

Again, going back to my previous comment above, I feel there’s significant variation among TMW developers. Some just take a high degree of personal pride in their own code bases. Others apparently not so much, and seem to be able to avoid consequences for their buggy code (or else their replacements aren’t tasked with fixing the legacy problems).

On the bright side, I did see a TMW job posting not long ago that IIRC asked for 5 years of experience developing defect-free code. So hopefully the company is trying to raise the bar over time. FWIW, I think Google had the right approach, in recognizing early on that they had to continually tighten the entrance requirements for new hires. Larry or Sergey said they themselves wouldn’t get in the door now.

Yaroslav December 3, 2014 at 05:38

The non-Latin alphabet issues are not unique to the diary function. For example, if one wishes to document a script (via comments) in a language other than English, then sometimes the editor shows the letters correctly (e.g., Russian), and at other times it shows gibberish (e.g., Hebrew). What is really frustrating is that after exiting Matlab and reopening it, every non-standard character becomes a question-mark (meaning, both Russian and Hebrew in the previous example).

Given that the diary function deals with automatic I/O, I’m not surprised it behaves that way. Moreover, I’ve seen similar behaviour in GUI components as well. It seems, that Matlab’s char type cannot include the aforementioned alphabets.

Regarding the workaround of diary: @Yair, is there a way to upgrade it to have an automatic behaviour (like diary on and diary off)?

On a personal note, I don’t think that ranting about TMW bug system will do any good. In my personal experience, they do their best to address reported issues; it usually takes them a while to fix these, though. One of the purposes of this blog, IMHO, is to show workarounds and present new ideas—which is much more fruitful than complaining.

Yair Altman December 4, 2014 at 14:39

@Yaroslav – I plan to wrap it within a diary-compliant File Exchange utility in the near future. Hang on…

Steve Eddins December 10, 2014 at 07:35

Note: these are my opinions. I’m not speaking on behalf of MathWorks.

By long-standing company habit, MathWorks tends not to talk publicly about internal processes and policies. This habit leads to a fair amount of guesswork by folks who are understandably interested in why we do what we do. I confess that it does make me a little sad when those guesses sometimes assume nefarious intent.

This comment thread includes a lot of speculation about our bug tracking processes. I thought I might be able to clear up a few things about that. First of all, there is only one bug tracking database used by development. It was created in-house more than 20 years ago and has been in continuous use since then. It isn’t SalesForce; that’s used by the tech support organization for tracking customer cases. Development’s tracking database is used for bugs, enhancements, and tasks. The earliest record that I personally created in the system was record number 685 on January 10, 1994. At that time I had been a MathWorks employee for 34 days.

Externally visible bug reports do not come from a separate bug database. Instead, they are published directly from development’s tracking system. The published bug reports are typically written by the development team, with technical, documentation, and quality engineering review.

There are indeed a large number of records in the system. (The latest records have 7 digits!) There are several reasons why the number of records is much larger than what you see in the published bug reports:

* The database has been in use for a very long time, and most of the records in the database are now closed.

* The database is used not just for bugs, but also for enhancements and tasks.

* Bug records in the database include a large number of relatively trivial issues, such as documentation typos. We typically don’t publish bug reports for these.

* Most bugs tracked in the database are found and fixed during product development and before release. They are never experienced by customers.

There is some judgment involved in selecting which bug records are chosen for publication. I will pass along a suggestion to review our procedures and policies to see if we can improve in this area.

There is no nefarious intent in the timing of publishing bug reports. In our regular six-month routine, there is a stage toward the end of the development process when development teams are nagged to make sure their bug reports are up-to-date and to check whether any recently-reported bugs in the system should be published. That typically results in a flurry of new published bug reports.

Yair Altman December 10, 2014 at 07:44

Many thanks for the clarifications Steve

Steve Eddins December 10, 2014 at 08:09

Hi again, Yair,

You commented that the MATLAB char type is 2 bytes (sort of correct but not the full story), so you speculate that Chinese, Japanese, and Korean will require a different solution (not correct).

MATLAB uses UTF-16 to represent Unicode characters. Like UTF-8, UTF-16 is a variable-length encoding that is capable of representing all of the more than one million Unicode characters. Some of those characters are represented using one 16-bit code unit, and some are represented using two 16-bit units. There are already localized versions of MATLAB available in Japanese, Korean, and Chinese, so we know this is working.

MATLAB has a very large code base, some of which dates back to long before the modern version of Unicode (2.0, 1996) was in place. The diary function implementation, in particular, is very old. I don’t know what its implementer was thinking, but I’m pretty sure he wasn’t thinking about the not-yet-invented Unicode 2.0. MathWorks began to convert all the string handling in the MATLAB code base to UTF-16-encoded Unicode a while back (ten years?). It’s been a large effort, and we have approached it incrementally. The diary function is one of the last areas of the code base to be updated.

(Again, these are my opinions.)

Yair Altman December 10, 2014 at 08:20

Thanks again for this tidbit. Please let the relevant people know that the Editor and Desktop also don’t work well with non-ASCII. Other people on this comments thread have also reported similar issues. So diary is certainly not the last remaining non-unicoded issue (and possibly even less important than the other i18n/L10n issues).
Steve Eddins December 10, 2014 at 09:10

Will do. With respect to relative priority, there are different teams working on different parts of the code base, so completion order doesn’t always equal overall priority order. It depends on what else each team is working on, as well as on the complexity of the required code changes.

Amro December 10, 2014 at 21:45

@SteveEddins:

Does MATLAB really use UTF-16 for characters, or is it actually UCS-2? [1]

UCS-2 is a fixed-length encoding that uses exactly two bytes for each character. This is an older scheme that precedes the Unicode 2.0 standard. It can only represent code points in the range U+0000 to U+FFFF, called the Basic Multilingual Plane (BMP) [2]. This contains the most frequently used characters from many scripts including Arabic, Hebrew, and Chinese/Japanese/Korean (CJK) [3].

On the other hand, UTF-16 is a variable-length encoding that uses 2 or 4 bytes. It can represent any character in the entire Unicode space. The first BMP plane is encoded identical to UCS-2 with 2 bytes. Supplementary characters in other planes are encoded using 4 bytes with surrogate pairs. This contains stuff like CJK ideographs extensions, historic and dead languages, extra mathematical symbols, Emoji and Emoticons, etc…

From my understanding, MATLAB is limited to the first 65536 code points from plane 0 (which indicates it’s really UCS-2 underneath not UTF-16).
Try the following to confirm [4][5]:

% this works
x = hex2dec('20AC');     % U+20AC == 8364 
c = char(x)
 
% this doesn't work, and gets truncated at 65535
x = hex2dec('10437');     % U+10437 == 66615
c = char(x)
Warning: Out of range or non-integer values truncated
during conversion to character.

Same thing happens with functions like NATIVE2UNICODE and UNICODE2NATIVE, or by opening files using FOPEN.
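For illustration, the surrogate-pair arithmetic described above can be sketched as follows. The snippet only computes the two 16-bit code units that UTF-16 uses for U+10437 and stores them in a char vector; whether a given Matlab release then treats that pair as a single character is not something this snippet establishes.

```matlab
cp = hex2dec('10437');                   % a code point outside the BMP
v  = cp - hex2dec('10000');              % 20-bit offset into the supplementary planes

hiUnit = hex2dec('D800') + floor(v/1024);   % high (leading) surrogate: top 10 bits
loUnit = hex2dec('DC00') + mod(v,1024);     % low (trailing) surrogate: bottom 10 bits

fprintf('U+%s -> %s %s\n', dec2hex(cp), dec2hex(hiUnit), dec2hex(loUnit));

% Both units are below 65536, so char() accepts them without the truncation warning
c = char([hiUnit loUnit]);
fprintf('Stored as %d 16-bit code units\n', numel(c));
```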

Some references:
[1]: https://en.wikipedia.org/wiki/UTF-16
[2]: https://en.wikipedia.org/wiki/Plane_%28Unicode%29#Basic_Multilingual_Plane
[3]: https://en.wikipedia.org/wiki/CJK_Unified_Ideographs
[4]: http://www.fileformat.info/info/unicode/char/20ac/index.htm
[5]: http://www.fileformat.info/info/unicode/char/10437/index.htm

—

@YairAltman:

MATLAB is capable of representing CJK and other non-Latin characters (as long as they lie inside the BMP plane). Older versions had trouble displaying these characters in plots and GUIs (one had to explicitly use Java Swing components to display such text [6]). Fortunately MATLAB R2014b greatly improved the situation [7], and is now capable of displaying multilingual text on its own (assuming a font capable of displaying the text).

Here’s an example in R2014b:

[Figure: UTF title in Matlab axes]

% I hope your blog engine doesn't mess this up!
% (the trailing ... continues the cell array onto the next line)
str = {'你好 (Chinese)', 'こんにちは (Japanese)', '안녕하세요 (Korean)', ...
  '(Arabic) مرحبا', '(Hebrew) שלום', 'नमस्ते (Hindi)', 'สวัสดี (Thai)', ...
  'привет (Russian)', 'γεια σας (Greek)', [char(9760:9764) ' (Symbols)'], ...
  [char([9405 7873 313 317 7897]) ' (Latin)']}'
title(str, 'FontName','SansSerif', 'FontSize',18, 'FontWeight','Normal')

[6]: http://stackoverflow.com/a/6872642/97160
[7]: http://www.mathworks.com/products/matlab/matlab-graphics/#multilingual_text_and_symbols

Yair Altman December 11, 2014 at 01:03

@Amro – non-Latin text used to display correctly in Matlab axes up to around 2010 (I don’t remember which version exactly). It remained broken until it was fixed in 14b’s HG2.

[this part was edited out in order not to confuse readers]

Amro December 11, 2014 at 02:06 Reply
@Yair: Are you sure about UICONTROL? I’ve tried it again in R2014b, and the control displays Hebrew text correctly without any hacks. I’m running Windows 8.1 if that makes any difference:
uicontrol('String',char(1495:1499)) % label shown correctly
This worked fine in earlier versions as well (as you know uicontrols are based on Swing components). What changed for me in R2014b was the regular plot functions (title, xlabel/ylabel, text, etc..) and the command window, which now support Unicode text (again only the BMP plane).

Yair Altman December 11, 2014 at 02:13
@Amro – you’re right, my bad: I was working on 14a this morning, where uicontrols indeed do not display the non-Latin text without the Java hack. It works correctly in 14b. Sorry for the mixup.

For the reference of others, in 14a and earlier we can use the underlying Java components of the uicontrols to display non-Latin labels (unnecessary in 14b):
hButton = uicontrol('String',char(1495:1499));  % no label displayed in 14a; ok in 14b
jButton = findjobj(hButton);
jButton.setText(char(1495:1499));  % label shown ok

Amro December 11, 2014 at 02:25 Reply
@Yair: you can also make it work in R2014a and earlier, you just have to change the default character set (undocumented of course!):
% works in R2014a
feature('DefaultCharacterSet','UTF-8')
uicontrol('String',char(1495:1499))
On my Windows machine with “en-US” locale, the default charset was set to “windows-1252”.

Yair Altman December 11, 2014 at 15:47

@Amro – where have you been all these years?! – this would have deserved a dedicated post…

Anything else you’re keeping up your sleeve?

Steve Eddins December 11, 2014 at 12:42 Reply

MATLAB uses UTF-16 internally for character representation. Some parts of MATLAB don’t know yet what to do with code points above 65535.

Amro December 12, 2014 at 01:47 Reply

@SteveEddins:

Thanks for the response Steve. I took another shot at the problem, and I’ve managed to work with code points outside the BMP plane this time (i.e. U+010000 to U+10FFFF)!

As I’ve shown before, the CHAR function unfortunately does not know how to deal with those code points directly; the trick is to use the NATIVE2UNICODE function to convert a vector of bytes (encoded using any of the supported encodings) back to the abstract characters they represent.

So my previous example can be written as:

% U+10437 is encoded in UTF-8 as 4-bytes: 0xF0 0x90 0x90 0xB7
cc = native2unicode(hex2dec({'F0' '90' '90' 'B7'})', 'UTF-8')
 
% MATLAB internally uses UTF-16, which in this case
% is encoded as a surrogate pair: 0xD801 0xDC37
>> whos cc
  Name      Size            Bytes  Class    Attributes
  cc        1x2                 4  char               
 
>> double(cc)
ans =
       55297       56375      % high/low surrogates
 
% using an appropriate font, R2014b is now capable of showing it in a plot!
title(cc, 'FontSize',144, 'FontName','Segoe UI Symbol')

This does confirm that MATLAB is using UTF-16 internally to represent characters, even though there are a couple of places that still need attention… Sorry for doubting you guys 🙂
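
For reference, the surrogate pair shown above can be derived by hand from the code point. Here is a minimal sketch of the standard UTF-16 arithmetic; the variable names are mine and only for illustration:

```matlab
% Derive the UTF-16 surrogate pair for a supplementary (non-BMP) code point
cp = hex2dec('10437');                     % U+10437, beyond the BMP
v  = cp - hex2dec('10000');                % 20-bit offset into the supplementary planes
hi = hex2dec('D800') + floor(v / 1024);    % high surrogate: top 10 bits of the offset
lo = hex2dec('DC00') + mod(v, 1024);       % low surrogate: bottom 10 bits of the offset
fprintf('%04X %04X\n', hi, lo)             % prints: D801 DC37
cc2 = char([hi lo]);                       % each half fits in one 16-bit char element
```

Since each surrogate is below 65536, CHAR accepts both halves without the truncation warning, and double(cc2) gives the same 55297 and 56375 as the NATIVE2UNICODE route.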

—

I hope Yair doesn’t mind me documenting this stuff here, but in the spirit of filing bugs, here are some of my findings:

– the CHAR function truncates characters outside the range of the BMP plane.
– even if we manage to store a non-BMP character like I did in the above example, there is a problem of “leaky abstractions”: the resulting character has length(cc)==2, while it’s supposed to be a single character! This means that indexing into a string is messed up as well, and won’t always be what we expect: str(5:10)
– the MATLAB editor does not handle files with BOM markers well (either UTF-8 or UTF-16).
– reading files with UTF-16 text using the FOPEN/FREAD functions has some issues. I guess that’s why UTF-16 is not listed as a supported encoding in the FREAD doc page, even though it is accepted as an option value.
– the command window in the IDE has small rendering issues with Unicode text that involves combining diacritical marks. Take for example the letter a with accents above and below: char(hex2dec({'0061', '0301', '0317'})') or 'á̗'
– other advanced Unicode functionality: collation, case mapping, text segmentation, regular expressions, etc. For instance sort({'a', 'z', 'e', 'é', 'ä'}) sorts by code point value rather than alphabetically: the accented characters should come right after their unaccented counterparts, not last after the 'z'. Another example is case conversion, where in German upper('ß') should be 'SS'!
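
The length and indexing problem in the second item can be worked around by counting code points rather than char elements. This is only a rough sketch, assuming the string contains nothing but well-formed surrogate pairs; the variable names are hypothetical:

```matlab
% Count Unicode code points in a char vector that may contain surrogate pairs
cc  = char([55297 56375]);                 % U+10437 as a surrogate pair (values from the example above)
str = ['ab' cc 'cd'];                      % 6 char elements, but only 5 characters
u   = double(str);
isLow = (u >= hex2dec('DC00')) & (u <= hex2dec('DFFF'));   % second halves of surrogate pairs
numElements   = numel(str);                % counts UTF-16 code units
numCodePoints = numElements - nnz(isLow);  % counts actual characters
startIdx  = find(~isLow);                  % element index where each character begins
thirdChar = str(startIdx(3):startIdx(4)-1);  % the whole pair, not just half of it
```

Indexing through startIdx keeps both halves of a pair together, which plain str(3) would not.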

It’s worth mentioning that MATLAB is not the only language that suffers from the “UTF-16 curse” (JavaScript comes to mind!). I only wish TMW had chosen UTF-8 instead, which many agree is a better encoding in general:
– http://utf8everywhere.org/
– http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful
This is especially true when working at the level of MEX-files.

Martin December 12, 2014 at 03:32 Reply
Very good and insightful comments here. Personally, I feel that nowadays the most sensible text encoding is UTF-8 and I wish it would become the default in MATLAB sooner rather than later. My workflow is very English-centric, but even I want to credit someone or reference an article once in a while, at which point I want to use all kinds of diacritics and maybe even different alphabets.

My approach to ensure this is: Check in my startup.m that
feature('DefaultCharacterSet')
returns UTF-8. If not, check via
feature('Locale')
which locale MATLAB has detected for my system and alter $MLROOT/bin/lcdata.xml accordingly (change the encoding attribute of the matching locale element).
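
A check along these lines could go in startup.m. Treat it as a sketch only: feature is undocumented, so its behavior may differ between releases, and the warning identifier is a made-up name:

```matlab
% startup.m fragment: warn if MATLAB did not start with UTF-8 as its default charset
cs = feature('DefaultCharacterSet');       % undocumented call; returns the charset name
if ~strcmpi(cs, 'UTF-8')
    % 'startup:charset' is a hypothetical identifier chosen for this example
    warning('startup:charset', 'Default character set is %s, not UTF-8', cs);
    disp(feature('Locale'))                % shows which locale MATLAB detected
end
```

This only reports the mismatch; fixing it still means editing lcdata.xml as described above and restarting MATLAB.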

I haven’t really tested it, but I could imagine that there are fewer issues with this method than with changing the encoding via the above feature command, which would only happen after all of MATLAB is loaded and the editor has already reopened my files with the wrong encoding.

I guess the choice of UTF-16 as the internal encoding is understandable, given that this is the encoding on an API level on Windows and also the encoding underlying all of Java.

Ed Yu December 15, 2014 at 13:53 Reply

Hi all,

I recently came across a MATLAB bug that completely blew my mind (for over a year)… Imagine you have an MCR application that executes fine the first time, fails the second time (some Windows DLL error), works the third time, fails the fourth time, and so on and so forth…

We talked to MATLAB support and they guided us to run tests and gather information, the standard stuff that you do to try and find out what’s wrong. This bug was especially difficult to figure out because the application ran fine for the majority of our clients and we could not duplicate the issue in house… The clients for whom it ran fine are US clients (hint, hint…); the clients that had the problem are all from outside the US (hint, hint…). Imagine having to coordinate the information gathering with our clients across timezones (we are on US Mountain time, and the clients with the issue are in England, Denmark, etc.)…

So one night, as I sat watching TV late at night, a light bulb suddenly lit up… It has something to do with the Locale settings (actually, with the default character set) within MATLAB… Our development machines all run a Windows US locale, and the machine that produces the MCR executable is also a US-locale machine. After some quick checking, setting Windows to a non-US locale and running our US-locale-compiled MCR app, we finally reproduced the error. When we subsequently talked to MathWorks, they admitted that it is a known bug: running a compiled MCR application on another computer with a different locale causes an error message to be displayed. If the application is compiled as a DOS application, the message appears in the DOS command window and we can see it. But if the app is compiled as a Windows application (no DOS window while the app is running), the output error message causes MCR to crash…

The workaround is to compile our application as a traditional app with a DOS command window, or alternatively patch our MATLAB R2013b (MCR) or upgrade to MATLAB R2014a (where the bug is fixed).

So the lesson of this story is that it is really unfortunate that MathWorks is not sharing bug reports which can cause a lot of grief for MATLAB programmers…

Lastly, thank you Amro for sharing how to change the default character set within MATLAB, which might actually have helped us avoid this bug had you posted it a couple of weeks earlier…
Unicode paths with MATLAB - HTML CODE January 11, 2016 at 13:39 Reply

[…] useful info on how matlab handles filenames (and characters in general) available in comments of this undocumentedmatlab post (especially steve eddins, works @ mathworks). in […]
GGa April 9, 2019 at 16:41 Reply

There isn’t anything “outside the UTF-8 range”.
UTF-8 can represent any Unicode character, as far as I know.

Yair Altman April 9, 2019 at 17:25 Reply

Typo corrected, meant ASCII of course (ASCII is represented by 1 byte in UTF-8, and 2 bytes in UTF-16)
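
The byte counts can be checked with UNICODE2NATIVE. This is a small sketch: I use 'UTF-16LE' rather than plain 'UTF-16' on the assumption that it avoids a byte-order mark being counted, and the variable names are mine:

```matlab
% Compare encoded sizes of an ASCII letter and an Arabic letter
asciiChar  = 'A';                          % U+0041
arabicChar = char(1587);                   % U+0633, the first letter of salaam
nAsciiUtf8   = numel(unicode2native(asciiChar,  'UTF-8'));
nAsciiUtf16  = numel(unicode2native(asciiChar,  'UTF-16LE'));
nArabicUtf8  = numel(unicode2native(arabicChar, 'UTF-8'));
nArabicUtf16 = numel(unicode2native(arabicChar, 'UTF-16LE'));
fprintf('ASCII:  %d byte(s) in UTF-8, %d in UTF-16\n', nAsciiUtf8,  nAsciiUtf16);
fprintf('Arabic: %d byte(s) in UTF-8, %d in UTF-16\n', nArabicUtf8, nArabicUtf16);
```

The ASCII letter should take one byte in UTF-8 and two in UTF-16, while the Arabic letter takes two bytes in either encoding.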
