Reading non-Latin text files

In the spirit of the Jewish New Year that begins tonight, I would like to share a workaround that I received from blog reader Ro’ee Gilron of Tel-Aviv University:

Matlab users who use non-Latin computer Locales are aware of the issues that the Matlab Command Window has had with such languages for many years. I am not sure whether these problems are due to the RTL (right-to-left) nature of Hebrew/Arabic, or their use of a non-supported code-page, or some other reason. To this day (R2011b), I am not aware of any fix or workaround for these issues.

But it seems that in addition, Matlab has a problem reading files that contain text in these languages, even when the computer’s Locale is set correctly, to a Locale that supports the non-Latin text. This is where Ro’ee’s workaround helps. In his words:

To give some more background, this used to work on a 32-bit system with an older version of Matlab (7.1). Now it doesn’t. Saving the file in UTF-8 and using fopen and textscan instead of importdata gives me this:

nowords =
‘שלבק’
‘התלכב’
‘× ×™×›×˜×¨’
‘תלפורש’
‘×œ×§×˜× ‘
‘מזוחש’
‘שלטיק’
‘טיבר’
‘עולג’
‘סלבוחד’
‘משוחגות’
‘מלוגסות’
‘סבק’
‘צמשר’
‘הכריב’
‘תמציל’

The solution is as follows (requires Simulink):

1) Change system Locale to Hebrew: http://windows.microsoft.com/en-US/windows7/Change-the-system-locale

(this doesn’t change the language of the OS etc.).

2) Change the encoding that Matlab uses:
http://www.mathworks.com/help/toolbox/simulink/slref/slcharacterencoding.html

They tell you not to, but I did… You must change it to an encoding that works for Hebrew; the list of encoding names is here: http://www.iana.org/assignments/character-sets

Any other language should work as well (I hope…). For Hebrew the code that works for me is ISO_8859-8

3) You should now be able to read TXT files that have Hebrew characters in them.

>> a='הצלחה!'
a =
!
 
>> currentCharacterEncoding = slCharacterEncoding();
>> currentCharacterEncoding = get_param(0, 'CharacterEncoding')  % equivalent alternative
currentCharacterEncoding =
windows-1252
 
% Now modify the default encoding to something more useful
>> slCharacterEncoding('ISO_8859-8')
>> set_param(0, 'CharacterEncoding', 'ISO_8859-8');   % equivalent alternative
 
>> currentCharacterEncoding = slCharacterEncoding()
currentCharacterEncoding =
ISO-8859-8
 
>> a='הצלחה!'
a =
!                  % still no good in the Command Window...
 
% Let's try to read a file with some Hebrew words:
>> neutral = importdata('neutral.txt')
neutral = 
    'שולחן'
    'כסא'
    'מנורה'
    'צלחת'
    'סיר'
    'מזלג'
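
Since the documentation advises against changing the encoding, it may be prudent to change it only for the duration of the import and then restore the previous value. The following is a minimal sketch of that idea, not a tested recipe: the function name importdataWithEncoding is hypothetical, it assumes that Simulink is installed (slCharacterEncoding is a Simulink function), and I believe the documentation also asks that all open models be closed before switching encodings.

```matlab
function data = importdataWithEncoding(filename, encoding)
% importdataWithEncoding  Hypothetical helper (not part of Matlab):
% temporarily switch the character encoding, import a text file,
% and restore the previous encoding when the function exits.
% Assumption: Simulink is installed, since slCharacterEncoding belongs to it.

    % Remember the encoding that is active right now
    previousEncoding = slCharacterEncoding();

    % onCleanup runs the restore step even if importdata throws an error
    restoreEncoding = onCleanup(@() slCharacterEncoding(previousEncoding)); %#ok<NASGU>

    % Switch to the requested encoding, e.g. 'ISO-8859-8' for Hebrew
    slCharacterEncoding(encoding);

    % Read the file while the requested encoding is in effect
    data = importdata(filename);
end
```

Saved as importdataWithEncoding.m, this could be called as neutral = importdataWithEncoding('neutral.txt', 'ISO-8859-8'); the Command Window display limitation shown above is unaffected by it.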

So, it appears that while we did not solve the problems with the Command Window, at least we can now read the prayer book for our New Year prayers…
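
For readers without Simulink, another thing worth trying is to tell fopen explicitly which encoding the file uses, via its optional fourth input argument. The sketch below is only an illustration under stated assumptions: the function name readLinesWithEncoding is hypothetical, the file is assumed to hold one word per line, and I have not verified that this avoids the garbled output on the specific system described above.

```matlab
function lines = readLinesWithEncoding(filename, encoding)
% readLinesWithEncoding  Hypothetical helper (not part of Matlab):
% read a text file into a cell array of lines, decoding the bytes
% with an explicitly specified encoding rather than the system default.

    if nargin < 2
        encoding = 'UTF-8';   % assumed default for this sketch
    end

    % The 4th fopen argument is the file's character encoding;
    % 'n' keeps the native machine (byte-ordering) format
    fid = fopen(filename, 'r', 'n', encoding);
    if fid < 0
        error('readLinesWithEncoding:cannotOpen', ...
              'Cannot open file: %s', filename);
    end
    closeFile = onCleanup(@() fclose(fid)); %#ok<NASGU>

    % One string per line; an empty Whitespace setting preserves spaces
    scanned = textscan(fid, '%s', 'Delimiter', '\n', 'Whitespace', '');
    lines = scanned{1};

    % Some editors prepend a byte-order mark (U+FEFF) to UTF-8 files;
    % drop it if it is the first character of the first line
    if ~isempty(lines) && ~isempty(lines{1}) && double(lines{1}(1)) == 65279
        lines{1}(1) = [];
    end
end
```

Saved as readLinesWithEncoding.m, it could be called as neutral = readLinesWithEncoding('neutral.txt', 'UTF-8'). Even when the characters are read correctly into memory, the Command Window may still fail to display them, as shown above.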

Let this be a year of fulfillment, prosperity, health and happiness to all. Shana Tova everybody!

Categories: Low risk of breaking in future versions, Stock Matlab function


2 Responses to Reading non-Latin text files

  1. Nir says:

    Do you know how I can change the character encoding from within compiled code?

    I.e., set_param(0, 'CharacterEncoding', 'ISO_8859-8') could not be added to the compiled Matlab exe file.

    Thanks

    • Try to place this command in a startup.m file in your code folder, and then recompile your application. I’m not sure it will help, but it’s worth a try.
