Undocumented Matlab

Parsing XML strings

February 1, 2017

I recently consulted on a project where data was provided as XML strings and needed to be parsed in Matlab memory as quickly as possible. Granted, XML is a rather inefficient way to store data (JSON, for example, would be much better for this). But I had to work with the given situation, and that required processing the XML.
I had two main alternatives:

  • I could either create a dedicated string-parsing function that searches for a particular pattern within the XML string, or
  • I could use a standard XML-parsing library to create the XML model and then parse its nodes

The first alternative is quite error-prone, since it relies on the exact format of the data in the XML. Since the same data can be represented in multiple equivalent XML ways, making the string-parsing function robust as well as efficient would be challenging. I was lazy (or rather, expedient), so I chose the second alternative.
Unfortunately, Matlab’s xmlread function only accepts input filenames (of *.xml files); it cannot directly parse XML strings. Yummy!
The obvious and simple solution is to write the XML string into a temporary *.xml file, read it with xmlread, and then delete the temp file:
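As a small illustration of why the string-parsing route is fragile, here is a minimal sketch. The XML snippets, tag names and the regular expression are made up for this example. A pattern that works for one representation silently fails for an equivalent one that merely adds an attribute:

```matlab
% Two hypothetical XML representations of the same price value
xmlA = '<quote><price>12.5</price></quote>';
xmlB = '<quote><price currency="USD">12.5</price></quote>';

% A naive pattern that expects the bare <price> tag
pattern = '<price>([0-9.]+)</price>';
tokA = regexp(xmlA, pattern, 'tokens', 'once');  % one token: '12.5'
tokB = regexp(xmlB, pattern, 'tokens', 'once');  % empty: the attribute breaks the pattern

priceA = str2double(tokA{1});
foundB = ~isempty(tokB);
```

The pattern could of course be extended to tolerate attributes, but then whitespace, comments, CDATA sections and so on would each need the same treatment, which is the work that a real XML parser already does.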

% Store the XML data in a temp *.xml file
filename = [tempname '.xml'];
fid = fopen(filename,'Wt');
fwrite(fid,xmlString);
fclose(fid);
% Read the file into an XML model object
xmlTreeObject = xmlread(filename);
% Delete the temp file
delete(filename);
% Parse the XML model object
...
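If the temp-file approach is kept, it may be worth making sure that the file is removed even when xmlread throws an error (for example, on malformed XML). The following is only a sketch of one way to do this, using onCleanup. The function name xmlReadViaTempFile is hypothetical, and the code assumes that the XML text should be stored as UTF-8:

```matlab
function xmlTreeObject = xmlReadViaTempFile(xmlString)
% Hypothetical helper: parse an XML string via a temporary file and
% remove that file even if parsing fails.
    filename = [tempname '.xml'];
    fid = fopen(filename, 'w');
    if fid < 0
        error('xmlReadViaTempFile:cannotOpen', 'Cannot create %s', filename);
    end
    % Delete the temp file whenever this function exits (normally or on error)
    cleanupObj = onCleanup(@() delete(filename));
    try
        % Assumption: the XML data is meant to be encoded as UTF-8
        fwrite(fid, unicode2native(xmlString, 'UTF-8'), 'uint8');
        fclose(fid);
    catch err
        fclose(fid);
        rethrow(err);
    end
    % Read the file into an XML model object
    xmlTreeObject = xmlread(filename);
end
```

A call such as xmlTreeObject = xmlReadViaTempFile(xmlString) then replaces the inline code above.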


This works well and we could move on with our short lives. But cases such as this, where a built-in function seems to have a silly limitation, really fire up the investigative reporter in me. I decided to drill into xmlread to discover why it couldn’t parse XML strings directly in memory, without requiring costly file I/O. It turns out that xmlread accepts not just file names as input, but also Java object references (specifically, java.io.File, java.io.InputStream or org.xml.sax.InputSource). In fact, there are quite a few other inputs that we could use, to specify a validation parser etc. I wrote about this briefly back in 2009 (along with other similar semi-documented input alternatives in xmlwrite and xslt).

In our case, we can simply pass xmlread a java.io.StringBufferInputStream(xmlString) object (a subclass of java.io.InputStream) or an org.xml.sax.InputSource(java.io.StringReader(xmlString)) object:

% Read the xml string directly into an XML model object
inputObject = java.io.StringBufferInputStream(xmlString);                % alternative #1
inputObject = org.xml.sax.InputSource(java.io.StringReader(xmlString));  % alternative #2
xmlTreeObject = xmlread(inputObject);
% Parse the XML model object
...
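The following sketch shows one possible way to go from an XML string to Matlab data entirely in memory. The sample XML and its element and attribute names are invented for the example. Java's documentation marks java.io.StringBufferInputStream as deprecated because it does not properly convert characters into bytes, so the sketch uses a java.io.ByteArrayInputStream over explicitly encoded UTF-8 bytes instead. This is my substitution, not something the approach above requires:

```matlab
% Hypothetical sample data (element and attribute names are made up)
xmlString = ['<quotes>' ...
             '<quote symbol="IBM"><price>150.25</price></quote>' ...
             '<quote symbol="AAPL"><price>135.5</price></quote>' ...
             '</quotes>'];

% Encode the string explicitly as UTF-8 bytes and wrap them in an InputStream
xmlBytes = unicode2native(xmlString, 'UTF-8');
inputObject = java.io.ByteArrayInputStream(xmlBytes);
xmlTreeObject = xmlread(inputObject);

% Walk the <quote> nodes (note: DOM node lists are 0-based)
quoteNodes = xmlTreeObject.getElementsByTagName('quote');
numQuotes = quoteNodes.getLength;
symbols = cell(1, numQuotes);
prices = zeros(1, numQuotes);
for idx = 1 : numQuotes
    node = quoteNodes.item(idx-1);
    symbols{idx} = char(node.getAttribute('symbol'));
    priceNode = node.getElementsByTagName('price').item(0);
    prices(idx) = str2double(char(priceNode.getTextContent));
end
```

At the command prompt the returned object typically displays as [#document: null]. That is only the DOM document's text representation, not an indication that parsing failed.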


If we don’t want to depend on undocumented functionality (which might break in some future release, although it has remained unchanged for at least the past decade), and in order to improve performance even further by bypassing xmlread’s internal validity checks and processing, we can use xmlread’s core functionality to parse our XML string directly. We can add a fallback to the standard (fully-documented) functionality, just in case something goes wrong (which is good practice whenever using any undocumented functionality):

try
    % The following avoids the need for file I/O:
    inputObject = java.io.StringBufferInputStream(xmlString);  % or: org.xml.sax.InputSource(java.io.StringReader(xmlString))
    try
        % Parse the input data directly using xmlread's core functionality
        parserFactory = javaMethod('newInstance','javax.xml.parsers.DocumentBuilderFactory');
        p = javaMethod('newDocumentBuilder',parserFactory);
        xmlTreeObject = p.parse(inputObject);
    catch
        % Use xmlread's semi-documented inputObject input feature
        xmlTreeObject = xmlread(inputObject);
    end
catch
    % Fallback to standard xmlread usage, using a temporary XML file:
    % Store the XML data in a temp *.xml file
    filename = [tempname '.xml'];
    fid = fopen(filename,'Wt');
    fwrite(fid,xmlString);
    fclose(fid);
    % Read the file into an XML model object
    xmlTreeObject = xmlread(filename);
    % Delete the temp file
    delete(filename);
end
% Parse the XML model object
...
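To check how much the different approaches actually differ on a given machine, a rough timing harness along the following lines could be used. This is only a sketch: the sample XML string and the number of runs are arbitrary choices, and the results will depend on the Matlab release, the disk and the size of the XML data.

```matlab
xmlString = '<root><item id="1">abc</item><item id="2">def</item></root>';  % hypothetical sample
numRuns = 100;  % arbitrary number of repetitions

% Method 1: temporary file + xmlread
tic
for k = 1 : numRuns
    filename = [tempname '.xml'];
    fid = fopen(filename,'Wt');
    fwrite(fid,xmlString);
    fclose(fid);
    doc1 = xmlread(filename);
    delete(filename);
end
timeFile = toc / numRuns;

% Method 2: xmlread with an in-memory InputSource
tic
for k = 1 : numRuns
    inputObject = org.xml.sax.InputSource(java.io.StringReader(xmlString));
    doc2 = xmlread(inputObject);
end
timeXmlread = toc / numRuns;

% Method 3: direct use of the Java DOM parser
parserFactory = javaMethod('newInstance','javax.xml.parsers.DocumentBuilderFactory');
tic
for k = 1 : numRuns
    inputObject = org.xml.sax.InputSource(java.io.StringReader(xmlString));
    p = javaMethod('newDocumentBuilder',parserFactory);
    doc3 = p.parse(inputObject);
end
timeDirect = toc / numRuns;

fprintf('temp file: %.2f ms, xmlread(object): %.2f ms, direct parse: %.2f ms\n', ...
        1e3*[timeFile, timeXmlread, timeDirect]);
```

A fresh StringReader is created on every iteration because a reader is consumed by the parse and cannot be reused.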


8 Responses
  1. americanlamboard.com February 14, 2017 at 11:55

    @shsteimer I am passing in xml string and it is returning null. It does not throw any exception. What must be wrong?

  2. Ondrej March 22, 2017 at 09:47

    Hi Yair,
    I am curious about some profiling and how fast each of the methods is.
    Thanks.

  3. Michal Kvasnicka October 9, 2017 at 17:16

    @Yair Undocumented functionality (first code box) does not work at all (R2017b)!!!
    I always got null result. Any workaround?

    • Yair Altman October 11, 2017 at 02:44

      @Michal – when you say that something “does not work at all” you provide zero information about what you tried to do and what happened. If you expect me to spend time to try to help you, then the minimum that you should do is to spend the time to include useful information that could help diagnose the problem. If you don’t have the time to add this information, then I don’t have the time to assist you, sorry.

  4. Peter Cook October 26, 2017 at 21:58

    Yair,

    I have an application I wrote that inhales rather large XML files using xml2struct and the performance is less-than-ideal. I’ve got an idea from a different blog post of yours about faster ASCII file parsing via fread. The idea would be to inhale the xml file in binary mode, and look for ASCII 60 and ASCII 62 and partition the data from that starting point. Do you think this approach would lend itself to faster xml reads?

    • Yair Altman October 26, 2017 at 22:35

      @Peter – in general, yes it would indeed be faster, but you need to take into consideration that XML is much more than simple tags enclosed within < and >. Tags can have attributes, and there can be comments, binary CDATA etc. If your XML is very simple and does not contain such nuisances then text parsing would be simple and faster, but if there is a chance that the XML file contains them then you’d be better off relying on the more general parser for robustness.

  5. KE November 19, 2019 at 14:54

    Once you read your XML information into Matlab, what kind of ‘container’ do you find most useful for handling the data? I have tried nonscalar structure arrays (because that’s what xml2struct produces) but it is hard to extract field data across records, for example to make a plot. Is there a better way?

    • Yair Altman November 19, 2019 at 15:41

      I typically use structs, due to the easy and efficient way that I can aggregate data from different nodes (as long as the nesting is not too deep). For example, [data.age] or {data.name}.

      There are numerous versions of xml2struct on the Matlab File Exchange (link). Depending on your specific needs, you may find that some versions produce cleaner or more compact structs. I use a self-modified version of Wouter Falkena’s utility. My version handles a bunch of edge cases (CDATA entries etc.), and makes the resulting struct cleaner and more compact where possible – it can be downloaded here.
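To illustrate the struct-array aggregation mentioned in the reply above, here is a minimal sketch. The field names and values are invented, and the struct array only resembles what an xml2struct variant might return:

```matlab
% Hypothetical struct array with one element per XML record
data = struct('name', {'Alice','Bob','Carol'}, 'age', {31, 45, 27});

ages  = [data.age];     % numeric row vector of all ages
names = {data.name};    % cell array of all names

% The aggregated vectors can then be sorted, filtered or plotted directly
[~, order] = sort(ages);
sortedNames = names(order);
```

Note that the [data.age] form only makes sense when every record holds a scalar (or concatenable) value in that field. If the values are char arrays of different lengths, the {data.name} cell form is the safer choice.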

Undocumented Matlab © 2009 - Yair Altman
This website and Octahedron Ltd. are not affiliated with The MathWorks Inc.; MATLAB® is a registered trademark of The MathWorks Inc.