Parsing XML strings

I have recently consulted in a project where data was provided in XML strings and needed to be parsed in Matlab memory in an efficient manner (in other words, as quickly as possible). Now granted, XML is rather inefficient in storing data (JSON would be much better for this, for example). But I had to work with the given situation, and that required processing the XML.

I basically had two main alternatives:

  • I could either create a dedicated string-parsing function that searches for a particular pattern within the XML string, or
  • I could use a standard XML-parsing library to create the XML model and then parse its nodes

The first alternative is quite error-prone, since it relies on the exact format of the data in the XML. Since the same data can be represented in multiple equivalent XML ways, making the string-parsing function robust as well as efficient would be challenging. I was lazy expedient, so I chose the second alternative.

Unfortunately, Matlab’s xmlread function only accepts input filenames (of *.xml files), it cannot directly parse XML strings. Yummy!

The obvious and simple solution is to simply write the XML string into a temporary *.xml file, read it with xmlread, and then delete the temp file:

% Store the XML data in a temp *.xml file
filename = [tempname '.xml'];
fid = fopen(filename,'Wt');
fwrite(fid,xmlString);
fclose(fid);
 
% Read the file into an XML model object
xmlTreeObject = xmlread(filename);
 
% Delete the temp file
delete(filename);
 
% Parse the XML model object
...

This works well and we could move on with our short lives. But cases such as this, where a built-in function seems to have a silly limitation, really fire up the investigative reporter in me. I decided to drill into xmlread to discover why it couldn’t parse XML strings directly in memory, without requiring costly file I/O. It turns out that xmlread accepts not just file names as input, but also Java object references (specifically, java.io.File, java.io.InputStream or org.xml.sax.InputSource). In fact, there are quite a few other inputs that we could use, to specify a validation parser etc. – I wrote about this briefly back in 2009 (along with other similar semi-documented input altermatives in xmlwrite and xslt).

In our case, we could simply send xmlread as input a java.io.StringBufferInputStream(xmlString) object (which is an instance of java.io.InputStream) or org.xml.sax.InputSource(java.io.StringReader(xmlString)):

% Read the xml string directly into an XML model object
inputObject = java.io.StringBufferInputStream(xmlString);                % alternative #1
inputObject = org.xml.sax.InputSource(java.io.StringReader(xmlString));  % alternative #2
 
xmlTreeObject = xmlread(inputObject);
 
% Parse the XML model object
...

If we don’t want to depend on undocumented functionality (which might break in some future release, although it has remained unchanged for at least the past decade), and in order to improve performance even further by passing xmlread‘s internal validity checks and processing, we can use xmlread‘s core functionality to parse our XML string directly. We can add a fallback to the standard (fully-documented) functionality, just in case something goes wrong (which is good practice whenever using any undocumented functionality):

try
    % The following avoids the need for file I/O:
    inputObject = java.io.StringBufferInputStream(xmlString);  % or: org.xml.sax.InputSource(java.io.StringReader(xmlString))
    try
        % Parse the input data directly using xmlread's core functionality
        parserFactory = javaMethod('newInstance','javax.xml.parsers.DocumentBuilderFactory');
        p = javaMethod('newDocumentBuilder',parserFactory);
        xmlTreeObject = p.parse(inputObject);
    catch
        % Use xmlread's semi-documented inputObject input feature
        xmlTreeObject = xmlread(inputObject);
    end
catch
    % Fallback to standard xmlread usage, using a temporary XML file:
 
    % Store the XML data in a temp *.xml file
    filename = [tempname '.xml'];
    fid = fopen(filename,'Wt');
    fwrite(fid,xmlString);
    fclose(fid);
 
    % Read the file into an XML model object
    xmlTreeObject = xmlread(filename);
 
    % Delete the temp file
    delete(filename);
end
 
% Parse the XML model object
...
Categories: Low risk of breaking in future versions, Semi-documented feature, Stock Matlab function

Tags: , , ,

Bookmark and SharePrint Print

4 Responses to Parsing XML strings

  1. @shsteimer I am passing in xml string and it is returning null. It does not throw any exception. What must be wrong?

  2. Ondrej says:

    Hi Yair,
    I am curious about some profiling and how fast each of the method is.
    Thanks.

  3. Michal Kvasnicka says:

    @Yair Undocumented functionality (first code box) does not work at all (R2017b)!!!
    I always got null result. Any workaround?

    • @Michal – when you say that something “does not work at all” you provide zero information about what you tried to do and what happened. If you expect me to spend time to try to help you, then the minimum that you should do is to spend the time to include useful information that could help diagnose the problem. If you don’t have the time to add this information, then I don’t have the time to assist you, sorry.

Leave a Reply to americanlamboard.com Cancel reply

Your email address will not be published. Required fields are marked *