I have recently consulted in a project where data was provided in XML strings and needed to be parsed in Matlab memory in an efficient manner (in other words, as quickly as possible). Now granted, XML is rather inefficient in storing data (JSON would be much better for this, for example). But I had to work with the given situation, and that required processing the XML.
I basically had two main alternatives:
- I could either create a dedicated string-parsing function that searches for a particular pattern within the XML string, or
- I could use a standard XML-parsing library to create the XML model and then parse its nodes
The first alternative is quite error-prone, since it relies on the exact format of the data in the XML. Since the same data can be represented in multiple equivalent XML ways, making the string-parsing function robust as well as efficient would be challenging. I was
lazy expedient, so I chose the second alternative.
Unfortunately, Matlab’s xmlread function only accepts input filenames (of *.xml files), it cannot directly parse XML strings. Yummy!
The obvious and simple solution is to simply write the XML string into a temporary *.xml file, read it with xmlread, and then delete the temp file:
% Store the XML data in a temp *.xml file filename = [tempname '.xml']; fid = fopen(filename,'Wt'); fwrite(fid,xmlString); fclose(fid); % Read the file into an XML model object xmlTreeObject = xmlread(filename); % Delete the temp file delete(filename); % Parse the XML model object ...
This works well and we could move on with our short lives. But cases such as this, where a built-in function seems to have a silly limitation, really fire up the investigative reporter in me. I decided to drill into xmlread to discover why it couldn’t parse XML strings directly in memory, without requiring costly file I/O. It turns out that xmlread accepts not just file names as input, but also Java object references (specifically,
org.xml.sax.InputSource). In fact, there are quite a few other inputs that we could use, to specify a validation parser etc. – I wrote about this briefly back in 2009 (along with other similar semi-documented input altermatives in xmlwrite and xslt).
In our case, we could simply send xmlread as input a
java.io.StringBufferInputStream(xmlString) object (which is an instance of
% Read the xml string directly into an XML model object inputObject = java.io.StringBufferInputStream(xmlString); % alternative #1 inputObject = org.xml.sax.InputSource(java.io.StringReader(xmlString)); % alternative #2 xmlTreeObject = xmlread(inputObject); % Parse the XML model object ...
If we don’t want to depend on undocumented functionality (which might break in some future release, although it has remained unchanged for at least the past decade), and in order to improve performance even further by passing xmlread‘s internal validity checks and processing, we can use xmlread‘s core functionality to parse our XML string directly. We can add a fallback to the standard (fully-documented) functionality, just in case something goes wrong (which is good practice whenever using any undocumented functionality):
try % The following avoids the need for file I/O: inputObject = java.io.StringBufferInputStream(xmlString); % or: org.xml.sax.InputSource(java.io.StringReader(xmlString)) try % Parse the input data directly using xmlread's core functionality parserFactory = javaMethod('newInstance','javax.xml.parsers.DocumentBuilderFactory'); p = javaMethod('newDocumentBuilder',parserFactory); xmlTreeObject = p.parse(inputObject); catch % Use xmlread's semi-documented inputObject input feature xmlTreeObject = xmlread(inputObject); end catch % Fallback to standard xmlread usage, using a temporary XML file: % Store the XML data in a temp *.xml file filename = [tempname '.xml']; fid = fopen(filename,'Wt'); fwrite(fid,xmlString); fclose(fid); % Read the file into an XML model object xmlTreeObject = xmlread(filename); % Delete the temp file delete(filename); end % Parse the XML model object ...