I recently consulted on a project where data was provided as XML strings and had to be parsed in Matlab memory as quickly as possible. Granted, XML is a rather inefficient way to store data (JSON, for example, would be much better). But I had to work with the given situation, and that required processing the XML.
I basically had two main alternatives:
- I could either create a dedicated string-parsing function that searches for a particular pattern within the XML string, or
- I could use a standard XML-parsing library to create the XML model and then parse its nodes
The first alternative is quite error-prone, since it relies on the exact format of the data in the XML. Since the same data can be represented in multiple equivalent XML ways, making the string-parsing function robust as well as efficient would be challenging. I was lazy (or rather, expedient), so I chose the second alternative.
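To illustrate why the string-parsing route is fragile, here is a minimal sketch. The sample XML snippets and the `price` tag are made up for illustration and are not taken from the actual project data:

```matlab
% Two equivalent XML encodings of the same (hypothetical) record
xml1 = '<item id="7"><price>12.5</price></item>';
xml2 = sprintf('<item id=''7''>\n  <price><![CDATA[12.5]]></price>\n</item>');

% A naive pattern that was tailored to the first layout
pattern = '<price>([0-9.]+)</price>';

tok1 = regexp(xml1, pattern, 'tokens', 'once');  % finds the value in the first layout
tok2 = regexp(xml2, pattern, 'tokens', 'once');  % finds nothing: the CDATA wrapper defeats the pattern
```

Both strings describe the same data to an XML parser, but the hand-written pattern only recognizes one of them. Every such variation (CDATA, whitespace, attribute quoting, comments) would need its own special handling.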
Unfortunately, Matlab's xmlread function is documented as accepting only input filenames (of *.xml files); it cannot directly parse XML strings. Yummy!
The obvious solution is to write the XML string into a temporary *.xml file, read it with xmlread, and then delete the temp file:
    % Store the XML data in a temp *.xml file
    filename = [tempname '.xml'];
    fid = fopen(filename,'Wt');
    fwrite(fid,xmlString);
    fclose(fid);

    % Read the file into an XML model object
    xmlTreeObject = xmlread(filename);

    % Delete the temp file
    delete(filename);

    % Parse the XML model object
    ...
This works well and we could move on with our short lives. But cases such as this, where a built-in function seems to have a silly limitation, really fire up the investigative reporter in me. I decided to drill into xmlread to discover why it couldn't parse XML strings directly in memory, without requiring costly file I/O. It turns out that xmlread accepts not just file names as input, but also Java object references (specifically, java.io.File, java.io.InputStream or org.xml.sax.InputSource). In fact, there are quite a few other inputs that we could use, to specify a validation parser etc.; I wrote about this briefly back in 2009 (along with other similar semi-documented input alternatives in xmlwrite and xslt).
In our case, we could simply pass xmlread either a java.io.StringBufferInputStream(xmlString) object (which is an instance of java.io.InputStream) or an org.xml.sax.InputSource(java.io.StringReader(xmlString)) object:
    % Read the xml string directly into an XML model object
    inputObject = java.io.StringBufferInputStream(xmlString);  % alternative #1
    inputObject = org.xml.sax.InputSource(java.io.StringReader(xmlString));  % alternative #2
    xmlTreeObject = xmlread(inputObject);

    % Parse the XML model object
    ...
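One caveat: java.io.StringBufferInputStream is deprecated in Java because it keeps only the low 8 bits of each character, so XML that contains non-ASCII text may not survive the trip. A possible third alternative, sketched below, is to encode the string explicitly as UTF-8 bytes and wrap them in a java.io.ByteArrayInputStream. The sample XML is made up, and the sketch assumes that xmlread treats this stream like any other java.io.InputStream:

```matlab
% Hypothetical sample input containing a non-ASCII character (char(233), e-acute)
xmlString = ['<?xml version="1.0" encoding="UTF-8"?><city>Montr' char(233) 'al</city>'];

% Encode the text explicitly as UTF-8 and wrap the bytes in an in-memory stream
utf8Bytes = unicode2native(xmlString, 'UTF-8');         % uint8 row vector
inputObject = java.io.ByteArrayInputStream(utf8Bytes);  % uint8 is passed to Java as byte[]
xmlTreeObject = xmlread(inputObject);

% Note: displaying xmlTreeObject typically prints "[#document: null]".
% That is just the Java toString() of a document node (whose node value is null),
% not a sign that parsing failed. Query the tree to see the actual content:
rootNode = xmlTreeObject.getDocumentElement();
cityName = char(rootNode.getTextContent());             % convert java.lang.String to char
```

The org.xml.sax.InputSource(java.io.StringReader(xmlString)) alternative works on characters rather than bytes, so it should not be affected by this encoding issue.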
If we don't want to depend on undocumented functionality (which might break in some future release, although it has remained unchanged for at least the past decade), and in order to improve performance even further by bypassing xmlread's internal validity checks and processing, we can use xmlread's core functionality to parse our XML string directly. We can add a fallback to the standard (fully-documented) functionality, just in case something goes wrong (which is good practice whenever using any undocumented functionality):
    try
        % The following avoids the need for file I/O:
        inputObject = java.io.StringBufferInputStream(xmlString);  % or: org.xml.sax.InputSource(java.io.StringReader(xmlString))
        try
            % Parse the input data directly using xmlread's core functionality
            parserFactory = javaMethod('newInstance','javax.xml.parsers.DocumentBuilderFactory');
            p = javaMethod('newDocumentBuilder',parserFactory);
            xmlTreeObject = p.parse(inputObject);
        catch
            % Use xmlread's semi-documented inputObject input feature
            xmlTreeObject = xmlread(inputObject);
        end
    catch
        % Fallback to standard xmlread usage, using a temporary XML file:

        % Store the XML data in a temp *.xml file
        filename = [tempname '.xml'];
        fid = fopen(filename,'Wt');
        fwrite(fid,xmlString);
        fclose(fid);

        % Read the file into an XML model object
        xmlTreeObject = xmlread(filename);

        % Delete the temp file
        delete(filename);
    end

    % Parse the XML model object
    ...
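To check how much each approach actually saves on a given system, a rough timing harness along the following lines could be used. This is only a sketch: the test payload and the repetition count are arbitrary assumptions, and the results will depend on the XML size, disk speed and Matlab release:

```matlab
% Hypothetical test payload: 1000 small elements under a single root
xmlString = ['<root>' repmat('<v>1</v>', 1, 1000) '</root>'];
nReps = 20;  % arbitrary repetition count

% Method 1: temporary file + standard xmlread usage
tic
for k = 1 : nReps
    filename = [tempname '.xml'];
    fid = fopen(filename,'Wt');
    fwrite(fid,xmlString);
    fclose(fid);
    doc1 = xmlread(filename);
    delete(filename);
end
tFile = toc / nReps;

% Method 2: xmlread with an in-memory Java input object
tic
for k = 1 : nReps
    inputObject = org.xml.sax.InputSource(java.io.StringReader(xmlString));
    doc2 = xmlread(inputObject);
end
tInput = toc / nReps;

% Method 3: direct use of the Java parser, bypassing xmlread
tic
for k = 1 : nReps
    inputObject = org.xml.sax.InputSource(java.io.StringReader(xmlString));
    parserFactory = javaMethod('newInstance','javax.xml.parsers.DocumentBuilderFactory');
    builder = javaMethod('newDocumentBuilder',parserFactory);
    doc3 = builder.parse(inputObject);
end
tDirect = toc / nReps;

fprintf('temp file: %.2f ms, input object: %.2f ms, direct: %.2f ms\n', ...
        1e3*[tFile, tInput, tDirect]);
```

The first iterations of each loop include Java class-loading overhead, so it may be worth running the script twice and looking at the second set of numbers.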
@shsteimer I am passing in xml string and it is returning null. It does not throw any exception. What must be wrong?
Hi Yair,
I am curious about some profiling and how fast each of the methods is.
Thanks.
@Yair Undocumented functionality (first code box) does not work at all (R2017b)!!!
I always got null result. Any workaround?
@Michal – when you say that something “does not work at all” you provide zero information about what you tried to do and what happened. If you expect me to spend time to try to help you, then the minimum that you should do is to spend the time to include useful information that could help diagnose the problem. If you don’t have the time to add this information, then I don’t have the time to assist you, sorry.
Yair,
I have an application I wrote that inhales rather large XML files using xml2struct and the performance is less-than-ideal. I’ve got an idea from a different blog post of yours about faster ASCII file parsing via fread. The idea would be to inhale the xml file in binary mode, and look for ASCII 60 and ASCII 62 and partition the data from that starting point. Do you think this approach would lend itself to faster xml reads?
@Peter – in general, yes it would indeed be faster, but you need to take into consideration that XML is much more than simple tags enclosed within < and >. Tags can have attributes, and there may be comments, binary CDATA etc. If your XML is very simple and does not contain such nuisances then text parsing would be simple and faster, but if there is a chance that the XML file contains them then you'd be better off relying on the more general parser for robustness.
Once you read your XML information into Matlab, what kind of 'container' do you find most useful for handling the data? I have tried nonscalar structure arrays (because that's what xml2struct produces) but it is hard to extract field data across records, for example to make a plot. Is there a better way?
I typically use structs, due to the easy and efficient way that I can aggregate data from different nodes (as long as the nesting is not too deep). For example, [data.age] or {data.name}.
There are numerous versions of xml2struct on the Matlab File Exchange (link). Depending on your specific needs, you may find that some versions produce cleaner or more compact structs. I use a self-modified version of Wouter Falkena's utility. My version handles a bunch of edge cases (CDATA entries etc.), and makes the resulting struct cleaner and more compact where possible. It can be downloaded here.
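As a minimal sketch of this kind of aggregation, assume a struct array named data with name and age fields. The field names and values below are made up, and the actual layout produced by a given xml2struct version may differ:

```matlab
% Hypothetical 1x3 struct array, similar to what might result from parsing XML records
data = struct('name', {'Ann','Bob','Cy'}, 'age', {31, 27, 45});

ages  = [data.age];    % numeric fields: concatenate into a regular array
names = {data.name};   % text fields: collect into a cell array

% Once aggregated, the values can be filtered (or plotted) like any other array
isOver30    = ages > 30;
namesOver30 = names(isOver30);
```

The bracket form only works when every record holds a value of a compatible size and type in that field; if some records may be missing a value, the curly-brace form is the safer choice.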