I would like to welcome guest blogger Jim Hokanson. Today, Jim will explain some of the limitations that at one time or another many of us have encountered with Matlab’s built-in urlread function. More importantly, Jim has spent a lot of time creating an expanded-capabilities Matlab function, which he explains below. Note that while urlread’s internals are undocumented, the changes outlined below rely on pretty standard Java and HTTP, and should therefore be quite safe to use on multiple Matlab versions.
Abstract
I recently tried to implement the Mendeley API but quickly found that urlread was not going to be sufficient for my needs. The first indication of this was my inability to send the proper authorization information for POST requests. It became even more obvious with the need to perform DELETE and PUT methods, since urlread only supports GET and POST. My implementation of urlread, which I refer to as urlread2, addresses these and a couple of other issues and can be found on the Matlab File Exchange. Other developers have tackled urlread’s shortcomings (one example added timeout support, and another added support for binary file upload). Today’s article will focus on my solution, but others are obviously possible.
Introduction
HTTP is the underlying computer networking protocol that enables us to read webpages on the Internet. It consists of a request made by the user to an Internet server (typically located via URL), and a response from that server. Importantly, the request and response each consist of three main parts: a request line (for requests) or status line (for responses), followed by headers, and optionally a message body.
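To make the three parts concrete, here is a minimal sketch that assembles the text of a simple GET request by hand. The host and path are placeholders chosen purely for illustration, and nothing is actually sent over the network:

```matlab
% Sketch of the three parts of an HTTP/1.1 request message.
% The host and path below are placeholders, not taken from the article.
CRLF = char([13 10]);                         % HTTP lines end with CR LF

requestLine = 'GET /index.html HTTP/1.1';     % part 1: method, resource, protocol version
headers     = {'Host: www.example.com', ...   % part 2: one "Name: value" pair per line
               'Accept: text/html'};
body        = '';                             % part 3: optional body (empty for a typical GET)

headerBlock = sprintf(['%s' CRLF], headers{:});
request     = [requestLine CRLF headerBlock CRLF body];   % a blank line ends the headers
disp(request)
```

A response has the same layout, except that the first line is a status line such as HTTP/1.1 200 OK.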
Matlab’s built-in urlread function enables Matlab users to easily read the server’s response text into a Matlab string:
text = urlread('http://www.google.com');
This is done internally using Java code that connects to the specified URL and reads the information sent by the URL’s server.
urlread accepts optional additional inputs specifying the request type (‘get’ or ‘post’) and parameter values for the request.
Unfortunately, urlread has the following limitations:
- It does not allow specification of request headers
- It makes assumptions as to the request headers needed based on the input method
- It does not expose the response headers and status line
- It assumes the response body contains text, and not a binary payload
- It does not enable uploading binary contents to the server
- It does not enable specifying a timeout in case the server is not responding
urlread2
The urlread2 function addresses all of these problems. The overall design decision for this function was to make it more general: in some cases it requires more work up front, but in exchange it offers more flexibility.
For reference, the following is the calling format for urlread2 (which is reminiscent of urlread’s):
urlread2(url,*method,*body,*headersIn, varargin)
The asterisks mark optional inputs. These are positional, so their order must be preserved when later inputs are specified.
- url – (string), url to request
- method – (string, default GET) HTTP request method
- body – (string, default ''), body of the request
- headersIn – (structure, default []), see the following section
- varargin – extra properties that need to be specified via property/pair values
Addressing Problem 1 – Request header
urlread internally uses a Java object called urlConnection that is generally an instance of the class sun.net.www.protocol.http.HttpURLConnection. The method setRequestProperty() can be used to set headers for the request. This method has two inputs, the header name and the value of that header. A simple example of this can be seen below:
urlConnection.setRequestProperty('Content-Type','application/x-www-form-urlencoded');
Here ‘Content-Type’ is the header name and the second input is the value of that property. My function requires passing in nearly all headers as a structure array, with fields for the name and value. The preceding header would be created using a helper function http_createHeader.m:
header = http_createHeader('Content-Type','application/x-www-form-urlencoded');
Multiple headers can be passed in to the function by concatenating header structures into a structure array.
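The idea can be sketched without urlread2 itself. The snippet below assumes that each header structure has the fields name and value (an assumption about the layout, made for illustration), concatenates two of them into a structure array, and applies them to a Java connection object. Opening the connection object does not contact the server, and the URL is a placeholder:

```matlab
% Two header structures; the field names 'name' and 'value' are assumed here.
h1 = struct('name','Content-Type', 'value','application/x-www-form-urlencoded');
h2 = struct('name','Accept',       'value','text/html');
headers = [h1 h2];                              % concatenate into a 1x2 structure array

url = java.net.URL('http://www.example.com/');  % placeholder URL
urlConnection = url.openConnection();           % creates the object; no request is sent yet

% Apply every header to the pending request
for k = 1:numel(headers)
    urlConnection.setRequestProperty(headers(k).name, headers(k).value);
end

% Read one header back to confirm it was stored
acceptValue = char(urlConnection.getRequestProperty('Accept'));
```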
Addressing Problem 2 – Request parameters
When making a POST request, parameters are generally specified in the message body using the following format:
[property]=[value]&[property]=[value]
The properties and values are also encoded in a particular way, generally termed urlencoded (encoding and decoding can be done using Matlab’s built-in urlencode and urldecode functions). For GET requests this string is appended to the url with the “?” symbol. Since urlencoding methods can vary, and in the spirit of reducing assumptions, I use separate functions to generate these strings outside of urlread2, and then pass the result in either as the url (for GET) or as the body input (for POST). As an example, I might search the Mathworks website using the upper right search bar on its site for “undocumented matlab” under file exchange (hmmm… pretty cute stuff there!). Doing this performs a GET request with the following property/value pairs:
params = {'search_submit','fileexchange', 'term','undocumented matlab', 'query','undocumented matlab'};
These property/value pairs are somewhat obvious from looking at the URL, but could also be determined by using programs such as Fiddler, Firebug, or HttpWatch.
After urlencoding and concatenating, we would form the following string:
search_submit=fileexchange&term=undocumented+matlab&query=undocumented+matlab
This functionality is normally accomplished internally in urlread, but I use a function http_paramsToString to produce that result. That function also returns the required header for POST requests. The following is an example of both GET and POST requests:
[queryString,header] = http_paramsToString(params,1);
% For GET:
url = [url '?' queryString];
urlread2(url)
% For POST:
urlread2(url,'POST',queryString,header)
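The encoding and concatenation step can be sketched using only the built-in urlencode function. This is an illustrative stand-in for what a helper such as http_paramsToString might do, not its actual implementation:

```matlab
% Property/value pairs from the search example above
params = {'search_submit','fileexchange', 'term','undocumented matlab', 'query','undocumented matlab'};

nPairs = numel(params)/2;
pairs  = cell(1,nPairs);
for k = 1:nPairs
    name     = urlencode(params{2*k-1});   % encode the property name
    value    = urlencode(params{2*k});     % encode the value (spaces become '+')
    pairs{k} = [name '=' value];
end

% Join the pairs with '&' and drop the trailing separator
queryString = sprintf('%s&', pairs{:});
queryString(end) = [];
disp(queryString)
```

The resulting string would be appended to the URL after a “?” for a GET request, or passed as the body for a POST request.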
Addressing Problem 3 – Response header
According to the HTTP protocol, each server response starts with a status line that indicates a numeric response status. The following Matlab code provides access to the status line using the urlConnection object:
status = struct('value',urlConnection.getResponseCode(), 'msg',char(urlConnection.getResponseMessage))
status =
    value: 200
      msg: 'OK'
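For reference, the status line itself is plain text of the form HTTP/1.1 200 OK. The sketch below splits such a line into the same value/msg structure. It uses a hard-coded sample line rather than a live connection, so it only illustrates the format:

```matlab
statusLine = 'HTTP/1.1 200 OK';   % sample status line, hard-coded for illustration

% Capture the three-digit status code and the reason phrase that follows it
tok = regexp(statusLine, '^HTTP/\d\.\d (\d{3}) ?(.*)$', 'tokens', 'once');

status = struct('value', str2double(tok{1}), 'msg', tok{2});
```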
urlConnection’s getHeaderField() and getHeaderFieldKey() methods enable reading the specific parts of the response header:
headerValue = char(urlConnection.getHeaderField(headerIndex));
headerName = char(urlConnection.getHeaderFieldKey(headerIndex));
headerIndex starts at 0 and increases by 1 until both headerValue and headerName return empty.
It is important to note that header keys (names) can be repeated for different values. Sometimes this is desired, such as if there are multiple cookies being sent to the user. To generically handle this case, two header structures are returned. In both cases the header names are the field names in the structure, after replacing hyphens with underscores. In one case, allHeaders, the values are cell arrays of strings containing all values presented with the particular key. The other structure, firstHeaders, contains only the first instance of the header as a string to avoid needing to dereference a cell array.
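The grouping of repeated header names can be sketched as follows. The header names and values are made-up sample data rather than real server output, and the code is only an approximation of how the two structures could be built:

```matlab
% Made-up sample headers; note the repeated Set-Cookie name
names  = {'Content-Type', 'Set-Cookie', 'Set-Cookie'};
values = {'text/html',    'a=1',        'b=2'};

allHeaders   = struct();
firstHeaders = struct();
for k = 1:numel(names)
    field = strrep(names{k}, '-', '_');        % hyphens are not valid in field names
    if isfield(allHeaders, field)
        allHeaders.(field){end+1} = values{k}; % repeated name: append to the cell array
    else
        allHeaders.(field)   = values(k);      % first occurrence, stored as a 1x1 cell
        firstHeaders.(field) = values{k};      % first occurrence, stored as a plain string
    end
end
```

With this layout, firstHeaders.Content_Type can be used directly as a string, while allHeaders.Set_Cookie holds every cookie value that was sent.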
Addressing Problem 4 – Response body
urlread assumes text output. This is fine for most webpages, which use HTML and are therefore text-based. However, urlread fails when trying to download any non-text resource such as an image, a ZIP file, or a PDF document. I have added a flag in urlread2 called CAST_OUTPUT, which defaults to true, i.e. text response, just as urlread assumes. Using varargin, this flag can be set to false ({‘CAST_OUTPUT’,false}) to indicate a binary response.
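A small offline experiment shows why casting to text is a problem for binary data. It assumes that the text path decodes the received bytes as UTF-8, which is an assumption about the conversion and not something taken from urlread2’s code. The bytes used are the standard 8-byte PNG file signature:

```matlab
bytes = uint8([137 80 78 71 13 10 26 10]);    % PNG file signature

% Text path: decode as UTF-8 (assumed conversion), then encode again
asText     = native2unicode(bytes, 'UTF-8');
roundTrip  = unicode2native(asText, 'UTF-8');
isLossless = isequal(roundTrip, bytes);       % false: byte 137 is not valid UTF-8 on its own

% Binary path: keep the raw bytes and write them to a file unchanged
outFile = [tempname '.png'];
fid = fopen(outFile, 'w');
fwrite(fid, bytes, 'uint8');
fclose(fid);
```

With urlread2 on the path, a binary download would presumably look something like urlread2(url,'GET','',[],'CAST_OUTPUT',false), with the returned bytes then written to a file as above.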
Summary
urlread2’s functionality has been expanded to also address other limitations of urlread: It enables binary inputs, better character-set handling of the output, redirection following, and read timeouts.
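As an illustration of the timeout mechanism at the Java level, the sketch below sets both a connect timeout and a read timeout on a connection object. It is not taken from urlread2’s code, the URL is a placeholder, and no request is sent. Both values are in milliseconds, and a value of 0 means no timeout:

```matlab
url = java.net.URL('http://www.example.com/');  % placeholder URL
urlConnection = url.openConnection();           % creates the object; no request is sent yet

urlConnection.setConnectTimeout(5000);    % give up if no connection is made within 5 s
urlConnection.setReadTimeout(10000);      % give up if no data arrives for 10 s once connected

connectTimeout = urlConnection.getConnectTimeout();
readTimeout    = urlConnection.getReadTimeout();
```

The two timeouts are independent: a read timeout does not limit how long the initial connection attempt may take.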
The modifications described above provide direct access to the key components of the HTTP request and response messages. Its more generic nature lets urlread2 focus on HTTP transmission, and leaves request formation and response interpretation up to the user. I think ultimately this approach is better than providing one-off modifications of the original urlread function to suit a particular need. urlread2 and supporting files can be found on the Matlab File Exchange.
Thanks for this!
I am trying to use urlread to scrape a website for stats
eg.
http://www.footywire.com/afl/footy/ft_match_statistics?mid=5343
which works with a browser but not with urlread which returns
I figure I can use urlread2 to do the refresh for me somehow, but as I am fairly inexperienced with scraping I don’t know exactly how to do this.
I imagine it would be fairly easy.
Can you give me a tip please?
I found it much easier to create a urlreadbin.m file based on urlread with a single line change.
Edit urlread.m in the iofun directory using “open urlread.m” and change the following:
Save as urlreadbin.m in the iofun directory.
This eliminates the bothersome unicode conversion.
This is only for reads but grabs jpg files without an issue.
Hi, I’m trying to read an image using urlread2.
The image is at http://192.168.1.20/snapshot.cgi. How can I do that? Or how can I do this with imread?
Thanks!
How can I use urlread2 with basic authentication?
Thanks
Can I ask if you solved the problem of using urlread2 to do authentication? Thank you.
Hello, is the timeout property working?
I have Matlab R2011b but cannot make it work.
My query: urlread2('http://www.google.com', 'GET', '', [], 'READ_TIMEOUT', 100);
I tried 1, 100, 1000 as I wasn’t sure of the unit, but all of them time out in ~20 seconds.
Adding a value for setConnectTimeout fixed it.
Is there a way to make the function work for local files/machines?
I am trying to find a way to check the connection to a local server faster than doing exist(path, 'dir'). This function has a 20s timeout.
I’m trying to post some data to a local web server, but I could not get any result. Could you help me identify the problem in my code?
@ibrahim – this is not a general-purpose blog but one that is devoted to undocumented and advanced aspects of Matlab. With regular Matlab questions you will find more luck at the Matlab answers forum.
I would try checking if the web-page is accessible (and accepts POST queries) using a regular browser, before trying in Matlab’s urlread. It’s a localhost server so maybe you forgot to turn it on. Also, maybe it works via cookies, which urlread does not support.
I was using urlread2 for the first time yesterday and encountered a very strange error. The script was very simple – containing just one line calling the urlread2 function. The script froze during execution, and Matlab produced errors relating to the “rmdir” command. When I checked the folder I was working in, all files and folders except the currently running script file had been deleted. Can you offer any explanation as to why this might have happened? Please see the Matlab output below:
“Operation terminated by user during urlread2 (line 199)
In TestDatabaseRead (line 3)
str =
urlread2(‘http://localhost:13595/I/GetLocalData’);
Error using rmdir”
Just an update to my previous post – I have tried running the same code on another machine and had no issues at all. I have no idea what happened, but it must have been my own fault somehow. The GET functionality is so much faster using this than the urlread method supplied with Matlab. Jim was also fantastic, replying to my e-mail query right away.
Hi, I’m trying to use the urlread2 function for sending PUT requests. I’m new to web services, so I need a little help.
I want to do this
What do I have to do in Matlab to perform this action?
Hello,
I am having performance issues with urlread and wonder whether urlread2 would be faster. A simple 35000 character webpage
A=urlread('http://www.mathworks.com');
takes about 700ms to read and this seems to be rather matlab/hardware/internet connection independent as other people have gotten similar results. I posted my original question here:
http://www.mathworks.nl/matlabcentral/answers/144600-how-can-i-make-urlread-faster-problem-using-urlread2
So I am wondering if there are any ways to speed up urlread? Or is urlread2 significantly faster? Unfortunately I could not get urlread2 working in my old Matlab 7. It gave errors as explained in the linked question.
I feel that performance should be much better. A simple Autohotkey command UrlDownloadToFile does this much faster but it is painful to make different programs operate together.
Any help is greatly appreciated. Thank you for any help in advance!
Hi, how can I create a persistent http session in MATLAB? I first do a login to the website homepage which requires password to open it. And then in order to manipulate other pages I could change the URLs at my choice but staying logged in! Thanks and please comment.