Expanding urlread capabilities

I would like to welcome guest blogger Jim Hokanson. Today, Jim will explain some of the limitations that many of us have encountered at one time or another with Matlab’s built-in urlread function. More importantly, Jim has spent a lot of time creating an expanded-capabilities Matlab function, which he explains below. Note that while urlread’s internals are undocumented, the changes outlined below rely on pretty standard Java and HTTP, and should therefore be quite safe to use on multiple Matlab versions.

Abstract

I recently tried to implement the Mendeley API but quickly found that urlread was not going to be sufficient for my needs. The first indication of this was my inability to send the proper authorization information for POST requests. It became even more obvious with the need to perform DELETE and PUT methods, since urlread only supports GET and POST. My implementation of urlread, which I refer to as urlread2, addresses these and a couple of other issues and can be found on the Matlab File Exchange. Other developers have tackled urlread’s shortcomings (this example added timeout support, and this example added support for binary file upload). Today’s article will focus on my solution, but others are obviously possible.

Introduction

HTTP is the underlying computer networking protocol that enables us to read webpages on the Internet. It consists of a request made by the user to an Internet server (typically located via URL), and a response from that server. Importantly, the request and the response each consist of three main parts: a request line (for requests) or status line (for responses), followed by headers, and optionally a message body.

Matlab’s built-in urlread function enables Matlab users to easily read the server’s response text into a Matlab string:

text = urlread('http://www.google.com');

This is done internally using Java code that connects to the specified URL and reads the information sent by the URL’s server.

urlread accepts optional additional inputs specifying the request type (‘get’ or ‘post’) and parameter values for the request.

Unfortunately, urlread has the following limitations:

  1. It does not allow specification of request headers
  2. It makes assumptions as to the request headers needed based on the input method
  3. It does not expose the response headers and status line
  4. It assumes the response body contains text, and not a binary payload
  5. It does not enable uploading binary contents to the server
  6. It does not enable specifying a timeout in case the server is not responding

urlread2

The urlread2 function addresses all of these problems. The overall design decision for this function was to make it more general: some cases require more work up front, but in return the function offers more flexibility.

For reference, the following is the calling format for urlread2 (which is reminiscent of urlread‘s):

urlread2(url,*method,*body,*headersIn, varargin)

The asterisks mark optional inputs that must keep their position in the argument list.

  • url – (string), url to request
  • method – (string, default GET) HTTP request method
  • body – (string, default ''), body of the request
  • headersIn – (structure, default []), see the following section
  • varargin – extra properties that need to be specified via property/value pairs

Addressing Problem 1 – Request header

urlread internally uses a Java object called urlConnection that is generally an instance of the class sun.net.www.protocol.http.HttpURLConnection. The method setRequestProperty() can be used to set headers for the request. This method has two inputs, the header name and the value of that header. A simple example of this can be seen below:

urlConnection.setRequestProperty('Content-Type','application/x-www-form-urlencoded');

Here ‘Content-Type’ is the header name and the second input is the value of that property. My function requires passing in nearly all headers as a structure array, with fields for the name and value. The preceding header would be created using a helper function http_createHeader.m:

header = http_createHeader('Content-Type','application/x-www-form-urlencoded');

Multiple headers can be passed in to the function by concatenating header structures into a structure array.
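
A minimal sketch of how such a structure array might be built and applied, assuming the fields are called name and value (the actual field names produced by http_createHeader may differ). The URL and the Accept header are placeholders chosen for illustration. Opening the connection object does not contact the server, so this only configures the request:

```matlab
% Two header structures; the field names 'name' and 'value' are an assumption
h1 = struct('name','Content-Type', 'value','application/x-www-form-urlencoded');
h2 = struct('name','Accept',       'value','application/json');

% Concatenate them into a 1x2 structure array
headers = [h1 h2];

% Apply every header to a Java connection object (placeholder URL, no traffic yet)
url = java.net.URL('http://www.example.com');
urlConnection = url.openConnection();
for k = 1:numel(headers)
    urlConnection.setRequestProperty(headers(k).name, headers(k).value);
end

% Read one of the request headers back to confirm that it was stored
acceptValue = char(urlConnection.getRequestProperty('Accept'));
```

Keeping the headers in a plain structure array means they can be assembled, inspected and reused before any request is sent.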

Addressing Problem 2 – Request parameters

When making a POST request, parameters are generally specified in the message body using the following format:

[property]=[value]&[property]=[value]

The properties and values are also encoded in a particular way, generally termed urlencoded (encoding and decoding can be done using Matlab’s built-in urlencode and urldecode functions). For GET requests this string is appended to the URL after a “?” symbol. Since urlencoding methods can vary, and in the spirit of reducing assumptions, I use separate functions to generate these strings outside of urlread2, and then pass the result in either as part of the url (for GET) or as the body input (for POST). As an example, I might search the MathWorks website using the upper right search bar on its site for “undocumented matlab” under file exchange (hmmm… pretty cute stuff there!). Doing this performs a GET request with the following property/value pairs:

params = {'search_submit','fileexchange', 'term','undocumented matlab', 'query','undocumented matlab'};

These property/value pairs are somewhat obvious from looking at the URL, but could also be determined by using programs such as Fiddler, Firebug, or HttpWatch.

After urlencoding and concatenating, we would form the following string:

search_submit=fileexchange&term=undocumented+matlab&query=undocumented+matlab

This functionality is normally accomplished internally in urlread, but I use a function http_paramsToString to produce that result. That function also returns the required header for POST requests. The following is an example of both GET and POST requests:

[queryString,header] = http_paramsToString(params,1);
 
% For GET:
url = [url '?' queryString];
urlread2(url)
 
% For POST:
urlread2(url,'POST',queryString,header)
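
As a rough sketch of the encoding and concatenation step (this is not the source of http_paramsToString, just one way to produce the same string with Matlab's built-in urlencode):

```matlab
% Property/value pairs from the search example above
params = {'search_submit','fileexchange', 'term','undocumented matlab', 'query','undocumented matlab'};

names  = params(1:2:end);   % odd positions hold the property names
values = params(2:2:end);   % even positions hold the values

% Encode each name and value, then join them as name=value
pairs = cell(1, numel(names));
for k = 1:numel(names)
    pairs{k} = [urlencode(names{k}) '=' urlencode(values{k})];
end

% Join the pairs with '&' and drop the trailing separator
queryString = sprintf('%s&', pairs{:});
queryString(end) = [];
```

Here urlencode turns each space into a "+", which is how the two-word search term ends up as undocumented+matlab.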

Addressing Problem 3 – Response header

According to the HTTP protocol, each server response starts with a status line that contains a numeric status code and a short message. The following Matlab code provides access to the status line using the urlConnection object:

status = struct('value',urlConnection.getResponseCode(), 'msg',char(urlConnection.getResponseMessage))
status = 
    value: 200
      msg: 'OK'

urlConnection’s getHeaderField() and getHeaderFieldKey() methods enable reading the specific parts of the response header:

headerValue = char(urlConnection.getHeaderField(headerIndex));
headerName  = char(urlConnection.getHeaderFieldKey(headerIndex));

headerIndex starts at 0 and increases by 1 until both headerValue and headerName return empty.

It is important to note that header keys (names) can be repeated for different values. Sometimes this is desired, such as if there are multiple cookies being sent to the user. To generically handle this case, two header structures are returned. In both cases the header names are the field names in the structure, after replacing hyphens with underscores. In one case, allHeaders, the values are cell arrays of strings containing all values presented with the particular key. The other structure, firstHeaders, contains only the first instance of the header as a string to avoid needing to dereference a cell array.
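
The grouping described above can be sketched as follows. The names and values are made-up sample data standing in for the pairs read from a real response, and the loop is only an illustration of the idea, not the code used inside urlread2:

```matlab
% Made-up sample data; a real response would supply these name/value pairs
names  = {'Content-Type', 'Set-Cookie', 'Set-Cookie'};
values = {'text/html',    'a=1',        'b=2'};

allHeaders   = struct();   % every value per header name, as cell arrays
firstHeaders = struct();   % only the first value per header name, as strings

for k = 1:numel(names)
    % Hyphens are not allowed in field names, so replace them with underscores
    field = strrep(names{k}, '-', '_');
    if isfield(allHeaders, field)
        % Repeated header: append to the existing cell array
        allHeaders.(field){end+1} = values{k};
    else
        % First occurrence: start a cell array and record the first value
        allHeaders.(field)   = values(k);
        firstHeaders.(field) = values{k};
    end
end
```

With this sample data, allHeaders.Set_Cookie holds both cookie strings, while firstHeaders.Set_Cookie holds only 'a=1'.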

Addressing Problem 4 – Response body

urlread assumes text output. This is fine for most webpages, which use HTML and are therefore text-based. However, urlread fails when trying to download any non-text resource such as an image, a ZIP file, or a PDF document. I have added a flag in urlread2 called CAST_OUTPUT, which defaults to true, i.e. text response, just as urlread assumes. Using varargin, this flag can be set to false ({'CAST_OUTPUT',false}) to indicate a binary response.
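
To illustrate the difference between the two modes, here is a small sketch that uses hard-coded bytes in place of a downloaded payload. It assumes that the text mode amounts to a UTF-8 native2unicode conversion and that the binary mode simply keeps the raw bytes; the actual handling inside urlread2 may differ:

```matlab
% Java returns bytes to Matlab as int8; these values stand in for a download
imageBytes = int8([-119 80 78 71]);       % first four bytes of a PNG file
textBytes  = int8([72 101 108 108 111]);  % the ASCII codes for 'Hello'

% Binary mode: reinterpret the signed bytes as uint8 and leave them untouched
binaryOut = typecast(imageBytes, 'uint8');

% Text mode: reinterpret as uint8, then decode the bytes as UTF-8 characters
textOut = native2unicode(typecast(textBytes, 'uint8'), 'UTF-8');
```

The uint8 result can be written to disk unchanged with fwrite, whereas decoding arbitrary binary data as characters is not guaranteed to preserve the original bytes.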

Summary

urlread2’s functionality has been expanded to also address other limitations of urlread: it enables binary inputs, better character-set handling of the output, redirection following, and read timeouts.

The modifications described above provide direct access to the key components of the HTTP request and response messages. Its more generic nature lets urlread2 focus on HTTP transmission, and leaves request formation and response interpretation up to the user. I think ultimately this approach is better than providing one-off modifications of the original urlread function to suit a particular need. urlread2 and supporting files can be found on the Matlab File Exchange.

12 Responses to Expanding urlread capabilities

  1. Andrew Walsh says:

    Thanks for this!
    I am trying to use urlread to scrape a website for stats

    eg.
    http://www.footywire.com/afl/footy/ft_match_statistics?mid=5343

    which works with a browser but not with urlread which returns

    I figure I can use urlread2 to do the refresh for me somehow, but as I am fairly inexperienced with scraping I don’t know exactly how to do this.
    I imagine it would be fairly easy.
    Can you give me a tip please?

  2. Richard says:

    I found it much easier to create a urlreadbin.m file based on urlread with a single line change.
    Edit urlread.m in the iofun directory using “open urlread.m” and change the following:

    %output = native2unicode(typecast(byteArrayOutputStream.toByteArray','uint8'),'UTF-8');
    output = typecast(byteArrayOutputStream.toByteArray','uint8'); %052012 oglraz

    Save as urlreadbin.m in the iofun directory.

    This eliminates the bothersome unicode conversion.

    This is only for reads but grabs jpg files without an issue.

  3. Alejo says:

    Hi, I’m trying to read an image using urlread2.
    The image is at http://192.168.1.20/snapshot.cgi. How can I do that? Or how can I do this with imread?
    Thanks!

  4. cglopez says:

    How can I use urlread2 with basic authentication?
    Thanks

  5. Raphael says:

    Hello, is the timeout property working ?

    I have matlab r2011b but cannot make it work.
    My query: urlread2('http://www.google.com', 'GET', '', [], 'READ_TIMEOUT', 100);

    I tried 1, 100, 1000 as I wasn’t sure of the unit, but all of them time out in ~20seconds.

  6. Raphael says:

    Is there a way to make the function work for local files/machines?

    I am trying to find a way to check the connection to a local server faster than doing exist(path, 'dir'). This function has a 20s timeout.

  7. ibrahim says:

    I’m trying to post some data to a local web server, but I could not get any result. Could you help me identify the problem in my code?

    clear
    clc
    uname='ibrahim';
    email='test@tes.com';
    reemail= email;
    pswd='1234';
    confirm= pswd;
    counter=1;
    submit='submit';
     
    while (counter < 2)
        username=[uname,num2str(counter)];
        password=pswd;
        params = {username,password,email,reemail,confirm,'submit'};
        s=urlread('http://localhost/new_sdas/php_code/register.php','POST',params)
        counter=counter+1;
    end
    • Yair Altman says:

      @ibrahim – this is not a general-purpose blog but one that is devoted to undocumented and advanced aspects of Matlab. For regular Matlab questions you will have more luck at the Matlab answers forum.

      I would try checking if the web-page is accessible (and accepts POST queries) using a regular browser, before trying in Matlab’s urlread. It’s a localhost server so maybe you forgot to turn it on. Also, maybe it works via cookies, which urlread does not support.

  8. J Syn says:

    I was using urlread2 for the first time yesterday and encountered a very strange error. The script was very simple – containing just one line calling the urlread2 function. The script froze during execution, and Matlab produced errors relating to the “rmdir” command. When I checked the folder I was working in, all files and folders except the currently running script file had been deleted. Can you offer any explanation as to why this might have happened? Please see the Matlab output below:

    “Operation terminated by user during urlread2 (line 199)
    In TestDatabaseRead (line 3)
    str = urlread2('http://localhost:13595/I/GetLocalData');
    Error using rmdir”

    • J Syn says:

      Just an update to my previous post – I have tried running the same code on another machine and had no issues at all. I have no idea what happened, but it must have been my own fault somehow. The GET functionality is so much faster using this than the urlread method supplied with Matlab. Jim was also fantastic, replying to my e-mail query right away.

  9. Tjarko says:

    Hi, I’m trying to use the urlread2 function for sending PUT requests. I’m new to web services, so I need a little help.

    I want to do this

    PUT http://[DEVICE_IP]:7979/rest/devices/battery/C23 HTTP/1.1
    Host: [IP]
    Content-type: text/html
    Content-length: 4
    1000

    What do I have to do in Matlab to perform this action?
