Expanding urlread capabilities

I would like to welcome guest blogger Jim Hokanson. Today, Jim will explain some of the limitations that many of us have encountered at one time or another with Matlab’s built-in urlread function. More importantly, Jim has spent a lot of time creating an expanded-capabilities Matlab function, which he explains below. Note that while urlread’s internals are undocumented, the changes outlined below rely on pretty standard Java and HTTP, and should therefore be quite safe to use on multiple Matlab versions.

Abstract

I recently tried to implement the Mendeley API but quickly found that urlread was not going to be sufficient for my needs. The first indication of this was my inability to send the proper authorization information for POST requests. It became even more obvious with the need to perform DELETE and PUT methods, since urlread only supports GET and POST. My implementation of urlread, which I refer to as urlread2, addresses these and a couple of other issues and can be found on the Matlab File Exchange. Other developers have tackled urlread’s shortcomings (one example added timeout support, and another added support for binary file upload). Today’s article focuses on my solution, but others are obviously possible.

Introduction

HTTP is the underlying computer networking protocol that enables us to read webpages on the Internet. It consists of a request made by the user to an Internet server (typically located via URL), and a response from that server. Importantly, the request and response consist of three main parts: a resource line (for requests) or status line (for responses), followed by headers, and optionally a message body.
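To make the three-part structure concrete, the following sketch assembles the text of a POST request by hand. The host, path, and parameter are made up for illustration, and real requests usually carry more headers than shown here.

```matlab
% Illustrative only: the host, path and body below are made-up examples.
crlf = sprintf('\r\n');                      % HTTP separates lines with CR+LF

requestLine = 'POST /search HTTP/1.1';       % 1. resource line: method, resource, version
body        = 'term=undocumented+matlab';    % 3. optional message body
headers     = {'Host: www.example.com', ...  % 2. headers, one "Name: value" per line
               'Content-Type: application/x-www-form-urlencoded', ...
               sprintf('Content-Length: %d', numel(body))};

% Each header is followed by CR+LF; one extra empty line
% then separates the headers from the body.
headerText = sprintf(['%s' crlf], headers{:});
message    = [requestLine crlf headerText crlf body];
disp(message);
```

A response has the same shape, except that its first line is a status line (for example, a version followed by a numeric code and a short message) rather than a resource line.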

Matlab’s built-in urlread function enables Matlab users to easily read the server’s response text into a Matlab string:

text = urlread('http://www.google.com');

This is done internally using Java code that connects to the specified URL and reads the information sent by the URL’s server.

urlread accepts optional additional inputs specifying the request type (‘get’ or ‘post’) and parameter values for the request.

Unfortunately, urlread has the following limitations:

  1. It does not allow specification of request headers
  2. It makes assumptions as to the request headers needed based on the input method
  3. It does not expose the response headers and status line
  4. It assumes the response body contains text, and not a binary payload
  5. It does not enable uploading binary contents to the server
  6. It does not enable specifying a timeout in case the server is not responding

urlread2

The urlread2 function addresses all of these problems. The overall design decision for this function was to make it more general, requiring more work up front to use in some cases, but more flexibility.

For reference, the following is the calling format for urlread2 (which is reminiscent of urlread’s):

urlread2(url,*method,*body,*headersIn, varargin)

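The * marks optional inputs. These are positional: to specify a later input, all the inputs before it must also be passed (possibly as empty values) so that the order is maintained.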

  • url – (string), url to request
  • method – (string, default GET) HTTP request method
  • body – (string, default ''), body of the request
  • headersIn – (structure, default []), see the following section
  • varargin – extra properties that need to be specified via property/value pairs

Addressing Problem 1 – Request header

urlread internally uses a Java object called urlConnection that is generally an instance of the class sun.net.www.protocol.http.HttpURLConnection. The method setRequestProperty() can be used to set headers for the request. This method has two inputs, the header name and the value of that header. A simple example of this can be seen below:

urlConnection.setRequestProperty('Content-Type','application/x-www-form-urlencoded');

Here ‘Content-Type’ is the header name and the second input is the value of that property. My function requires passing in nearly all headers as a structure array, with fields for the name and value. The preceding header would be created using a helper function http_createHeader.m:

header = http_createHeader('Content-Type','application/x-www-form-urlencoded');

Multiple headers can be passed in to the function by concatenating header structures into a structure array.
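As a rough sketch of the idea, the snippet below builds a two-element header structure array and applies it to a connection object with setRequestProperty(). Here makeHeader is a hypothetical stand-in for http_createHeader (whose actual implementation may differ), and the URL and the second header are placeholders chosen for illustration.

```matlab
% Hypothetical stand-in for http_createHeader: a struct with name/value fields.
% (The real helper's implementation may differ.)
makeHeader = @(name, value) struct('name', name, 'value', value);

% Concatenating header structs yields a structure array
headers = [makeHeader('Content-Type', 'application/x-www-form-urlencoded'), ...
           makeHeader('Accept', 'application/json')];

% Creating the connection object does not contact the server yet,
% so request headers can still be set at this point.
url = java.net.URL('http://www.example.com');   % placeholder URL
urlConnection = url.openConnection();

for k = 1:numel(headers)
    urlConnection.setRequestProperty(headers(k).name, headers(k).value);
end

% Read one header back to confirm that it was stored
contentType = char(urlConnection.getRequestProperty('Content-Type'));
```

Keeping each header as a name/value pair in a structure array means that any number of headers can be applied with the same short loop.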

Addressing Problem 2 – Request parameters

When making a POST request, parameters are generally specified in the message body using the following format:

[property]=[value]&[property]=[value]

The properties and values are also encoded in a particular way, generally termed urlencoded (encoding and decoding can be done using Matlab’s built-in urlencode and urldecode functions). For GET requests this string is appended to the url with the “?” symbol. Since urlencoding methods can vary, and in the spirit of reducing assumptions, I use separate functions to generate these strings outside of urlread2, and then pass the result in either as the url (for GET) or as the body input (for POST). As an example, I might search the Mathworks website using the upper right search bar on its site for “undocumented matlab” under file exchange (hmmm… pretty cute stuff there!). Doing this performs a GET request with the following property/value pairs:

params = {'search_submit','fileexchange', 'term','undocumented matlab', 'query','undocumented matlab'};

These property/value pairs are somewhat obvious from looking at the URL, but could also be determined by using programs such as Fiddler, Firebug, or HttpWatch.

After urlencoding and concatenating, we would form the following string:

search_submit=fileexchange&term=undocumented+matlab&query=undocumented+matlab

This functionality is normally accomplished internally in urlread, but I use a function http_paramsToString to produce that result. That function also returns the required header for POST requests. The following is an example of both GET and POST requests:

[queryString,header] = http_paramsToString(params,1);
 
% For GET:
url = [url '?' queryString];
urlread2(url)
 
% For POST:
urlread2(url,'POST',queryString,header)
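The string-building step described above can be sketched with Matlab’s built-in urlencode function. This is only a minimal illustration of the encoding and concatenation, not the actual code of http_paramsToString, which also returns the header needed for POST requests.

```matlab
% Property/value pairs, as in the search example above
params = {'search_submit','fileexchange', 'term','undocumented matlab', 'query','undocumented matlab'};

names  = params(1:2:end);   % odd elements are property names
values = params(2:2:end);   % even elements are the matching values

% Encode each name and value separately, then join them as name=value
pairs = cell(1, numel(names));
for k = 1:numel(names)
    pairs{k} = [urlencode(names{k}) '=' urlencode(values{k})];
end

% Concatenate the pairs with '&' and drop the trailing separator
queryString = sprintf('%s&', pairs{:});
queryString(end) = [];
disp(queryString);
```

Encoding names and values individually, before adding the = and & separators, keeps those separators from being escaped themselves.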

Addressing Problem 3 – Response header

According to the HTTP protocol, each server response starts with a status line that carries a numeric status code and a short message. The following Matlab code provides access to the status line using the urlConnection object:

status = struct('value',urlConnection.getResponseCode(), 'msg',char(urlConnection.getResponseMessage))
status = 
    value: 200
      msg: 'OK'

urlConnection’s getHeaderField() and getHeaderFieldKey() methods enable reading the specific parts of the response header:

headerValue = char(urlConnection.getHeaderField(headerIndex));
headerName  = char(urlConnection.getHeaderFieldKey(headerIndex));

headerIndex starts at 0 and increases by 1 until both headerValue and headerName return empty.
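Put together, the loop might look like the following sketch. The URL is a placeholder, the collected names and values are simply stored in two cell arrays, and the code needs network access to return anything; if the server cannot be reached, the arrays stay empty.

```matlab
headerNames  = {};
headerValues = {};
try
    url = java.net.URL('http://www.mathworks.com');   % placeholder URL
    urlConnection = url.openConnection();

    headerIndex = 0;
    while true
        headerValue = char(urlConnection.getHeaderField(headerIndex));
        headerName  = char(urlConnection.getHeaderFieldKey(headerIndex));

        % Stop once both the name and the value come back empty
        if isempty(headerValue) && isempty(headerName)
            break;
        end

        headerNames{end+1}  = headerName;   %#ok<AGROW>
        headerValues{end+1} = headerValue;  %#ok<AGROW>
        headerIndex = headerIndex + 1;
    end
catch err
    warning('demo:headerRead', 'Could not read the response headers: %s', err.message);
end
```

Testing for both values being empty, rather than only the name, matters because an entry can have an empty name while still holding a value.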

It is important to note that header keys (names) can be repeated for different values. Sometimes this is desired, such as if there are multiple cookies being sent to the user. To generically handle this case, two header structures are returned. In both cases the header names are the field names in the structure, after replacing hyphens with underscores. In one case, allHeaders, the values are cell arrays of strings containing all values presented with the particular key. The other structure, firstHeaders, contains only the first instance of the header as a string to avoid needing to dereference a cell array.
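The two structures can be illustrated with a small offline sketch. The header names and values below are made up, and the code is only one possible way of building such structures, not necessarily how urlread2 does it internally.

```matlab
% Made-up response headers; note that Set-Cookie appears twice
names  = {'Content-Type', 'Set-Cookie', 'Set-Cookie'};
values = {'text/html',    'a=1',        'b=2'};

allHeaders   = struct();
firstHeaders = struct();
for k = 1:numel(names)
    % Hyphens are not allowed in field names, so replace them
    field = strrep(names{k}, '-', '_');
    if isfield(allHeaders, field)
        % Repeated key: append to the list of all values
        allHeaders.(field){end+1} = values{k};
    else
        % First occurrence: start the list and remember the first value
        allHeaders.(field)   = {values{k}};
        firstHeaders.(field) = values{k};
    end
end

disp(allHeaders.Set_Cookie);     % cell array holding both cookie values
disp(firstHeaders.Set_Cookie);   % only the first cookie value, as a string
```

With this layout, firstHeaders.Content_Type can be used directly as a string, while allHeaders.Set_Cookie keeps every value for keys that repeat.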

Addressing Problem 4 – Response body

urlread assumes text output. This is fine for most webpages, which use HTML and are therefore text-based. However, urlread fails when trying to download any non-text resource such as an image, a ZIP file, or a PDF document. I have added a flag in urlread2 called CAST_OUTPUT, which defaults to true, i.e. text response, just as urlread assumes. Using varargin, this flag can be set to false ({‘CAST_OUTPUT’,false}) to indicate a binary response.
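The difference between the two modes can be illustrated offline with a few stand-in bytes. This sketch does not use urlread2 and is not its internal code; it only shows what converting a response to text means, as opposed to keeping the raw bytes. The byte values and the temporary file are made up for the example.

```matlab
% Stand-in for the raw bytes of a server response
rawBytes = uint8([72 101 108 108 111]);

% Text response: decode the bytes into a Matlab string
asText = native2unicode(rawBytes, 'UTF-8');

% Binary response: keep the bytes untouched, e.g. to save them to a file
fileName = [tempname '.bin'];
fid = fopen(fileName, 'w');
fwrite(fid, rawBytes, 'uint8');
fclose(fid);

fileInfo = dir(fileName);   % the file size should match the byte count
```

For an image or a ZIP file, the character decoding step can alter the data, which is why such responses should be kept as bytes.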

Summary

urlread2’s functionality has been expanded to also address other limitations of urlread: It enables binary inputs, better character-set handling of the output, redirection following, and read timeouts.

The modifications described above provide direct access to the key components of the HTTP request and response messages. Its more generic nature lets urlread2 focus on HTTP transmission, and leaves request formation and response interpretation up to the user. I think ultimately this approach is better than providing one-off modifications of the original urlread function to suit a particular need. urlread2 and supporting files can be found on the Matlab File Exchange.

Categories: Guest bloggers, Java, Low risk of breaking in future versions, Stock Matlab function

9 Responses to Expanding urlread capabilities

  1. Andrew Walsh says:

    Thanks for this!
    I am trying to use urlread to scrape a website for stats

    eg.
    http://www.footywire.com/afl/footy/ft_match_statistics?mid=5343

    which works with a browser but not with urlread which returns

    I figure I can use urlread2 to do the refresh for me somehow, but as I am fairly inexperienced with scraping I don’t know exactly how to do this.
    I imagine it would be fairly easy.
    Can you give me a tip please?

  2. Richard says:

    I found it much easier to create a urlreadbin.m file based on urlread with a single line change.
    Edit urlread.m in the iofun directory using “open urlread.m” and change the following:

    %output = native2unicode(typecast(byteArrayOutputStream.toByteArray','uint8'),'UTF-8');
    output = typecast(byteArrayOutputStream.toByteArray','uint8'); %052012 oglraz

    Save as urlreadbin.m in the iofun directory.

    This eliminates the bothersome unicode conversion.

    This is only for reads but grabs jpg files without an issue.

  3. Alejo says:

    Hi, I’m trying to read an image using urlread2.
    The image is at http://192.168.1.20/snapshot.cgi. How can I do that? Or how can I do this with imread?
    Thanks!

  4. cglopez says:

    How can I use urlread2 with basic authentication?
    Thanks

  5. Raphael says:

    Hello, is the timeout property working ?

    I have matlab r2011b but cannot make it work.
    my query: urlread2('http://www.google.com', 'GET', '', [], 'READ_TIMEOUT', 100);

    I tried 1, 100, 1000 as I wasn’t sure of the unit, but all of them time out in ~20 seconds.

  6. Raphael says:

    is there a way to make the function work for local files/machines ?

    I am trying to find a way to check the connection to a local server faster than doing exist(path, 'dir'). This function has a 20s timeout.

  7. ibrahim says:

    I’m trying to post some data to a local web server, but I could not get any result. Could you help me identify the problem in my code?

    clear
    clc
    uname='ibrahim';
    email='test@tes.com';
    reemail= email;
    pswd='1234';
    confirm= pswd;
    counter=1;
    submit='submit';
     
    while (counter < 2)
        username=[uname,num2str(counter)];
        password=pswd;
        params = {username,password,email,reemail,confirm,'submit'};
        s=urlread('http://localhost/new_sdas/php_code/register.php','POST',params)
        counter=counter+1;
    end
    • Yair Altman says:

      @ibrahim – this is not a general-purpose blog but one that is devoted to undocumented and advanced aspects of Matlab. With regular Matlab questions you will find more luck at the Matlab answers forum.

      I would try checking if the web-page is accessible (and accepts POST queries) using a regular browser, before trying in Matlab’s urlread. It’s a localhost server so maybe you forgot to turn it on. Also, maybe it works via cookies, which urlread does not support.
