Speeding up Matlab-JDBC SQL queries

Many of my consulting projects involve interfacing a Matlab program to an SQL database. In such cases, using MathWorks’ Database Toolbox is a viable solution. Users who don’t have the toolbox can also easily connect directly to the database using either the standard ODBC bridge (which is horrible for performance and stability), or a direct JDBC connection (which is also what the Database Toolbox uses under the hood). I explained this Matlab-JDBC interface in detail in chapter 2 of my Matlab-Java programming book. A bare-bones implementation of an SQL SELECT query follows (data update queries are a bit different and will not be discussed here):

% Load the appropriate JDBC driver class into Matlab's memory
% (but not directly, to bypass JIT pre-processing - we must do it in run-time!)
driver = eval('com.mysql.jdbc.Driver');  % or com.microsoft.sqlserver.jdbc.SQLServerDriver or whatever
 
% Connect to DB
dbPort = '3306'; % mySQL=3306; SQLServer=1433; Oracle=...
connectionStr = ['jdbc:mysql://' dbURL ':' dbPort '/' schemaName];  % or ['jdbc:sqlserver://' dbURL ':' dbPort ';database=' schemaName ';'] or whatever
dbConnObj = java.sql.DriverManager.getConnection(connectionStr, username, password);
 
% Send an SQL query statement to the DB and get the ResultSet
stmt = dbConnObj.createStatement(java.sql.ResultSet.TYPE_SCROLL_INSENSITIVE, java.sql.ResultSet.CONCUR_READ_ONLY);
try stmt.setFetchSize(1000); catch, end  % the default fetch size is ridiculously small in many DBs
rs = stmt.executeQuery(sqlQueryStr);
 
% Get the column names and data-types from the ResultSet's meta-data
MetaData = rs.getMetaData;
numCols = MetaData.getColumnCount;
data = cell(0,numCols);  % initialize
for colIdx = numCols : -1 : 1
    ColumnNames{colIdx} = char(MetaData.getColumnLabel(colIdx));
    ColumnType{colIdx}  = char(MetaData.getColumnClassName(colIdx));  % http://docs.oracle.com/javase/7/docs/api/java/sql/Types.html
end
ColumnType = regexprep(ColumnType,'.*\.','');
 
% Get the data from the ResultSet into a Matlab cell array
rowIdx = 1;
while rs.next  % loop over all ResultSet rows (records)
    for colIdx = 1 : numCols  % loop over all columns in the row
        switch ColumnType{colIdx}
            case {'Float','Double'}
                data{rowIdx,colIdx} = rs.getDouble(colIdx);
            case {'Long','Integer','Short','BigDecimal'}
                data{rowIdx,colIdx} = double(rs.getDouble(colIdx));
            case 'Boolean'
                data{rowIdx,colIdx} = logical(rs.getBoolean(colIdx));
            otherwise %case {'String','Date','Time','Timestamp'}
                data{rowIdx,colIdx} = char(rs.getString(colIdx));
        end
    end
    rowIdx = rowIdx + 1;
end
 
% Close the connection and clear resources
try rs.close();   catch, end
try stmt.close(); catch, end
try dbConnObj.closeAllStatements(); catch, end
try dbConnObj.close(); catch, end  % comment this to keep the dbConnObj open and reuse it for subsequent queries

Naturally, in a real-world implementation you also need to handle database timeouts and various other errors, handle data-manipulation queries (not just SELECTs), etc.

Anyway, this works well in general, but when you try to fetch a ResultSet that has many thousands of records you start to feel the pain – The SQL statement may execute much faster on the DB server (the time it takes for the stmt.executeQuery call), yet the subsequent double-loop processing to fetch the data from the Java ResultSet object into a Matlab cell array takes much longer.

In one of my recent projects, performance was of paramount importance, and the DB query speed from the code above was simply not good enough. You might think that this was due to the fact that the data cell array is not pre-allocated, but this turns out to be incorrect: the speed remains nearly unaffected when you pre-allocate data properly. It turns out that the main problem is due to Matlab’s non-negligible overhead in calling methods of Java objects. Since the JDBC interface only enables retrieving a single data item at a time (in other words, bulk retrieval is not possible), we have a double loop over all the data’s rows and columns, in each case calling the appropriate Java method to retrieve the data based on the column’s type. The Java methods themselves are extremely efficient, but when you add Matlab’s invocation overheads the total processing time is much much slower.

So what can be done? As Andrew Janke explained in much detail, we basically need to push our double loop down into the Java level, so that Matlab receives arrays of primitive values, which can then be processed in a vectorized manner in Matlab.

So let’s create a simple Java class to do this:

// Copyright (c) Yair Altman UndocumentedMatlab.com
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Types;
 
public class JDBC_Fetch {
 
	public static int DEFAULT_MAX_ROWS = 100000;   // default cache size = 100K rows (if DB does not support non-forward-only ResultSets)
 
	public static Object[] getData(ResultSet rs) throws SQLException {
		try {
			if (rs.last()) {  // data is available
				int numRows = rs.getRow();    // row # of the last row
				rs.beforeFirst();             // get back to the top of the ResultSet
				return getData(rs, numRows);  // fetch the data
			} else {  // no data in the ResultSet
				return null;
			}
		} catch (Exception e) {
			return getData(rs, DEFAULT_MAX_ROWS);
		}
	}
 
	public static Object[] getData(ResultSet rs, int maxRows) throws SQLException {
		// Read column number and types from the ResultSet's meta-data
		ResultSetMetaData metaData = rs.getMetaData();
		int numCols = metaData.getColumnCount();
		int[] colTypes = new int[numCols+1];
		int numDoubleCols = 0;
		int numBooleanCols = 0;
		int numStringCols = 0;
		for (int colIdx = 1; colIdx <= numCols; colIdx++) {
			int colType = metaData.getColumnType(colIdx);
			switch (colType) {
				case Types.FLOAT:
				case Types.DOUBLE:
				case Types.REAL:
					colTypes[colIdx] = 1;  // double
					numDoubleCols++;
					break;
				case Types.DECIMAL:
				case Types.INTEGER:
				case Types.TINYINT:
				case Types.SMALLINT:
				case Types.BIGINT:
					colTypes[colIdx] = 1;  // double
					numDoubleCols++;
					break;
				case Types.BIT:
				case Types.BOOLEAN:
					colTypes[colIdx] = 2;  // boolean
					numBooleanCols++;
					break;
				default: // 'String','Date','Time','Timestamp',...
					colTypes[colIdx] = 3;  // string
					numStringCols++;
			}
		}
 
		// Loop over all ResultSet rows, reading the data into the 2D matrix caches
		int rowIdx = 0;
		double [][] dataCacheDouble  = new double [numDoubleCols] [maxRows];
		boolean[][] dataCacheBoolean = new boolean[numBooleanCols][maxRows];
		String [][] dataCacheString  = new String [numStringCols] [maxRows];
		while (rs.next() && rowIdx < maxRows) {
			int doubleColIdx = 0;
			int booleanColIdx = 0;
			int stringColIdx = 0;
			for (int colIdx = 1; colIdx <= numCols; colIdx++) {
				try {
					switch (colTypes[colIdx]) {
						case 1:  dataCacheDouble[doubleColIdx++][rowIdx]   = rs.getDouble(colIdx);   break;  // numeric
						case 2:  dataCacheBoolean[booleanColIdx++][rowIdx] = rs.getBoolean(colIdx);  break;  // boolean
						default: dataCacheString[stringColIdx++][rowIdx]   = rs.getString(colIdx);   break;  // string
					}
				} catch (Exception e) {
					System.out.println(e);
					System.out.println(" in row #" + rowIdx + ", col #" + colIdx);
				}
			}
			rowIdx++;
		}
 
		// Return only the actual data in the ResultSet
		int doubleColIdx = 0;
		int booleanColIdx = 0;
		int stringColIdx = 0;
		Object[] data = new Object[numCols];
		for (int colIdx = 1; colIdx <= numCols; colIdx++) {
			switch (colTypes[colIdx]) {
				case 1:   data[colIdx-1] = dataCacheDouble[doubleColIdx++];    break;  // numeric
				case 2:   data[colIdx-1] = dataCacheBoolean[booleanColIdx++];  break;  // boolean
				default:  data[colIdx-1] = dataCacheString[stringColIdx++];            // string
			}
		}
		return data;
	}
}

So now we have a JDBC_Fetch class that we can use in our Matlab code, replacing the slow double loop with a single call to JDBC_Fetch.getData(), followed by vectorized conversion into a Matlab cell array (matrix):

% Get the data from the ResultSet using the JDBC_Fetch wrapper
data = cell(JDBC_Fetch.getData(rs));
for colIdx = 1 : numCols
   switch ColumnType{colIdx}
      case {'Float','Double'}
          data{colIdx} = num2cell(data{colIdx});
      case {'Long','Integer','Short','BigDecimal'}
          data{colIdx} = num2cell(data{colIdx});
      case 'Boolean'
          data{colIdx} = num2cell(data{colIdx});
      otherwise %case {'String','Date','Time','Timestamp'}
          %data{colIdx} = cell(data{colIdx});  % no need to do anything here!
   end
end
data = [data{:}];

On my specific program the resulting speedup was 15x (this is not a typo: 15 times faster). My fetches are no longer limited by the Matlab post-processing, but rather by the DB’s processing of the SQL statement (where DB indexes, clustering, SQL tuning etc. come into play).

Additional speedups can be achieved by parsing dates at the Java level (rather than returning strings), as well as several other tweaks in the Java and Matlab code (refer to Andrew Janke’s post for some ideas). But certainly the main benefit (the 80% of the gain that was achieved in 20% of the worktime) is due to the above push of the main double processing loop down into the Java level, leaving Matlab with just a single Java call to JDBC_Fetch.

Many additional ideas of speeding up database queries and Matlab programs in general can be found in my second book, Accelerating Matlab Performance.

If you’d like me to help you speed up your Matlab program, please email me (altmany at gmail), or fill out the query form on my consulting page.

Categories: Java, Low risk of breaking in future versions, Toolbox, Undocumented feature

Tags: ,

Bookmark and SharePrint Print

5 Responses to Speeding up Matlab-JDBC SQL queries

  1. aviolov says:

    Cool,

    We do a similar thing using C# and DataTable

    import System.Data.DataTable
    import System.Data.SqlClient.*
    sqlConn = SqlConnection(...);
    sqlCmd = SqlCommand(sql_query_string, sqlConn);
    sqlAdaptor = SqlDataAdapter(sqlCmd);
    dataTable = DataTable();
    sqlConn.Open();
    sqlAdaptor.Fill(dataTable);
    T = dataTable2matlabTable(dataTable); % This does the same column based-fill as in your code.
    ...

    So I’m wondering is it really necessary to make the specific class (JDBC_Fetch.java) to iterate the rows of a ResultSet or does Java not have an equivalent to C#’s DataTable that kind of does that for you, i.e. a data-struct that returns columns instead of rows, roughly speaking, and you can already work on its columns instead of calling rs.next() in an extra for loop? Or maybe the equivalent does not give the necessary speed-up?

    best,

  2. virns says:

    Hi
    Please help me get rid of “jave heap memory error” in standalone executable (tried both windows and linux versions). I am creating a SQL connection (using JDBC) outside the infinity while loop and then checking with “isopen or isconnection” inside the loop. I read in some forum that the SQL JDBC driver caches the queries which could be the cause. I’ve also tried deleting unused variables in the loop but it didn’t help.

    Thanks

    • Try to close the ResultSet and stmt (as I’ve shown in the post) after you’ve used them, in order to clear the cache between subsequent JDBC calls, otherwise they might remain in memory and cause an eventual heap-memory error.

  3. Oleg says:

    I faced this issue with the Matlab Database Toolbox and tackled it by:
    * setting the ‘DataReturnFormat’ to ‘numeric’;
    * querying results by column and converting the numeric stream into its column format

Leave a Reply


Your email address will not be published. Required fields are marked *