- Undocumented Matlab - https://undocumentedmatlab.com -

Speeding up Matlab-JDBC SQL queries

Posted By Yair Altman On November 16, 2016 @ 13:43 In Java, Low risk of breaking in future versions, Toolbox, Undocumented feature | 9 Comments

Many of my consulting projects involve interfacing a Matlab program to an SQL database. In such cases, using MathWorks’ Database Toolbox is a viable solution. Users who don’t have the toolbox can also easily connect directly to the database using either the standard ODBC bridge (which is horrible for performance and stability), or a direct JDBC connection ^[1] (which is also what the Database Toolbox uses under the hood). I explained this Matlab-JDBC interface in detail in chapter 2 of my Matlab-Java programming book ^[2]. A bare-bones implementation of an SQL SELECT query follows (data update queries are a bit different and will not be discussed here):

% Load the appropriate JDBC driver class into Matlab's memory
% (but not directly, to bypass JIT pre-processing - we must do it in run-time!)
driver = eval('com.mysql.jdbc.Driver');  % or com.microsoft.sqlserver.jdbc.SQLServerDriver or whatever
% Connect to DB
dbPort = '3306'; % mySQL=3306; SQLServer=1433; Oracle=...
connectionStr = ['jdbc:mysql://' dbURL ':' dbPort '/' schemaName];  % or ['jdbc:sqlserver://' dbURL ':' dbPort ';database=' schemaName ';'] or whatever
dbConnObj = java.sql.DriverManager.getConnection(connectionStr, username, password);
% Send an SQL query statement to the DB and get the ResultSet
stmt = dbConnObj.createStatement(java.sql.ResultSet.TYPE_SCROLL_INSENSITIVE, java.sql.ResultSet.CONCUR_READ_ONLY);
try stmt.setFetchSize(1000); catch, end  % the default fetch size is ridiculously small in many DBs
rs = stmt.executeQuery(sqlQueryStr);
% Get the column names and data-types from the ResultSet's meta-data
MetaData = rs.getMetaData;
numCols = MetaData.getColumnCount;
data = cell(0,numCols);  % initialize
for colIdx = numCols : -1 : 1
    ColumnNames{colIdx} = char(MetaData.getColumnLabel(colIdx));
    ColumnType{colIdx}  = char(MetaData.getColumnClassName(colIdx));  % http://docs.oracle.com/javase/7/docs/api/java/sql/Types.html
end
ColumnType = regexprep(ColumnType,'.*\.','');
% Get the data from the ResultSet into a Matlab cell array
rowIdx = 1;
while rs.next  % loop over all ResultSet rows (records)
    for colIdx = 1 : numCols  % loop over all columns in the row
        switch ColumnType{colIdx}
            case {'Float','Double'}
                data{rowIdx,colIdx} = rs.getDouble(colIdx);
            case {'Long','Integer','Short','BigDecimal'}
                data{rowIdx,colIdx} = double(rs.getDouble(colIdx));
            case 'Boolean'
                data{rowIdx,colIdx} = logical(rs.getBoolean(colIdx));
            otherwise %case {'String','Date','Time','Timestamp'}
                data{rowIdx,colIdx} = char(rs.getString(colIdx));
        end
    end
    rowIdx = rowIdx + 1;
end
% Close the connection and clear resources
try rs.close();   catch, end
try stmt.close(); catch, end
try dbConnObj.closeAllStatements(); catch, end
try dbConnObj.close(); catch, end  % comment this to keep the dbConnObj open and reuse it for subsequent queries

Naturally, in a real-world implementation you also need to handle database timeouts and various other errors, handle data-manipulation queries (not just SELECTs), etc.
Anyway, this works well in general, but when you try to fetch a ResultSet that has many thousands of records you start to feel the pain: the SQL statement may execute quickly on the DB server (the time it takes for the stmt.executeQuery call), yet the subsequent double-loop processing to fetch the data from the Java ResultSet object into a Matlab cell array takes much longer.
In one of my recent projects, performance was of paramount importance, and the DB query speed from the code above was simply not good enough. You might think that this was due to the fact that the data cell array is not pre-allocated, but this turns out to be incorrect: the speed remains nearly unaffected when you pre-allocate data properly. It turns out that the main problem is Matlab’s non-negligible overhead in calling methods of Java objects. Since the JDBC interface only enables retrieving a single data item at a time (in other words, bulk retrieval is not possible), we have a double loop over all the data’s rows and columns, in each iteration calling the appropriate Java method to retrieve the data based on the column’s type. The Java methods themselves are extremely efficient, but when you add Matlab’s invocation overhead to every call, the total processing time becomes much longer.
So what can be done? As Andrew Janke explained ^[3] in much detail, we basically need to push our double loop down into the Java level, so that Matlab receives arrays of primitive values, which can then be processed in a vectorized manner in Matlab.
So let’s create a simple Java class to do this:

// Copyright (c) Yair Altman UndocumentedMatlab.com
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Types;
public class JDBC_Fetch {
	public static int DEFAULT_MAX_ROWS = 100000;   // default cache size = 100K rows (used when the row count cannot be determined, e.g. for forward-only ResultSets)
	public static Object[] getData(ResultSet rs) throws SQLException {
		try {
			if (rs.last()) {  // data is available
				int numRows = rs.getRow();    // row # of the last row
				rs.beforeFirst();             // get back to the top of the ResultSet
				return getData(rs, numRows);  // fetch the data
			} else {  // no data in the ResultSet
				return null;
			}
		} catch (Exception e) {
			return getData(rs, DEFAULT_MAX_ROWS);
		}
	}
	public static Object[] getData(ResultSet rs, int maxRows) throws SQLException {
		// Read column number and types from the ResultSet's meta-data
		ResultSetMetaData metaData = rs.getMetaData();
		int numCols = metaData.getColumnCount();
		int[] colTypes = new int[numCols+1];
		int numDoubleCols = 0;
		int numBooleanCols = 0;
		int numStringCols = 0;
		for (int colIdx = 1; colIdx <= numCols; colIdx++) {
			int colType = metaData.getColumnType(colIdx);
			switch (colType) {
				case Types.FLOAT:
				case Types.DOUBLE:
				case Types.REAL:
					colTypes[colIdx] = 1;  // double
					numDoubleCols++;
					break;
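				case Types.DECIMAL:
				case Types.NUMERIC:  // also reported as BigDecimal, so it must be fetched as a number to match the Matlab side
				case Types.INTEGER:
				case Types.TINYINT:
				case Types.SMALLINT:
				case Types.BIGINT:
					colTypes[colIdx] = 1;  // double
					numDoubleCols++;
					break;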
				case Types.BIT:
				case Types.BOOLEAN:
					colTypes[colIdx] = 2;  // boolean
					numBooleanCols++;
					break;
				default: // 'String','Date','Time','Timestamp',...
					colTypes[colIdx] = 3;  // string
					numStringCols++;
			}
		}
		// Loop over all ResultSet rows, reading the data into the 2D matrix caches
		int rowIdx = 0;
		double [][] dataCacheDouble  = new double [numDoubleCols] [maxRows];
		boolean[][] dataCacheBoolean = new boolean[numBooleanCols][maxRows];
		String [][] dataCacheString  = new String [numStringCols] [maxRows];
		while (rs.next() && rowIdx < maxRows) {
			int doubleColIdx = 0;
			int booleanColIdx = 0;
			int stringColIdx = 0;
			for (int colIdx = 1; colIdx <= numCols; colIdx++) {
				try {
					switch (colTypes[colIdx]) {
						case 1:  dataCacheDouble[doubleColIdx++][rowIdx]   = rs.getDouble(colIdx);   break;  // numeric
						case 2:  dataCacheBoolean[booleanColIdx++][rowIdx] = rs.getBoolean(colIdx);  break;  // boolean
						default: dataCacheString[stringColIdx++][rowIdx]   = rs.getString(colIdx);   break;  // string
					}
				} catch (Exception e) {
					System.out.println(e);
					System.out.println(" in row #" + rowIdx + ", col #" + colIdx);
				}
			}
			rowIdx++;
		}
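		// Trim unused cache rows (only needed when fewer than maxRows rows were actually read,
		// e.g. when the DEFAULT_MAX_ROWS fallback was used)
		if (rowIdx < maxRows) {
			for (int i = 0; i < numDoubleCols;  i++) dataCacheDouble[i]  = java.util.Arrays.copyOf(dataCacheDouble[i],  rowIdx);
			for (int i = 0; i < numBooleanCols; i++) dataCacheBoolean[i] = java.util.Arrays.copyOf(dataCacheBoolean[i], rowIdx);
			for (int i = 0; i < numStringCols;  i++) dataCacheString[i]  = java.util.Arrays.copyOf(dataCacheString[i],  rowIdx);
		}
		// Return only the actual data in the ResultSet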
		int doubleColIdx = 0;
		int booleanColIdx = 0;
		int stringColIdx = 0;
		Object[] data = new Object[numCols];
		for (int colIdx = 1; colIdx <= numCols; colIdx++) {
			switch (colTypes[colIdx]) {
				case 1:   data[colIdx-1] = dataCacheDouble[doubleColIdx++];    break;  // numeric
				case 2:   data[colIdx-1] = dataCacheBoolean[booleanColIdx++];  break;  // boolean
				default:  data[colIdx-1] = dataCacheString[stringColIdx++];            // string
			}
		}
		return data;
	}
}

So now we have a JDBC_Fetch class that we can use in our Matlab code, replacing the slow double loop with a single call to JDBC_Fetch.getData(), followed by vectorized conversion into a Matlab cell array (matrix):

% Get the data from the ResultSet using the JDBC_Fetch wrapper
data = cell(JDBC_Fetch.getData(rs));
for colIdx = 1 : numCols
   switch ColumnType{colIdx}
      case {'Float','Double'}
          data{colIdx} = num2cell(data{colIdx});
      case {'Long','Integer','Short','BigDecimal'}
          data{colIdx} = num2cell(data{colIdx});
      case 'Boolean'
          data{colIdx} = num2cell(data{colIdx});
      otherwise %case {'String','Date','Time','Timestamp'}
          %data{colIdx} = cell(data{colIdx});  % no need to do anything here!
   end
end
data = [data{:}];

On my specific program the resulting speedup was 15x (this is not a typo: 15 times faster). My fetches are no longer limited by the Matlab post-processing, but rather by the DB's processing of the SQL statement (where DB indexes, clustering, SQL tuning etc. come into play).
Additional speedups can be achieved by parsing dates at the Java level (rather than returning strings), as well as several other tweaks in the Java and Matlab code (refer to Andrew Janke's post for some ideas). But certainly the main benefit (the 80% of the gain that was achieved in 20% of the worktime) is due to the above push of the main double processing loop down into the Java level, leaving Matlab with just a single Java call to JDBC_Fetch.
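As a rough illustration of the date-parsing idea, here is a minimal sketch of a helper that reads a Timestamp column at the Java level and returns Matlab-style serial date numbers (datenum values) instead of strings. The class name JDBC_Dates and its methods are hypothetical and are not part of the JDBC_Fetch class above. The sketch assumes that the column can be read with getTimestamp, and it expresses the result in UTC (the value of Timestamp.getTime()), so you may need to add your local time-zone offset if you want wall-clock values:

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.Arrays;

// Hypothetical helper class (not part of the original JDBC_Fetch code)
public class JDBC_Dates {
    // Matlab's datenum value for 1970-01-01 00:00:00, i.e. the Java epoch
    private static final double DATENUM_OF_EPOCH = 719529.0;
    private static final double MILLIS_PER_DAY = 86400000.0;

    // Convert milliseconds since the Java epoch into a Matlab serial date number (in days)
    public static double toDatenum(long epochMillis) {
        return epochMillis / MILLIS_PER_DAY + DATENUM_OF_EPOCH;
    }

    // Read up to maxRows values of a Timestamp column as datenum doubles (NaN for SQL NULL)
    public static double[] getDateColumn(ResultSet rs, int colIdx, int maxRows) throws SQLException {
        double[] dates = new double[maxRows];
        int rowIdx = 0;
        while (rowIdx < maxRows && rs.next()) {
            Timestamp ts = rs.getTimestamp(colIdx);  // null when the DB value is NULL
            dates[rowIdx++] = (ts == null) ? Double.NaN : toDatenum(ts.getTime());
        }
        return Arrays.copyOf(dates, rowIdx);  // trim unused entries
    }
}
```

In Matlab, the returned double[] arrives as a numeric column vector that can be passed directly to datestr or similar functions, so no per-row string parsing is needed on the Matlab side.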
Many additional ideas for speeding up database queries and Matlab programs in general can be found in my second book, Accelerating Matlab Performance ^[4].
If you'd like me to help you speed up your Matlab program, please email me (altmany at gmail), or fill out the query form on my consulting page ^[5].

9 Comments To "Speeding up Matlab-JDBC SQL queries"

#1 Comment By aviolov On November 21, 2016 @ 10:24

Cool,

We do a similar thing using C# and DataTable

import System.Data.DataTable
import System.Data.SqlClient.*
sqlConn = SqlConnection(...);
sqlCmd = SqlCommand(sql_query_string, sqlConn);
sqlAdaptor = SqlDataAdapter(sqlCmd);
dataTable = DataTable();
sqlConn.Open();
sqlAdaptor.Fill(dataTable);
T = dataTable2matlabTable(dataTable); % This does the same column based-fill as in your code.
...

So I’m wondering is it really necessary to make the specific class (JDBC_Fetch.java) to iterate the rows of a ResultSet or does Java not have an equivalent to C#’s DataTable that kind of does that for you, i.e. a data-struct that returns columns instead of rows, roughly speaking, and you can already work on its columns instead of calling rs.next() in an extra for loop? Or maybe the equivalent does not give the necessary speed-up?

best,

#2 Comment By Yair Altman On November 21, 2016 @ 18:57

I do not believe that JDBC has such an easy-to-use wrapper, unfortunately.

#3 Comment By virns On November 21, 2016 @ 12:51

Hi
Please help me get rid of a “Java heap memory error” in a standalone executable (I tried both the Windows and Linux versions). I am creating a SQL connection (using JDBC) outside an infinite while loop and then checking with “isopen or isconnection” inside the loop. I read in some forum that the SQL JDBC driver caches the queries, which could be the cause. I’ve also tried deleting unused variables in the loop but it didn’t help.

Thanks

#4 Comment By Yair Altman On November 22, 2016 @ 08:43

Try to close the ResultSet and stmt (as I’ve shown in the post) after you’ve used them, in order to clear the cache between subsequent JDBC calls, otherwise they might remain in memory and cause an eventual heap-memory error.

#5 Comment By Oleg On November 21, 2016 @ 18:46

I faced this issue with the Matlab Database Toolbox and tackled it by:
* setting the ‘DataReturnFormat’ to ‘numeric’;
* querying results by column and converting the numeric stream into its column format

#6 Comment By Paul On May 9, 2017 @ 02:19

Hey there,

just wanted to say thank you for this page. I found a lot of valuable information here how to handle result sets from java in matlab.

Best Regards
Paul

#7 Comment By Jeremy On September 23, 2019 @ 19:57

Yair,

Thank you for your work, I will most likely be buying the book soon. But in the meantime I’m hitting a snag on setting up the JDBC connection on the fly. Can you provide a bit more explanation and/or hints on the steps required? I can use the Database Toolbox to do the setup and it works just fine, but I would rather be able to hardcode it so that my application/code is portable.

#8 Comment By Peter Raahauge On February 13, 2024 @ 13:01

Hi Yair,
I have a copyright question, and a comment to your statement above: “the speed remains nearly unaffected when you pre-allocate data properly.”
One of my colleagues is downloading data from Wharton Research Data Service (WRDS) for his research in Matlab.
WRDS provides a guide for Matlab users, and it basically includes an algorithm like the first one presented above for getting the data into Matlab’s memory.
The SQL tables he’s downloading typically have 3 million rows but very few variables.
Initially, he was complaining about the WRDS solution, as a query like the one above took 2-3 hours. However, once he got the data variable pre-allocated appropriately, the query was done in 5 minutes. That’s actually a factor of 30(!)
With your Java code, we can get the time down even more, by something similar to the factor of 15 you mention above.
This leads me to the copyright question: How should I deal with the copyright? We are only using your algorithm for research purposes and nothing commercial.

#9 Comment By Yair Altman On February 23, 2024 @ 16:02

@Peter – I’m happy to hear that you found my work useful. Feel free to use my algorithm and code for any non-commercial research usage. I just ask that you add references to undocumentedmatlab.com and my book ( ^[12], CRC Press, ISBN 9781482211290) in your published work.
