Many of my consulting projects involve interfacing a Matlab program to an SQL database. In such cases, using MathWorks’ Database Toolbox is a viable solution. Users who don’t have the toolbox can also easily connect directly to the database using either the standard ODBC bridge (which is horrible for performance and stability), or a direct JDBC connection (which is also what the Database Toolbox uses under the hood). I explained this Matlab-JDBC interface in detail in chapter 2 of my Matlab-Java programming book. A bare-bones implementation of an SQL SELECT query follows (data update queries are a bit different and will not be discussed here):
% Load the appropriate JDBC driver class into Matlab's memory % (but not directly, to bypass JIT pre-processing - we must do it in run-time!) driver = eval('com.mysql.jdbc.Driver'); % or com.microsoft.sqlserver.jdbc.SQLServerDriver or whatever % Connect to DB dbPort = '3306'; % mySQL=3306; SQLServer=1433; Oracle=... connectionStr = ['jdbc:mysql://' dbURL ':' dbPort '/' schemaName]; % or ['jdbc:sqlserver://' dbURL ':' dbPort ';database=' schemaName ';'] or whatever dbConnObj = java.sql.DriverManager.getConnection(connectionStr, username, password); % Send an SQL query statement to the DB and get the ResultSet stmt = dbConnObj.createStatement(java.sql.ResultSet.TYPE_SCROLL_INSENSITIVE, java.sql.ResultSet.CONCUR_READ_ONLY); try stmt.setFetchSize(1000); catch, end % the default fetch size is ridiculously small in many DBs rs = stmt.executeQuery(sqlQueryStr); % Get the column names and data-types from the ResultSet's meta-data MetaData = rs.getMetaData; numCols = MetaData.getColumnCount; data = cell(0,numCols); % initialize for colIdx = numCols : -1 : 1 ColumnNames{colIdx} = char(MetaData.getColumnLabel(colIdx)); ColumnType{colIdx} = char(MetaData.getColumnClassName(colIdx)); % http://docs.oracle.com/javase/7/docs/api/java/sql/Types.html end ColumnType = regexprep(ColumnType,'.*\.',''); % Get the data from the ResultSet into a Matlab cell array rowIdx = 1; while rs.next % loop over all ResultSet rows (records) for colIdx = 1 : numCols % loop over all columns in the row switch ColumnType{colIdx} case {'Float','Double'} data{rowIdx,colIdx} = rs.getDouble(colIdx); case {'Long','Integer','Short','BigDecimal'} data{rowIdx,colIdx} = double(rs.getDouble(colIdx)); case 'Boolean' data{rowIdx,colIdx} = logical(rs.getBoolean(colIdx)); otherwise %case {'String','Date','Time','Timestamp'} data{rowIdx,colIdx} = char(rs.getString(colIdx)); end end rowIdx = rowIdx + 1; end % Close the connection and clear resources try rs.close(); catch, end try stmt.close(); catch, end try dbConnObj.closeAllStatements(); catch, end try dbConnObj.close(); catch, end % comment this to keep the dbConnObj open and reuse it for subsequent queries |
Naturally, in a real-world implementation you also need to handle database timeouts and various other errors, handle data-manipulation queries (not just SELECTs), etc.
Anyway, this works well in general, but when you try to fetch a ResultSet that has many thousands of records you start to feel the pain – The SQL statement may execute much faster on the DB server (the time it takes for the stmt.executeQuery
call), yet the subsequent double-loop processing to fetch the data from the Java ResultSet
object into a Matlab cell array takes much longer.
In one of my recent projects, performance was of paramount importance, and the DB query speed from the code above was simply not good enough. You might think that this was due to the fact that the data
cell array is not pre-allocated, but this turns out to be incorrect: the speed remains nearly unaffected when you pre-allocate data
properly. It turns out that the main problem is due to Matlab’s non-negligible overhead in calling methods of Java objects. Since the JDBC interface only enables retrieving a single data item at a time (in other words, bulk retrieval is not possible), we have a double loop over all the data’s rows and columns, in each case calling the appropriate Java method to retrieve the data based on the column’s type. The Java methods themselves are extremely efficient, but when you add Matlab’s invocation overheads the total processing time is much much slower.
So what can be done? As Andrew Janke explained in much detail, we basically need to push our double loop down into the Java level, so that Matlab receives arrays of primitive values, which can then be processed in a vectorized manner in Matlab.
So let’s create a simple Java class to do this:
// Copyright (c) Yair Altman UndocumentedMatlab.com import java.sql.ResultSet; import java.sql.ResultSetMetaData; import java.sql.SQLException; import java.sql.Types; public class JDBC_Fetch { public static int DEFAULT_MAX_ROWS = 100000; // default cache size = 100K rows (if DB does not support non-forward-only ResultSets) public static Object[] getData(ResultSet rs) throws SQLException { try { if (rs.last()) { // data is available int numRows = rs.getRow(); // row # of the last row rs.beforeFirst(); // get back to the top of the ResultSet return getData(rs, numRows); // fetch the data } else { // no data in the ResultSet return null; } } catch (Exception e) { return getData(rs, DEFAULT_MAX_ROWS); } } public static Object[] getData(ResultSet rs, int maxRows) throws SQLException { // Read column number and types from the ResultSet's meta-data ResultSetMetaData metaData = rs.getMetaData(); int numCols = metaData.getColumnCount(); int[] colTypes = new int[numCols+1]; int numDoubleCols = 0; int numBooleanCols = 0; int numStringCols = 0; for (int colIdx = 1; colIdx <= numCols; colIdx++) { int colType = metaData.getColumnType(colIdx); switch (colType) { case Types.FLOAT: case Types.DOUBLE: case Types.REAL: colTypes[colIdx] = 1; // double numDoubleCols++; break; case Types.DECIMAL: case Types.INTEGER: case Types.TINYINT: case Types.SMALLINT: case Types.BIGINT: colTypes[colIdx] = 1; // double numDoubleCols++; break; case Types.BIT: case Types.BOOLEAN: colTypes[colIdx] = 2; // boolean numBooleanCols++; break; default: // 'String','Date','Time','Timestamp',... colTypes[colIdx] = 3; // string numStringCols++; } } // Loop over all ResultSet rows, reading the data into the 2D matrix caches int rowIdx = 0; double [][] dataCacheDouble = new double [numDoubleCols] [maxRows]; boolean[][] dataCacheBoolean = new boolean[numBooleanCols][maxRows]; String [][] dataCacheString = new String [numStringCols] [maxRows]; while (rs.next() && rowIdx < maxRows) { int doubleColIdx = 0; int booleanColIdx = 0; int stringColIdx = 0; for (int colIdx = 1; colIdx <= numCols; colIdx++) { try { switch (colTypes[colIdx]) { case 1: dataCacheDouble[doubleColIdx++][rowIdx] = rs.getDouble(colIdx); break; // numeric case 2: dataCacheBoolean[booleanColIdx++][rowIdx] = rs.getBoolean(colIdx); break; // boolean default: dataCacheString[stringColIdx++][rowIdx] = rs.getString(colIdx); break; // string } } catch (Exception e) { System.out.println(e); System.out.println(" in row #" + rowIdx + ", col #" + colIdx); } } rowIdx++; } // Return only the actual data in the ResultSet int doubleColIdx = 0; int booleanColIdx = 0; int stringColIdx = 0; Object[] data = new Object[numCols]; for (int colIdx = 1; colIdx <= numCols; colIdx++) { switch (colTypes[colIdx]) { case 1: data[colIdx-1] = dataCacheDouble[doubleColIdx++]; break; // numeric case 2: data[colIdx-1] = dataCacheBoolean[booleanColIdx++]; break; // boolean default: data[colIdx-1] = dataCacheString[stringColIdx++]; // string } } return data; } } |
So now we have a JDBC_Fetch
class that we can use in our Matlab code, replacing the slow double loop with a single call to JDBC_Fetch.getData()
, followed by vectorized conversion into a Matlab cell array (matrix):
% Get the data from the ResultSet using the JDBC_Fetch wrapper data = cell(JDBC_Fetch.getData(rs)); for colIdx = 1 : numCols switch ColumnType{colIdx} case {'Float','Double'} data{colIdx} = num2cell(data{colIdx}); case {'Long','Integer','Short','BigDecimal'} data{colIdx} = num2cell(data{colIdx}); case 'Boolean' data{colIdx} = num2cell(data{colIdx}); otherwise %case {'String','Date','Time','Timestamp'} %data{colIdx} = cell(data{colIdx}); % no need to do anything here! end end data = [data{:}]; |
On my specific program the resulting speedup was 15x (this is not a typo: 15 times faster). My fetches are no longer limited by the Matlab post-processing, but rather by the DB’s processing of the SQL statement (where DB indexes, clustering, SQL tuning etc. come into play).
Additional speedups can be achieved by parsing dates at the Java level (rather than returning strings), as well as several other tweaks in the Java and Matlab code (refer to Andrew Janke’s post for some ideas). But certainly the main benefit (the 80% of the gain that was achieved in 20% of the worktime) is due to the above push of the main double processing loop down into the Java level, leaving Matlab with just a single Java call to JDBC_Fetch
.
Many additional ideas of speeding up database queries and Matlab programs in general can be found in my second book, Accelerating Matlab Performance.
If you’d like me to help you speed up your Matlab program, please email me (altmany at gmail), or fill out the query form on my consulting page.
Cool,
We do a similar thing using C# and DataTable
So I’m wondering is it really necessary to make the specific class (JDBC_Fetch.java) to iterate the rows of a ResultSet or does Java not have an equivalent to C#’s DataTable that kind of does that for you, i.e. a data-struct that returns columns instead of rows, roughly speaking, and you can already work on its columns instead of calling rs.next() in an extra for loop? Or maybe the equivalent does not give the necessary speed-up?
best,
I do not believe that JDBC has such an easy-to-use wrapper, unfortunately.
Hi
Please help me get rid of “jave heap memory error” in standalone executable (tried both windows and linux versions). I am creating a SQL connection (using JDBC) outside the infinity while loop and then checking with “isopen or isconnection” inside the loop. I read in some forum that the SQL JDBC driver caches the queries which could be the cause. I’ve also tried deleting unused variables in the loop but it didn’t help.
Thanks
Try to close the
ResultSet
andstmt
(as I’ve shown in the post) after you’ve used them, in order to clear the cache between subsequent JDBC calls, otherwise they might remain in memory and cause an eventual heap-memory error.I faced this issue with the Matlab Database Toolbox and tackled it by:
* setting the ‘DataReturnFormat’ to ‘numeric’;
* querying results by column and converting the numeric stream into its column format
Hey there,
just wanted to say thank you for this page. I found a lot of valuable information here how to handle result sets from java in matlab.
Best Regards
Paul
Yair,
Thank you for your work, most likely buying the book soon. But in the meantime I’m hitting a snag on setting up the JDBC connection on the fly. Can you provide a bit more explanation and or hints on the steps required? I can use database toolbox to do the setup and it works just fine, but I would rather be able to hardcode it so that my application/code is portable.
Hi Yair,
I have a copyright question, and a comment to your statement above: “the speed remains nearly unaffected when you pre-allocate data properly.”
One of my colleges are downloading data from Wharton Research Data Service (WRDS) for his research in Matlab.
WRDS are providing a guide for Matlab-users, and it basically includes an algorithm like the first one presented above for getting the data into Matlab’s memory.
The SQL-tables he’s downloading typically have 3 million rows but very few variables.
Initially, he was complaining about the WRDS-solution, as a query like the one above took 2-3 hours. However, once he got the data variable pre-allocated appropriately, the query was done in 5 minutes. That’s actually a factor 30(!)
With your Java-code, we can get the time down even more with something similar to the factor 15 you mention above.
This leads me to the copyright question: How should I deal with the copyright? We are only using your algorithm for research purposes and nothing commercial.
@Peter – I’m happy to hear that you found my work useful. Feel free to use my algorithm and code for any non-commercial research usage. I just ask that you add references to undocumentedmatlab.com and my book (Accelerating Matlab Performance, CRC Press, ISBN 9781482211290) in your published work.