During my quest to find the fastest method to get data from Java to SQL Server, I have noticed that the fastest Java method I can come up with is still 12 times slower than using BULK INSERT.
My data is being generated from within Java, and BULK INSERT only supports reading data from a text file, so using BULK INSERT is not an option unless I output my data to a temporary text file. This, in turn, would of course be a huge performance hit.
When inserting from Java, insert speeds are around 2500 rows per second, even when I start measuring after the for loop, just before the executeBatch. So "creating" the data in memory is not the bottleneck.
When inserting with BULK INSERT, insert speeds are around 30000 rows per second.
Both tests have been done on the server, so the network is not a bottleneck either. Any clue as to why BULK INSERT is faster? And can the same performance be attained from within Java?
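For anyone who wants to reproduce the rows-per-second figures, here is a minimal sketch of how such a measurement could be taken. The class and method names (`Throughput`, `rowsPerSecond`, `timeNanos`) are hypothetical and not part of my actual test; the idea is simply to time only the `executeBatch()` and `commit()` calls and divide the row count by the elapsed time.

```java
// Hypothetical helper, for illustration only.
class Throughput {

    // Converts a row count and an elapsed time in nanoseconds to rows per second.
    static double rowsPerSecond(long rows, long elapsedNanos) {
        return rows * 1_000_000_000.0 / elapsedNanos;
    }

    // Runs the given work once and returns the elapsed wall-clock time in nanoseconds.
    static long timeNanos(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return System.nanoTime() - start;
    }
}
```

As an example of the arithmetic, 10000 rows in 4 seconds would correspond to 2500 rows per second, and 30000 / 2500 gives the factor of 12 mentioned at the top. Since `Runnable.run()` cannot throw `SQLException`, the JDBC calls would have to be wrapped in a try/catch inside the lambda.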
This is just a big dataset that needs to get loaded once. So it would be OK to temporarily disable any kind of logging (already tried simple logging), disable indexes (table has none), locking, whatever, ...
My test setup so far:
Database:
CREATE TABLE TestTable
( Col1 varchar(50)
, Col2 int);
Java:
// This seems to be essential to get good speeds, otherwise batching is not used.
conn.setAutoCommit(false);
PreparedStatement prepStmt = conn.prepareStatement("INSERT INTO TestTable (Col1, Col2) VALUES (?, ?)");
for (int i = 1; i <= 10000; i++) {
    prepStmt.setString(1, "X");
    prepStmt.setInt(2, 100);
    prepStmt.addBatch();
}
prepStmt.executeBatch();
conn.commit();
BULK INSERT:
-- A text file containing "X 100" over and over again... so the same data as generated in Java
bulk insert TestTable FROM 'c:\test\test.txt';
[...] executeBatch() every 100 rows or so.
[...] BULK INSERT locally from SQL Server Management Studio, it can communicate with the DB using Local Named Pipes protocol, which is way faster than JDBC over TCP/IP (even within localhost). Also, BULK INSERT is designed and optimized for loading massive amounts of data, so it's really not a fair comparison. However, (based on the provided snippet) it looks like you're re-declaring the prepared statement for each batch; you could declare it only once at the beginning to save some time. Also, commit only once, after all batches have been processed.
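The suggestions above (prepare the statement once, call executeBatch() every 100 rows or so, commit once at the end) could be sketched roughly as follows. This is only an illustration under assumptions: the class and method names (`ChunkedBatchInsert`, `flushCount`, `insertRows`) are hypothetical, the table and values are taken from the question's test setup, and whether chunking actually improves throughput with the SQL Server JDBC driver has not been measured here.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper, for illustration only.
class ChunkedBatchInsert {

    // Number of executeBatch() calls needed to send totalRows rows in chunks of batchSize.
    static int flushCount(int totalRows, int batchSize) {
        return (totalRows + batchSize - 1) / batchSize;
    }

    // Inserts totalRows identical rows, flushing every batchSize rows and committing once.
    // Returns the number of executeBatch() calls that were made.
    static int insertRows(Connection conn, int totalRows, int batchSize) throws SQLException {
        conn.setAutoCommit(false); // one transaction for the whole load
        int flushes = 0;
        // The statement is prepared once and reused for every chunk.
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO TestTable (Col1, Col2) VALUES (?, ?)")) {
            int pending = 0;
            for (int i = 1; i <= totalRows; i++) {
                ps.setString(1, "X");
                ps.setInt(2, 100);
                ps.addBatch();
                pending++;
                if (pending == batchSize) {
                    ps.executeBatch(); // send this chunk to the server
                    flushes++;
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch(); // send the final partial chunk
                flushes++;
            }
        }
        conn.commit(); // commit only once, after all chunks have been sent
        return flushes;
    }
}
```

With the question's 10000 rows and a chunk size of 100, `insertRows(conn, 10000, 100)` would make 100 calls to executeBatch() and a single commit.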