
I'm running into a peculiar issue when trying to process large Excel files (300 MB+) using a data reader. The following code illustrates how I open the Excel file and iterate over the rows in the sheet 'largesheet$':

using System.Data;
using System.Data.OleDb;

const string inputFilePath = @"C:\largefile.xlsx";
const string connectionString =
    "Provider=Microsoft.ACE.OLEDB.12.0;Extended Properties=\"Excel 12.0;IMEX=1;HDR=YES;\";Data Source=" +
    inputFilePath;

// Initialize connection
using (var connection = new OleDbConnection(connectionString))
{
    // Open connection
    connection.Open();

    // Configure command
    var command = new OleDbCommand("largesheet$", connection) {CommandType = CommandType.TableDirect};

    // Execute reader
    var reader = command.ExecuteReader(); // <-- Completely loads file/sheet into memory

    // Iterate results; Read() advances to the next row and returns false once all rows are consumed
    while (reader.Read())
    {
        // Process the current row
        // ...
    }

    // Close connection
    connection.Close();
}

My understanding is that this should open the Excel file and load each row only when requested by the reader.Read() statement.
However, it appears that the ExecuteReader() statement does more than return an OleDbDataReader instance. Using breakpoints I noticed that this single statement takes 30+ seconds, and the Windows Resource Monitor shows a steady increase in allocated memory while it executes.
Specifying the CommandBehavior parameter (e.g. SequentialAccess) of the ExecuteReader() method has no effect.

What am I doing wrong here? Are there alternative ways of processing large (Excel) files?

Note: the IMEX & HDR extended properties of the connection string are intentional.

Edit: After some further thought, I assume it is not possible to process an Excel file without buffering it one way or another. Since Excel files are essentially a glorified collection of compressed XML files, a worksheet cannot be processed without decompressing it (and keeping it in RAM or temporarily saving it to disk).
The only alternative I can think of is using Microsoft.Office.Interop.Excel. I'm not sure how Open XML handles it, though.
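
Just to confirm that assumption, here is a minimal sketch (reusing the file path from above) that lists the parts inside the .xlsx package with System.IO.Compression; the worksheet XML has to be inflated from one of these entries before any row can be read:

using System;
using System.IO.Compression;

// An .xlsx file is a ZIP package; the worksheet data lives in compressed XML parts.
using (var archive = ZipFile.OpenRead(@"C:\largefile.xlsx"))
{
    foreach (var entry in archive.Entries)
    {
        // Typical entries: xl/workbook.xml, xl/worksheets/sheet1.xml, xl/sharedStrings.xml, ...
        Console.WriteLine($"{entry.FullName}: {entry.Length} bytes (compressed: {entry.CompressedLength})");
    }
}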

1 Answer


From MSDN: "All rows and columns of the named table or tables will be returned when you call one of the Execute methods of a Command object." (under the Remarks section). So this would appear to be the default behavior of ExecuteReader().

ExecuteReader(CommandBehavior) may give you more options, particularly when CommandBehavior is set to SequentialAccess, though you would need to handle reading at the byte level.
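
As a rough sketch of that pattern (building on the command from the question, and assuming the columns come back as text, which IMEX=1 usually forces; per the question, this did not appear to change the provider's up-front buffering):

using (var reader = command.ExecuteReader(CommandBehavior.SequentialAccess))
{
    while (reader.Read())
    {
        // With SequentialAccess the columns must be read in ordinal order,
        // and long values are streamed in chunks instead of materialized at once.
        for (int i = 0; i < reader.FieldCount; i++)
        {
            if (reader.IsDBNull(i))
                continue;

            var buffer = new char[4096];
            long offset = 0;
            long charsRead;
            while ((charsRead = reader.GetChars(i, offset, buffer, 0, buffer.Length)) > 0)
            {
                // Process buffer[0..charsRead) here
                offset += charsRead;
            }
        }
    }
}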


2 Comments

As I stated in my question, the use of SequentialAccess makes no difference (I have only tried this in combination with TableDirect). I find it peculiar that ExecuteReader loads the data into memory; I thought one of the advantages of data readers was their agility and performance. Does the .NET Framework offer any alternatives that are more suitable for processing large data files?
Have you considered using OpenXML (msdn.microsoft.com/en-us/library/office/bb448854.aspx) for your tasks? This may give you more control, depending on what you need to do with the data once it is loaded. You could use the SAX-like reading provided by the SDK to control how the file is read.
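
For reference, a minimal sketch of that SAX-like reading with OpenXmlReader (sheet name and file path taken from the question; treat the details as assumptions rather than a drop-in solution). Note that shared strings live in a separate part and may still need to be loaded to resolve cell values:

using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

// Open the package read-only; only the parts that are actually read get inflated.
using (var document = SpreadsheetDocument.Open(@"C:\largefile.xlsx", false))
{
    var workbookPart = document.WorkbookPart;

    // Locate the worksheet part for the sheet named "largesheet"
    var sheet = workbookPart.Workbook.Descendants<Sheet>()
        .First(s => s.Name?.Value == "largesheet");
    var worksheetPart = (WorksheetPart)workbookPart.GetPartById(sheet.Id.Value);

    // OpenXmlReader walks the sheet XML element by element instead of
    // building the whole worksheet DOM in memory.
    using (var reader = OpenXmlReader.Create(worksheetPart))
    {
        while (reader.Read())
        {
            if (reader.ElementType == typeof(Row) && reader.IsStartElement)
            {
                // LoadCurrentElement() materializes just this one row.
                var row = (Row)reader.LoadCurrentElement();
                foreach (var cell in row.Elements<Cell>())
                {
                    // Cells of type SharedString hold an index into xl/sharedStrings.xml.
                    var value = cell.CellValue?.Text;
                    // ...
                }
            }
        }
    }
}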
