BULK INSERT / OPENROWSET FormatFile Terminator for CSV file with , (comma) in the data

Question

I've written a nice import for my million row CSV that works quite nicely (using OPENROWSET BULK (I didn't use BULK INSERT because I need to cross join with some other columns). The formatfile uses a comma as the Terminator.

Here is an example of the CSV I was using to develop:

Reference, Name, Street
1,Dave Smith, 1 Test Street
2,Sally SMith,1 Test Street

Once I'd got it working, someone reminded me that the data itself could have a comma in it, whoops!!!!

Reference, Name, Street
"1","Dave Smith", "1 Test Street"
"2","Sally Smith","1,Test Street" <-comma in street

How do people deal with a CSV with a comma in the data using FormatFiles? (Or do I just say that the file must be TAB delimited)?

Most import utilities I'm aware of will ignore commas (or other delimiters) inside of strings - they consider it part of the string. Besides which, what happens if there ends up being a tab character in the string anyways (between the quotes)? You'd end up with the same problem. See if you can specify the string delimiter, and you're probably going to be fine. — Clockwork-Muse
– Clockwork-Muse, Commented Jul 16, 2012 at 15:32

Tim · Accepted Answer · 2012-07-16 17:51:53Z

2

If your field terminator can occur in the data, then it may be best to use TAB or PIPE delimiter (or whatever works for your data).

If a terminator character occurs within the data, it is interpreted as a terminator, not as data, and the data after that character is interpreted as belonging to the next field or record. Therefore, choose your terminators carefully to make sure that they never appear in your data

http://msdn.microsoft.com/en-us/library/ms191485.aspx

edited Jul 16, 2012 at 17:51

answered Jul 16, 2012 at 17:43

Tim

8,99031 gold badges111 silver badges197 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Fuck StackOverflow · Accepted Answer · 2016-01-30 20:33:07Z

Delimiters within data fields is a common problem with delimited files. Some common tactics to address this include:

Recreate the data file with all occurrences of the delimiter stripped out of the data fields before they are written to file: this eliminates OPENROWSET errors, but does not preserve the integrity of the data.
Recreate the data file with a different delimiter character: in my experience, a tab delimiter is a better choice. It's less common to encounter a tab character than a comma within data. But it's certainly not unheard of. I've seen tabs within data too.
Enclose data fields in double quotes: this requires some tweaks to the XML format file.

Manually editing the data file might be do-able for any of the above options. But it can be tedious, especially for large files. (Just opening a file of several GB's in Notepad.exe is an exercise in patience.) Realistically, you'd want the author to re-create it for you. Option #1 should always "work". But again, there's the data integrity issue you may not be able to live with. Option #2 will probably work for many cases, but it's not bulletproof. Option #3 isn't bulletproof either (it's always possible to have a delimiter within a data field), but it's about as close as you can get. Plus, it preserves data integrity.

Here is one possibility for your XML format file:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharTerm" TERMINATOR='","' MAX_LENGTH="5"/>
  <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='","' MAX_LENGTH="128" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='"\r\n' MAX_LENGTH="128" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="Reference" xsi:type="SQLVARYCHAR"/>
  <COLUMN SOURCE="2" NAME="Name" xsi:type="SQLVARYCHAR"/>
  <COLUMN SOURCE="3" NAME="Street" xsi:type="SQLVARYCHAR"/>
 </ROW>
</BCPFORMAT>

Notice the FIELD TERMINATOR: I used single quotes to identify "," as the terminator and "\r\n as the row terminator (the COLUMN 3 terminator). I made an educated guess that Name and Street are up to 128 characters--edit that as needed.

Problems:

OPENROWSET() queries will return Reference with a leading " double quote character. And because of that...
Reference cannot be returned as an INT (or SMALLINT, BIGINT, etc.). It gets returned as a VARCHAR (xsi:type="SQLVARYCHAR")

For the particular data sample provided, I'd remove the double-quotes from Reference data fields, adjust the XML format file so that FIELD ID="1" has TERMINATOR=',"', and further adjust the XML format file so that COLUMN SOURCE="1" has xsi:type="SQLINT".

For some additional info, this blog post may help: Getting Started With OPENROWSET and the BULK Rowset Provider - Part 2

Collectives™ on Stack Overflow

BULK INSERT / OPENROWSET FormatFile Terminator for CSV file with , (comma) in the data

2 Answers 2

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related