1

I've written a nice import for my million row CSV that works quite nicely (using OPENROWSET BULK (I didn't use BULK INSERT because I need to cross join with some other columns). The formatfile uses a comma as the Terminator.

Here is an example of the CSV I was using to develop:

Reference, Name, Street
1,Dave Smith, 1 Test Street
2,Sally SMith,1 Test Street

Once I'd got it working, someone reminded me that the data itself could have a comma in it, whoops!!!!

Reference, Name, Street
"1","Dave Smith", "1 Test Street"
"2","Sally Smith","1,Test Street" <-comma in street

How do people deal with a CSV with a comma in the data using FormatFiles? (Or do I just say that the file must be TAB delimited)?

1
  • Most import utilities I'm aware of will ignore commas (or other delimiters) inside of strings - they consider it part of the string. Besides which, what happens if there ends up being a tab character in the string anyways (between the quotes)? You'd end up with the same problem. See if you can specify the string delimiter, and you're probably going to be fine. Commented Jul 16, 2012 at 15:32

2 Answers 2

2

If your field terminator can occur in the data, then it may be best to use TAB or PIPE delimiter (or whatever works for your data).

If a terminator character occurs within the data, it is interpreted as a terminator, not as data, and the data after that character is interpreted as belonging to the next field or record. Therefore, choose your terminators carefully to make sure that they never appear in your data

http://msdn.microsoft.com/en-us/library/ms191485.aspx

Sign up to request clarification or add additional context in comments.

Comments

0

Delimiters within data fields is a common problem with delimited files. Some common tactics to address this include:

  1. Recreate the data file with all occurrences of the delimiter stripped out of the data fields before they are written to file: this eliminates OPENROWSET errors, but does not preserve the integrity of the data.
  2. Recreate the data file with a different delimiter character: in my experience, a tab delimiter is a better choice. It's less common to encounter a tab character than a comma within data. But it's certainly not unheard of. I've seen tabs within data too.
  3. Enclose data fields in double quotes: this requires some tweaks to the XML format file.

Manually editing the data file might be do-able for any of the above options. But it can be tedious, especially for large files. (Just opening a file of several GB's in Notepad.exe is an exercise in patience.) Realistically, you'd want the author to re-create it for you. Option #1 should always "work". But again, there's the data integrity issue you may not be able to live with. Option #2 will probably work for many cases, but it's not bulletproof. Option #3 isn't bulletproof either (it's always possible to have a delimiter within a data field), but it's about as close as you can get. Plus, it preserves data integrity.

Here is one possibility for your XML format file:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharTerm" TERMINATOR='","' MAX_LENGTH="5"/>
  <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='","' MAX_LENGTH="128" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='"\r\n' MAX_LENGTH="128" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="Reference" xsi:type="SQLVARYCHAR"/>
  <COLUMN SOURCE="2" NAME="Name" xsi:type="SQLVARYCHAR"/>
  <COLUMN SOURCE="3" NAME="Street" xsi:type="SQLVARYCHAR"/>
 </ROW>
</BCPFORMAT>

Notice the FIELD TERMINATOR: I used single quotes to identify "," as the terminator and "\r\n as the row terminator (the COLUMN 3 terminator). I made an educated guess that Name and Street are up to 128 characters--edit that as needed.

Problems:

  1. OPENROWSET() queries will return Reference with a leading " double quote character. And because of that...
  2. Reference cannot be returned as an INT (or SMALLINT, BIGINT, etc.). It gets returned as a VARCHAR (xsi:type="SQLVARYCHAR")

For the particular data sample provided, I'd remove the double-quotes from Reference data fields, adjust the XML format file so that FIELD ID="1" has TERMINATOR=',"', and further adjust the XML format file so that COLUMN SOURCE="1" has xsi:type="SQLINT".

For some additional info, this blog post may help: Getting Started With OPENROWSET and the BULK Rowset Provider - Part 2

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.