
I've looked at a number of questions on this site and cannot find an answer to this question: how do you create multiple new tables in a database (in my case I am using PostgreSQL) from multiple CSV source files, such that the new database table columns accurately reflect the data within the CSV columns?

I can write the CREATE TABLE syntax just fine, and I can read the rows/values of a CSV file, but does a method already exist to inspect a CSV file and accurately determine the column types? Before I build my own, I wanted to check whether this already exists.

If it doesn't exist already, my idea would be to use Python, CSV module, and psycopg2 module to build a python script that would:

  1. Read the CSV file(s).
  2. Based upon a subset of records (10-100 rows?), iteratively inspect each column of each row to automatically determine the right column type for the data in the CSV. For example, if row 1, column A had a value of 12345 (int), but row 2, column A had a value of ABC (varchar), the system would automatically determine that the column should be varchar(5), based upon the combination of data it found in the first two passes. This process could repeat as many times as the user felt necessary to determine the likely type and size of the column.
  3. Build the CREATE TABLE query as defined by the column inspection of the CSV.
  4. Execute the create table query.
  5. Load the data into the new table.
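If nothing off the shelf turns up, steps 1-3 above could be sketched roughly like this. This is a minimal, hypothetical sketch using only the `csv` module; the type-detection rules, function names, and sample size are my own assumptions, and steps 4-5 would follow with psycopg2:

```python
import csv

def infer_type(values):
    """Guess a PostgreSQL type for one column's sampled string values."""
    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    def is_float(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "text"
    if all(is_int(v) for v in non_empty):
        return "integer"
    if all(is_float(v) for v in non_empty):
        return "numeric"
    # Fall back to varchar sized to the longest value seen in the sample.
    return "varchar(%d)" % max(len(v) for v in non_empty)

def build_create_table(path, table, sample_size=100):
    """Read the header plus a sample of rows and emit a CREATE TABLE statement."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [row for _, row in zip(range(sample_size), reader)]
    columns = list(zip(*rows))  # transpose rows into per-column value lists
    types = [infer_type(list(col)) for col in columns]
    cols = ", ".join('"%s" %s' % (name, typ) for name, typ in zip(header, types))
    return 'CREATE TABLE "%s" (%s);' % (table, cols)
```

For the 12345/ABC example above, the first column would come out as `varchar(5)` because the sample is neither all-integer nor all-float.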

Does a tool like this already exist within SQL, PostgreSQL, or Python, or is there another application I should be using to accomplish this (similar to pgAdmin3)?

  • Are you trying to automate the creation of tables like this to account for multiple CSV sources? Each source will have its own table? Otherwise I'd think the best data inspection device would be the Mark I Eyeball. Commented Nov 5, 2012 at 19:34
  • KungFoo is right. The only other option in SQL would be the import/export wizard and using the "Suggest Types" button, it will take a sample and do its best to figure out what columns should be what. Commented Nov 5, 2012 at 19:40
  • Most programs that deal with the import of data into a database have suggest-types implementation of this. If you're going to write this yourself, I suggest looking into how excel/access/tableau/etc do this. (tableau looks at the first 16 lines of a file to determine the type.) Commented Nov 5, 2012 at 19:41
  • @iKnowKungFoo, yes, I have multiple CSV source files downloaded from multiple sources that I would like to automate the table creation for. Commented Nov 5, 2012 at 20:08
  • Well, as others have said: it is relatively useless. It could save you some typing, but you still have to understand the meaning of the columns. And csv files often have more or less meaningless column names. And: even if the import were automatic you'd still have to understand the meaning of the contents, and how it relates to your existing tables. Normally you gain this knowledge while massaging the input files into the shape you want. The work is not in the typing, but in the understanding. Commented Nov 5, 2012 at 23:49

3 Answers


I have been dealing with something similar, and ended up writing my own module to sniff datatypes by inspecting the source file. There is some wisdom among all the naysayers, but there can also be reasons this is worth doing, particularly when you don't have any control over the input data format (e.g. working with government open data), so here are some things I learned in the process:

  1. Even though it's very time consuming, it's worth running through the entire file rather than a small sample of rows. More time is wasted by having a column flagged as numeric that turns out to have text every few thousand rows and therefore fails to import.
  2. If in doubt, fail over to a text type, because it's easier to cast those to numeric or date/times later than to try and infer the data that was lost in a bad import.
  3. Check for leading zeroes in what appear otherwise to be integer columns, and import them as text if there are any - this is a common issue with ID / account numbers.
  4. Give yourself some way of manually overriding the automatically detected types for some columns, so that you can blend some semantic awareness with the benefits of automatically typing most of them.
  5. Date/time fields are a nightmare, and in my experience generally require manual processing.
  6. If you ever add data to this table later, don't attempt to repeat the type detection - get the types from the database to ensure consistency.

If you can avoid having to do automatic type detection it's worth avoiding it, but that's not always practical so I hope these tips are of some help.
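Tips 1-3 above could be sketched like this. A minimal, hypothetical illustration (the function name and the exact classification rules are my own assumptions, not the module described in the answer):

```python
def sniff_column(values):
    """Classify one column's string values per tips 1-3:
    scan everything, watch for leading zeroes, and fall back to text."""
    saw_value = False
    all_int = True
    for v in values:                      # tip 1: scan the whole column, not a sample
        if v == "":
            continue
        saw_value = True
        if len(v) > 1 and v[0] == "0":    # tip 3: '0123' is likely an ID/account number
            return "text"
        try:
            int(v)
        except ValueError:
            all_int = False               # a stray text value demotes the column
    if saw_value and all_int:
        return "integer"
    return "text"                         # tip 2: when in doubt, text
```

Note how a single non-numeric value anywhere in the column, even thousands of rows down, demotes it to text, which is exactly the failure mode tip 1 is guarding against.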


It seems that you need to know the structure up front. Just read the first line to find out how many columns you have.

CSV does not carry any type information, so it has to be deduced from the context of data.

Improving on the slightly wrong answer before, you can create a temporary table with x number of text columns, fill it up with data and then process the data.

BEGIN;
CREATE TEMPORARY TABLE foo(a TEXT, b TEXT, c TEXT, ...) ON COMMIT DROP;
COPY foo FROM 'file.csv' WITH CSV;
<do the work>
END;

A word of warning: the file needs to be accessible by the PostgreSQL server process itself, which creates some security issues. Another option is to feed it through STDIN.
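The STDIN route could look something like this from Python, using psycopg2's `copy_expert` (which runs `COPY ... FROM STDIN` with a client-side file). The function name, table name, and column list below are placeholders:

```python
def load_csv(cur, path, table="foo", columns=("a", "b", "c")):
    """Stage a CSV into a temporary all-text table via COPY ... FROM STDIN.

    `cur` is expected to be a psycopg2 cursor; copy_expert(sql, file)
    streams the file from the client, so the server never needs
    filesystem access to it.
    """
    cols = ", ".join("%s TEXT" % c for c in columns)
    cur.execute("CREATE TEMPORARY TABLE %s(%s)" % (table, cols))
    with open(path) as f:
        cur.copy_expert("COPY %s FROM STDIN WITH CSV" % table, f)
```

After this, the "do the work" step would cast the text columns into properly typed columns of a permanent table.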

HTH

3 Comments

That's better. :) BTW, you may not want to restrict the lifetime of the temp table to a transaction. It dies at the end of the session anyways. Also, for repeated use, you may want to create a persistent type or dummy table and CREATE TEMPORARY TABLE foo(LIKE template_tbl).
Yeah, my previous answer was typed in too quickly without verification. You didn't have to rob me of few points tho, comment was sufficient for me to realise that it was a fsck up ;)
Naaa, the downvote was just as deserved as my upvote now. You shouldn't "type in too quickly without verification" to begin with. That's what downvotes are for: warn people that it's no good. Never personal. :)

Although this is quite an old question, it doesn't seem to have a satisfying answer, and I was struggling with the exact same issue. With the arrival of SQL Server Management Studio 2018 edition - and probably somewhat before that - a pretty good solution was offered by Microsoft.

  1. In SSMS on a database node in the object explorer, right-click, select 'Tasks' and choose 'Import data';
  2. Choose 'Flat file' as source and, in the General section, browse to your .csv file. An important note here: make sure there's no table in your target SQL Server matching the file's name;
  3. In the Advanced section, click on 'Suggest types' and in the next dialog, enter preferably the total number of rows in your file or, if that's too much, a large enough number to cover all possible values (this takes a while);
  4. Click next, and in the subsequent step, connect to your SQL server. Now, every brand has its own flavour of data types, but you should get a nice set of relevant pointers for your taste later on. I've tested this using the SQL Server Native Client 11.0. Please leave your comments for other providers as a reply to this solution;
  5. Here it comes... click 'Edit Mappings'...;
  6. click 'Edit SQL' et voila, a nice SQL statement with all the discovered data types;
  7. Click through to the end, selecting 'Run immediately' to see all of your .csv columns created with appropriate types in your SQL server.

Extra: if you run the above steps twice, exactly the same way with the same file, the first pass will use the 'CREATE TABLE...' statement, but the second pass will skip table creation. If you save the second run as an SSIS (Integration Services) package, you can later re-run the entire setup without scanning the .csv file.
