0

Greetings Support Community,

I have about 10 million+ files that I am trying to load into MySQL database using the following script:

WORKING_DIR=/tmp
FILE1="*test*"
timestamp_format="%Y-%m-%d %H:%i:%s.%x"

for i in ${WORKING_DIR}/${FILE1}
do
    if [ -f "$i" ]; then
    mysql -uroot -ptest my_database --local-infile=1<<-SQL
    SET sql_log_bin=0;
    LOAD DATA LOCAL INFILE '${i}' INTO TABLE my_table
    FIELDS TERMINATED BY ','
    OPTIONALLY ENCLOSED BY '\"'
    LINES  TERMINATED BY '\n'
    IGNORE 1 LINES
    (id, transaction_id, app_id, sub_id);
    SQL
    fi
done

Its an extremely slow process. After about 24 hours, I've only been able to load about 2 million records. In each file, there is one record. At this rate, this will complete in about 5 days. Is there a faster way of doing this? E.g. Should I concatenate the files before processing?

Any suggestion to improve loading this data into MySQL would be greatly appreciated.

Thanks!

7
  • If the suggestion is to concatenate the files before processing, how would I efficiently concatenate the 10 million+ files? Thank you! Commented Sep 20, 2016 at 18:46
  • what operating system? Are they all in one directory? As an aside, why would a file contain 1 row? Oh, /tmp, Linux. Commented Sep 20, 2016 at 19:46
  • CentOS 7. They are all in one directory. Thank you!! Commented Sep 20, 2016 at 19:47
  • Alright I know just the guy to ask. Give me a bit. Commented Sep 20, 2016 at 19:47
  • I see you have IGNORE 1 LINES - does that mean that each file has a header row? If you concatenate the files you may need to remove the header row. Commented Sep 20, 2016 at 20:01

1 Answer 1

1

You ask (in a comment) how to concatenate your files. That would be

cat /tmp/*test1*

though apparently you actually want to omit the first line from each:

awk 'FNR>1' /tmp/*test1*

How to make your SQL version read from standard input is beyond my competence. If you can't, maybe save the output to a temporary file, and process that.

If you get "argument list too long" maybe try

find /tmp -maxdepth 1 -type f -name '*test1*' -exec awk 'FNR>1' {} +

The -maxdepth 1 says not to descend into subdirectories; take it out if that's not what you want.

The -exec with a plus might not be available on really old systems; try with \; in its place if you get a syntax error (though there can be a rather unpleasant performance penalty).

I don't see that the variables made anything clearer, easier, more readable, or more mainatainable, so I simply took them out.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.