This has been asked many times before but I simply can't implement the solutions properly. I have a large csv named 2017-01.csv, with a date column (it's the second column in the file) and I am splitting the file by date. The original file looks like:

 date
 2017-01-01
 2017-01-01
 2017-01-01
 2017-01-02
 2017-01-02
 2017-01-02

After the split, 2017-01-01.csv looks like

2017-01-01
2017-01-01
2017-01-01

and 2017-01-02.csv looks like

2017-01-02
2017-01-02
2017-01-02

The code I am using is

awk -F ',' '{print > (""$2".csv")}' 2017-01.csv

Everything works fine but I need to keep the header row. So I tried

awk -F ',' 'NR==1; NR > 1{print > (""$2".csv")}' 2017-01.csv

But I still get the same results without the header row. What am I doing wrong? I read answers to many similar questions on Stackoverflow but I just can't understand what they are doing.

I want this:

2017-01-01.csv should look like

date
2017-01-01
2017-01-01
2017-01-01

2017-01-02.csv should look like

date
2017-01-02
2017-01-02
2017-01-02
  • Your input and output file names look the same. Is that a typo or intentional? Please confirm. Commented Jul 27, 2018 at 16:33
  • I have edited it again to make it clear. The input and output files are different. Let me know if it makes sense now. Thanks. Commented Jul 27, 2018 at 16:35
  • Please check my answer and let me know if that helps you? Commented Jul 27, 2018 at 16:39
  • The "" in your script is doing nothing; you could just remove it. Edit your question to provide sample input/output that more truly represents your real multi-column data so we can help you. Commented Jul 27, 2018 at 20:55

2 Answers

awk -F, '
FNR==1{hdr=$2}
 FNR > 1{
   if (! hdrPrinted[$2]){
      print hdr > (""$2".csv")
      hdrPrinted[$2]=$2
  }
  print $1, $2, $3> (""$2".csv")
}' 2017-01.csv

And as a 1-liner

awk -F, ' FNR==1{hdr=$2} FNR > 1{ if (! hdrPrinted[$2]){ print hdr > (""$2".csv"); hdrPrinted[$2]=$2; } print $1, $2, $3> (""$2".csv") }' 2017-01.csv

Produces output

cat 2017-01-01.csv
date
  2017-01-01
  2017-01-01
  2017-01-01

cat 2017-01-02.csv
date
  2017-01-02
  2017-01-02
  2017-01-02

Note that FNR is the record number within the current input file, so it resets to 1 each time a new file is opened. This may cause problems in specific cases, but generally I think it is the better approach, because it lets you list multiple files on the command line and handle them all in one awk process.
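
To make the FNR behaviour concrete, here is a minimal sketch comparing FNR with NR over two inputs. The directory fnr_demo, the file names a.csv and b.csv, and their contents are made up for illustration.

```shell
# Scratch directory for the hypothetical sample files.
mkdir -p fnr_demo

# Two tiny made-up inputs, each with its own header row.
printf 'id,date\n1,2017-01-01\n' > fnr_demo/a.csv
printf 'id,date\n2,2017-02-01\n' > fnr_demo/b.csv

# NR keeps counting across files; FNR restarts at 1 for each file.
awk '{ print FILENAME, NR, FNR }' fnr_demo/a.csv fnr_demo/b.csv > fnr_demo/counts.txt
cat fnr_demo/counts.txt
```

Here NR runs from 1 to 4 while FNR goes 1, 2, 1, 2, which is why the FNR==1 test catches the header row of every input file.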

-----------------

Per reasonable comments below, here is a more bullet-proof version, which should cope when more than 20 output files would otherwise be open at once.

I don't have an easy way to test this, so feedback is welcome.

AND per comments below, it still needs some work, which I don't have time for right now. Look for update Saturday afternoon.

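awk -F, ' FNR==1{hdr=$2}  FNR > 1{
      out = $2 ".csv"
      if (! (out in isOpen) ) {
             # count open files by hand; length(array) assumes a newish awk
             if (nOpen >= 20) {
                    # close the oldest file that is still open
                    old = openFiles[++j]
                    close(old)
                    delete isOpen[old]
                    delete openFiles[j]
                    nOpen--
             }
             # put the filename into the openFiles array (once per open)
             openFiles[++i] = out
             isOpen[out] = 1
             nOpen++
      }
      if (! hdrPrinted[$2]){
             # first sight of this date: ">" truncates, then the header goes in
             print hdr > out
             hdrPrinted[$2]=$2
      }
      # ">>" so that a file reopened after close() is appended to, not truncated
      print $1, $2, $3 >> out
}' 2017-01.csv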

IHTH

Edit by Ed Morton:

awk -F, '
FNR==1 { hdr=$0; next}
{
    out = $2 ".csv"
    if (!seen[out]++) {
        print hdr > out
    }
    print >> out
    close(out)
}
' file
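
A quick way to check the version above against multi-column data is a throwaway sample. In this sketch the directory split_demo, the sample rows, and the dir variable are all made up for illustration; only the awk logic comes from the answer.

```shell
# Scratch directory and a hypothetical 3-column input, date in column 2.
# The dates are interleaved on purpose to exercise the append path.
mkdir -p split_demo
cat > split_demo/2017-01.csv <<'EOF'
id,date,value
1,2017-01-01,a
2,2017-01-02,b
3,2017-01-01,c
EOF

awk -F, -v dir=split_demo '
FNR==1 { hdr=$0; next }            # remember the full header row
{
    out = dir "/" $2 ".csv"        # one output file per date
    if (!seen[out]++) {
        print hdr > out            # first time: truncate and write the header
    }
    print >> out                   # then append the whole record
    close(out)                     # never more than one output file open
}
' split_demo/2017-01.csv

cat split_demo/2017-01-01.csv
```

Because each file is closed right after every write, this trades some speed for not depending on how many files awk can keep open at once.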

14 Comments

Thanks. But I get syntax error for the '=' in hdrPrinted[$2]=$2.
Oops, did an inplace update without dbl-checking that it really worked ;-( . Here's revised. The date values are indented because field $1 is included in the output, as you indicated you need to include other fields in your real problem. Good luck.
So you'll still have to play with what you assign to hdr and what fields (and in what order) you want to output. A printf("%s\t%s\n", $1, $2) sort of statement will give you a lot more flexibility in your output.
that is basic cmd and should work. Please send output of uname -svr ; awk --version.
Okay, your code worked, but only if I save it as an awk file and run it as sh file.awk. At first I was trying to put all your code on one line and run it directly in bash, and that didn't work. Any idea why? How can I change it into a one-liner that I can copy/paste into bash directly?
0

The following is tested on a csv containing multiple columns with column two set to date:

awk -F',' 'prev!=$2{close(prev".csv");print "date" > ($2".csv")}{print $2 > ($2".csv");prev=$2}' Input_file

hth
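
If the real file has more columns, the same close-on-change idea can carry the whole header row and the whole record. This is only a sketch: it assumes the rows are already grouped by date (a date that reappears later would truncate its file), and the directory grouped_demo, the sample rows, and the dir variable are made up for illustration.

```shell
# Scratch directory and a hypothetical 3-column input, grouped by date.
mkdir -p grouped_demo
cat > grouped_demo/2017-01.csv <<'EOF'
id,date,value
1,2017-01-01,a
2,2017-01-01,b
3,2017-01-02,c
EOF

awk -F, -v dir=grouped_demo '
NR==1 { header=$0; next }                        # keep the full header row
$2 != prev {                                     # the date changed
    if (prev != "") close(dir "/" prev ".csv")   # done with the previous file
    out = dir "/" $2 ".csv"
    print header > out                           # start the new file with the header
    prev = $2
}
{ print > out }                                  # whole record, all columns
' grouped_demo/2017-01.csv

cat grouped_demo/2017-01-01.csv
```

Only one output file is open at any moment, so the number of distinct dates does not matter as long as the grouping assumption holds.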

3 Comments

Your code works but has the same problem as with Ravinder's answer. I have multiple columns other than "date". So I need my header to be more than just "date". Could you tell me how to print NR==1{header=$0; next} before beginning of each csv?
Surely it's obvious that if your real input file has more columns than just date then the sample input you provided for us to test a potential solution against should also have more columns than just date.
It was supposed to be a working example, not a copy/paste from my original dataset. I did mention I have more than 1 column. So no, it wasn't obvious to me. Thank you for your help.
