Split CSV by column value, and keep header

Question

This has been asked many times before but I simply can't implement the solutions properly. I have a large csv named 2017-01.csv, with a date column (it's the second column in the file) and I am splitting the file by date. The original file looks like:

 date
 2017-01-01
 2017-01-01
 2017-01-01
 2017-01-02
 2017-01-02
 2017-01-02

After the split, 2017-01-01.csv looks like

2017-01-01
2017-01-01
2017-01-01

and 2017-01-02.csv looks like

2017-01-02
2017-01-02
2017-01-02

The code I am using is

awk -F ',' '{print > (""$2".csv")}' 2017-01.csv

Everything works fine but I need to keep the header row. So I tried

awk -F ',' 'NR==1; NR > 1{print > (""$2".csv")}' 2017-01.csv

But I still get the same results without the header row. What am I doing wrong? I read answers to many similar questions on Stackoverflow but I just can't understand what they are doing.

I want this:

2017-01-01.csv should look like

date
2017-01-01
2017-01-01
2017-01-01

2017-01-02.csv should look like

date
2017-01-02
2017-01-02
2017-01-02

Your input and output file names look the same. Is that a typo or intentional? Please confirm.
– RavinderSingh13, Commented Jul 27, 2018 at 16:33
I have edited it again to make it clear. The input and output files are different. Let me know if it makes sense now. Thanks.
– Mishal Ahmed, Commented Jul 27, 2018 at 16:35
The "" in your script is doing nothing; you could just remove it. Please edit your question to provide sample input/output that more truly represents your real multi-column data so we can help you.
– Ed Morton, Commented Jul 27, 2018 at 20:55

shellter · Accepted Answer · last edited by Ed Morton 2018-07-28 03:42:23Z

4

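awk -F, '
# first line of each input file: remember the header (column 2 only)
FNR==1{hdr=$2}
FNR > 1{
   # write the header once per output file
   if (! hdrPrinted[$2]){
      print hdr > (""$2".csv")
      hdrPrinted[$2]=$2
   }
   # print joins the fields with the default output separator (a space)
   print $1, $2, $3 > (""$2".csv")
}' 2017-01.csv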

And as a 1-liner

awk -F, ' FNR==1{hdr=$2} FNR > 1{ if (! hdrPrinted[$2]){ print hdr > (""$2".csv"); hdrPrinted[$2]=$2; } print $1, $2, $3> (""$2".csv") }' 2017-01.csv

Produces output

cat 2017\-01\-01.csv
date
  2017-01-01
  2017-01-01
  2017-01-01

cat 2017\-01\-02.csv
date
  2017-01-02
  2017-01-02
  2017-01-02

Note that FNR is the record number within the current input file, so it resets to 1 each time awk starts reading a new file (NR, by contrast, keeps counting across all input files). This may cause problems in specific cases, but generally I think it is the better approach, since it lets you list multiple files on the cmd line and handle them all in one awk process.
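To make the FNR/NR difference concrete, here is a small sketch. The two file names are made up for the demo and the files are created on the spot; nothing here comes from the original data.

```shell
# Two small throwaway files (hypothetical names, used only for this demo)
printf 'h\na\nb\n' > fnr_demo_1.txt
printf 'h\nc\n' > fnr_demo_2.txt

# For every record, print the file name, NR (running total) and FNR (per-file count)
awk '{ print FILENAME, NR, FNR }' fnr_demo_1.txt fnr_demo_2.txt > fnr_demo_out.txt
cat fnr_demo_out.txt
```

The last column drops back to 1 on the first line of the second file while the middle column keeps counting, which is why FNR==1 catches the header of every input file and NR==1 only catches the first.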

-----------------

Per reasonable comments below, here is a more bullet-proof version, which should deal with the case where more than 20 output files would otherwise be open at the same time.

I don't have an easy way to test this, so feedback is welcome.

AND per comments below, it still needs some work, which I don't have time for right now. Look for update Saturday afternoon.

awk -F, ' FNR==1{hdr=$2}  FNR > 1{
      out = $2 ".csv"
      if (! (out in isOpen) ) {
             if ( nOpen >= 20 ) {
                    # close the oldest open file and forget it
                    old = openFiles[++j]
                    close(old)
                    delete isOpen[old]
                    delete openFiles[j]
                    nOpen--
             }
             # remember the file name, in the order the files were opened
             openFiles[++i] = out
             isOpen[out] = 1
             nOpen++
      }
      if (! hdrPrinted[out]++ ) {
             # first time this date is seen: create the file with the header
             print hdr > out
             close(out)
      }
      # append, so a file that was closed and reopened is not truncated
      print $1, $2, $3 >> out
}' 2017-01.csv

IHTH

Edit by Ed Morton:

awk -F, '
FNR==1 { hdr=$0; next}
{
    out = $2 ".csv"
    if (!seen[out]++) {
        print hdr > out
    }
    print >> out
    close(out)
}
' file
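As a sketch of how the version above behaves on multi-column data, the following runs the same idea against a made-up three-column file. The file name sample.csv, its contents, and the "split_" prefix on the output names are assumptions for the demo only; the extra close(out) after the header is an added precaution so that the same file is never written through both > and >> while it is open.

```shell
# Hypothetical sample data: three columns, with the date in column 2
cat > sample.csv <<'EOF'
id,date,value
1,2017-01-01,a
2,2017-01-02,b
3,2017-01-01,c
EOF

awk -F, '
FNR==1 { hdr=$0; next }          # remember the full header line, then skip it
{
    out = "split_" $2 ".csv"     # the "split_" prefix is only for this demo
    if (!seen[out]++) {
        print hdr > out          # first row for this date: create the file with the header
        close(out)
    }
    print >> out                 # append the unmodified input line
    close(out)                   # keep at most one output file open
}
' sample.csv

cat split_2017-01-01.csv
```

Because each line is appended and the file is closed straight away, the rows do not need to be grouped by date, at the cost of one open/close per input line.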

edited Jul 28, 2018 at 3:42 by Ed Morton

answered Jul 27, 2018 at 19:31 by shellter


14 Comments

Mishal Ahmed Over a year ago

Thanks. But I get syntax error for the '=' in hdrPrinted[$2]=$2.

shellter Over a year ago

Oops, did an inplace update without dbl-checking that it really worked ;-( . Here's revised. The date values are indented because field $1 is included in the output, as you indicated you need to include other fields in your real problem. Good luck.

shellter Over a year ago

So you'll still have to play with what you assign to hdr and which fields (and in what order) you want to output. A printf("%s\t%s\n", $1, $2) sort of statement will give you a lot more flexibility in your output.
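A rough illustration of the output-formatting point, using one made-up record (the file names are hypothetical): print with commas between the fields joins them with OFS, which defaults to a space, while printf spells out the separators and the field order explicitly.

```shell
# One made-up record with three comma-separated fields
printf '1,2017-01-01,a\n' > ofs_demo.csv

# Default OFS is a single space, so the commas are lost on output
awk -F, '{ print $1, $2, $3 }' ofs_demo.csv > ofs_default.txt

# Setting OFS to "," keeps the output comma-separated
awk -F, -v OFS=, '{ print $1, $2, $3 }' ofs_demo.csv > ofs_comma.txt

# printf names the separator and the field order explicitly
awk -F, '{ printf("%s\t%s\n", $2, $1) }' ofs_demo.csv > ofs_printf.txt

cat ofs_default.txt ofs_comma.txt ofs_printf.txt
```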

shellter Over a year ago

that is basic cmd and should work. Please send output of uname -svr ; awk --version.

Mishal Ahmed Over a year ago

Okay, your code worked, but only if I save it as an awk file and run it as sh file.awk. At first, I was trying to put all your code on one line and run it directly in bash, and that didn't work. Any idea why? How can I change it into a one-liner that I can copy/paste into bash directly?


AHT · Accepted Answer · 2018-07-27 19:56:40Z

0

The following is tested on a csv containing multiple columns with column two set to date:

awk -F',' 'prev!=$2{close(prev".csv");print "date" > ($2".csv")}{print $2 > ($2".csv");prev=$2}' Input_file

hth

answered Jul 27, 2018 at 19:56 by AHT

3 Comments

Mishal Ahmed Over a year ago

Your code works but has the same problem as with Ravinder's answer. I have multiple columns other than "date". So I need my header to be more than just "date". Could you tell me how to print NR==1{header=$0; next} before beginning of each csv?

Ed Morton Over a year ago

Surely it's obvious that if your real input file has more columns than just date then the sample input you provided for us to test a potential solution against should also have more columns than just date.

Mishal Ahmed Over a year ago

It was supposed to be a working example, not a copy/paste from my original dataset. I did mention I have more than 1 column. So no, it wasn't obvious to me. Thank you for your help.
