I am looking for a one-liner that can be run in a Linux terminal and does the following.

Takes as input a tab-separated file (TSV) with many columns (~100) and produces a two-column TSV where the first column holds each column name and the second column holds that column's distinct values. Minimal example below.

Input:

main_pos first_pos second_pos
e1 green round
e2 green square

Expected output:

column_name distinct_values
main_pos e1,e2
first_pos green
second_pos round,square

A header in the output is not really required; just the column names and distinct values alone would also suffice.


what have you tried so far?

why a one-liner, what difference does it make?

Do you have to do this in a shell script? It would be much easier in a language with better datatypes.

What you want is an associative array of sets: the array keyed by column names, the sets containing the unique values. bash has associative arrays, but the values are just strings and it doesn't have sets. You may be able to emulate sets with delimited strings, but doing it in a one-liner would be complex (almost anything can be a one-liner in shell, if you don't care how long the line is).

awk also has associative arrays, but it also doesn't have sets or nested arrays.
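The delimited-string emulation mentioned above can be sketched in bash (multi-line for readability, not a one-liner; the sample file data.tsv is recreated inline for demonstration):

```shell
#!/usr/bin/env bash
# Emulate an "associative array of sets": `seen` answers membership
# questions via a composite "column:value" key, while `vals` accumulates
# the comma-delimited distinct values per column index.
printf 'main_pos\tfirst_pos\tsecond_pos\ne1\tgreen\tround\ne2\tgreen\tsquare\n' > data.tsv

declare -A seen vals   # seen: membership test; vals: delimited "set" per column
declare -a names       # column names from the header row
nr=0
while IFS=$'\t' read -r -a row; do
    if (( nr++ == 0 )); then
        names=("${row[@]}")     # first line: remember the header
        continue
    fi
    for i in "${!row[@]}"; do
        key="$i:${row[i]}"
        if [[ -z ${seen[$key]} ]]; then   # append only the first occurrence
            seen[$key]=1
            vals[$i]+=${vals[$i]:+,}${row[i]}
        fi
    done
done < data.tsv

for i in "${!names[@]}"; do
    printf '%s\t%s\n' "${names[i]}" "${vals[$i]}"
done
# prints:
#   main_pos    e1,e2
#   first_pos   green
#   second_pos  round,square
```

Values appear in first-seen order because each is appended exactly once; requires bash 4+ for `declare -A`.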

It won't be a one-liner, but doing this in Python will be relatively easy, especially using pandas. See Replicating GROUP_CONCAT for pandas.DataFrame for example.
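As a rough sketch of the Python route, here is a dependency-free version using only the stdlib csv module rather than pandas (data.tsv is the question's sample, recreated inline):

```shell
# Recreate the sample input from the question
printf 'main_pos\tfirst_pos\tsecond_pos\ne1\tgreen\tround\ne2\tgreen\tsquare\n' > data.tsv

out=$(python3 - <<'EOF'
import csv

with open("data.tsv", newline="") as f:
    header, *rows = list(csv.reader(f, delimiter="\t"))

for i, name in enumerate(header):
    # dict.fromkeys de-duplicates while keeping first-seen order
    distinct = dict.fromkeys(row[i] for row in rows)
    print(name, ",".join(distinct), sep="\t")
EOF
)
printf '%s\n' "$out"
# prints:
#   main_pos    e1,e2
#   first_pos   green
#   second_pos  round,square
```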

@Jeffin Rockey: no, what you're actually looking for is someone else to do your homework.

Using a couple of generally useful tools - csvjson from the csvkit bundle to convert the CSV input into a JSON document, and good old jq to do the heavy lifting:

$ csvjson -t data.tsv |
    jq -r 'map(to_entries) | flatten | group_by(.key)
         | map([.[0].key, (map(.value) | unique | join(","))] | @tsv)[]'
first_pos       green
main_pos        e1,e2
second_pos      round,square

csvjson transforms a CSV/TSV file into an array of objects, one per row, with the column names as field names:

$ csvjson -i2 -t data.tsv
[
  {
    "main_pos": "e1",
    "first_pos": "green",
    "second_pos": "round"
  },
  {
    "main_pos": "e2",
    "first_pos": "green",
    "second_pos": "square"
  }
]

The jq expression takes that, turns each object into an array of key/value objects (one per field of each object), and flattens them all back into a single array. It then groups the elements by key; for each group it collects the unique values, joins them into a comma-separated string, and finally converts each row back to TSV for output.

perl -F\\t -lE'map$%[$_]{$F[$_]}++,keys@$ or@$=@F}{say"$$[$_]\t",join",",keys%{$%[$_]}for keys@$' input.tsv

awk probably needs to be a bit longer:

awk -F\\t '{for(i=n;i;--i)s[i,$i]++||v[i]=v[i]c$i}!n{n=split($0,k)}NR==2{c=","}END{for(j in k)print k[j]FS v[j]}' input.tsv

or, if using busybox or if row order is important:

awk -F\\t '{for(i=n;i;--i)s[i,$i]++||(v[i]=v[i]c$i)}!n{n=split($0,k)}NR==2{c=","}END{while(j++<NF)print k[j]FS v[j]}' input.tsv

Note that simply storing any script in a file, making it executable, and ensuring it is in your path, obviates the need for one-liners in most cases. Presumably you don't expect to type out the entire source-code of less on the command-line each time you use it, nor assign its source-code to an alias.


perl -F\\t -lE '
    map $%[$_]{$F[$_]}++, keys @$
      or @$ = @F
  }{
    say "$$[$_]\t", join ",", keys %{$%[$_]}
      for keys @$
' input.tsv
  • @F is array of current row's column values, indexed by column number
    • automatically populated by splitting input lines with -F regex
  • @$ is array of columns of first row, indexed by column number
  • @% is an array of hashes (the distinct values of each column, taken from rows after the first), indexed by column number
  • map builds a list that is discarded, but has the side-effect of counting the column values as they are found. On the first line @$ is empty, so keys @$ yields nothing, the map produces an empty list, and the expression is false, which triggers @$ to be initialised from @F (the rhs of or)
  • with the -n option (implied by -F), }{ ... makes ... run after all input has been processed
    • loop over indices of @$ printing lines built from the corresponding element of @$ and the list of keys from the corresponding hash element of @%
  • note: elements of the "distinct_values" column appear in apparently-random order since result of keys on a hash is not sorted
awk -F\\t '
    {
        for (i = n; i; --i)
            s[i,$i]++ || (v[i] = v[i] c $i)
    }
    !n { n = split($0,k) }
    NR==2 { c = "," }
    END {
        while (j++<NF)
            print k[j] FS v[j]
    }
' input.tsv
  • k is array of columns of first row, indexed by column number

  • s is an array (hash) whose keys pair a column index with each distinct value seen so far in that column (from rows after the first), and whose values count how many times each has been seen

  • v[i] stores the comma-delimited string built from the distinct values of the ith column (rows after the first)

  • when reading the first line, n is not yet set, so nothing is added to v; k is then generated from the header

  • differences from the shorter awk version:

    • the shorter version uses a||b=c, but busybox awk needs a||(b=c)
    • uses while instead of for(j in k) to ensure output row i corresponds to input column i
      • in standard awk, for (j in k) returns elements of k in an unspecified order
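For reference, running the order-preserving variant on the question's sample data (recreated inline) prints the columns in their original order:

```shell
# Recreate the sample input from the question
printf 'main_pos\tfirst_pos\tsecond_pos\ne1\tgreen\tround\ne2\tgreen\tsquare\n' > data.tsv

out=$(awk -F'\t' '
    {for(i=n;i;--i)s[i,$i]++||(v[i]=v[i]c$i)}
    !n{n=split($0,k)}
    NR==2{c=","}
    END{while(j++<NF)print k[j]FS v[j]}
' data.tsv)
printf '%s\n' "$out"
# prints:
#   main_pos    e1,e2
#   first_pos   green
#   second_pos  round,square
```

The distinct values also come out in first-seen order here, since each new value is appended to v[i] as it is encountered.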

You can use Ruby, which is widely available on most Linux distros:

ruby -e 'puts "column_name\tdistinct_values"
$<.read.split(/\R/).
    map{|e| e.split(/\t/)}.
    transpose.
    map{|a| [a[0],a[1..].uniq ]}.
    each{|sa| puts "#{sa[0]}\t#{sa[1].join(",")}"}
' file 

Prints:

column_name distinct_values
main_pos    e1,e2
first_pos   green
second_pos  round,square

The header values are assumed to be unique in your input file.

@Diego I had checked some possibilities by piping various subcommands of xsv, xan, csvkit, etc. Then I got a long awk snippet that was technically one line and did work; however, it took more than 10-15 seconds on a large dataset.

On the one-liner need: this is not for one-time use or a script. I want to set an alias for the solution and use it like less -S or cat on many files on a daily basis. I also wanted to reach a more efficient solution.

Many thanks to @Barmar, @Shawn, @jhnc and @dawg for the suggested solutions.

The perl-based solution from @jhnc does meet the requirements I wanted help with. It is way quicker than the awk snippet I had (and shorter, too).

@RARE, I can agree that the in/out example does look like homework. Getting it done without resorting to an R/Python/shell script was the difficult part.
