I have 2 files.

File 1 has 400k numerical records, e.g.:

1
2
3
4
5
6
..
and so on

File 2 has 420k numerical records, e.g.:

1
2
3
4
6 
..
and so on

Both files are unsorted. I want to compare the two files and print the differences.

When I try using diff, comm, or grep, it takes a long time (more than an hour). This is not feasible for me.

How can I do this faster (matching and printing the differences)?

I use HP-UX.

  • 3
    What kind of output are you expecting? 1 hour to compare files a couple of megabytes large. Is your machine from the 80s? What exactly did you try? Commented Feb 9, 2013 at 8:16
  • 1
    @StephaneChazelas: if he is using HP-UX then probably his hardware is from the 1980s or tops 1990s. Commented Feb 9, 2013 at 9:19
  • ... In which case, unless you are connecting from a real DEC VT-100 terminal or a ZX-81, copying the files to your local workstation and doing the comparison there might be a good workaround. Commented Feb 9, 2013 at 9:21
  • What differences do you need? A line-by-line comparison, or just whether there are new values? Commented Feb 9, 2013 at 14:53

1 Answer

On a 10 million line file, generated with:

seq 10000000 |
  tee a |
  awk 'rand() < 0.05 {print int(1000000 * rand())}; 1' > b

all of:

diff a b | wc -l

comm -3 <(sort a) <(sort b) | wc -l

(the <(...) process substitution is ksh/bash/zsh syntax)

cmp -l a b | wc -l

took under 30 seconds on a 3-year-old low-end PC (running Linux).

The run time of diff can vary widely with the content: its algorithm has to detect insertions, deletions, and changes, so it is affected by how the data is laid out. The other two commands should vary much less.
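If the shell on your system does not support process substitution, the sort-then-comm comparison can probably be written with explicit intermediate files instead. The following is only a sketch: the names file1, file2, only_in_file1 and only_in_file2 are made up for illustration, and the tiny sample data stands in for the real files.

```shell
# Hypothetical sample data; replace file1 and file2 with the real files.
printf '%s\n' 3 1 2 5 4 6 > file1
printf '%s\n' 6 4 1 3 2 7 > file2

# Sort each file once into an explicit intermediate file.
# Plain sort (not sort -n) is used on purpose: comm expects its
# inputs in the same collation order that sort produces by default.
sort file1 > file1.sorted
sort file2 > file2.sorted

# -23 keeps only the lines unique to the first file,
# -13 keeps only the lines unique to the second file.
comm -23 file1.sorted file2.sorted > only_in_file1
comm -13 file1.sorted file2.sorted > only_in_file2

# Clean up the intermediate files.
rm -f file1.sorted file2.sorted
```

With the sample data, only_in_file1 ends up containing 5 and only_in_file2 ends up containing 7. Splitting the result into two files is a design choice; comm -3 on the two sorted files would print both sets in one two-column listing instead.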

What exactly did you try?

  • 2
    "How much memory" is also a reasonable question. If you only have 128M of RAM, swapping (especially to old, slow disks) will dominate completely over the "real work". Switching to explicit temporary files could be a huge win in that scenario. Commented Feb 9, 2013 at 11:36
  • @tripleee comm would need to hold a single line from each file in memory. Sorting is done out-of-core by default, if needed. Commented Feb 28 at 16:29
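A different technique that is not part of the answer above: if one of the files fits comfortably in memory (a few hundred thousand short lines normally would, but that is an assumption about your machine), an awk lookup table avoids sorting altogether. This is a minimal sketch assuming a POSIX-style awk; file1, file2 and the output names are hypothetical.

```shell
# Hypothetical sample data; replace file1 and file2 with the real files.
printf '%s\n' 3 1 2 5 4 6 > file1
printf '%s\n' 6 4 1 3 2 7 > file2

# While reading the first file named on the command line (NR == FNR),
# remember each line in the array "seen". While reading the second
# file, print only the lines that were never seen in the first.
awk 'NR == FNR { seen[$0] = 1; next } !($0 in seen)' file1 file2 > only_in_file2

# Swap the arguments to get the lines that exist only in file1.
awk 'NR == FNR { seen[$0] = 1; next } !($0 in seen)' file2 file1 > only_in_file1
```

With the sample data, only_in_file2 contains 7 and only_in_file1 contains 5. Two caveats: the whole first file is held in memory, so this is the wrong choice on a machine that is already swapping, and the NR == FNR test misbehaves if the first file is empty.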
