
I have one file of emails (4.6m lines) and another file of emails (100m lines).

I want to see how many of these 4.6m lines occur in the file which has 100m lines.

I've researched this already and tried the following commands, both to no avail:
grep -f file1 file2 > output.txt
grep -wFf file1 file2 > output.txt

I'm using Cygwin for this. Both of the above commands run without any error message, but after some amount of time they finish and nothing has been written to output.txt.

  • By "emails" you mean email addresses? And your goal is to find out which occur in both files? Can there be duplicates in one file or is each line unique? Commented Feb 14, 2019 at 4:54

1 Answer

comm -12 <(sort file1) <(sort file2) | wc -l

Explanation

  • comm -12 foo bar: this prints only the lines that occur in both foo and bar (-1 and -2 suppress the lines unique to each file). comm requires its input files to be sorted, hence,
  • <(sort file1) and <(sort file2) will sort each file before sending it to comm.
  • | wc -l: after printing the matching lines, pipe them into wc, which will count the number of lines.
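
To check that the approach behaves as described, here is a minimal sketch run against two small dummy files. The scratch directory and file names are hypothetical, and the sorted copies are written to intermediate files rather than using <(...) so that the sketch also runs in a plain POSIX shell; the logic is otherwise the same as the one-liner above.

```shell
# Hypothetical scratch files, used only for this check.
tmp=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmp/file1"
printf 'b\nc\nd\n' > "$tmp/file2"

# comm requires sorted input, so sort each file first.
sort "$tmp/file1" > "$tmp/file1.sorted"
sort "$tmp/file2" > "$tmp/file2.sorted"

# -12 suppresses columns 1 and 2, leaving only the lines common to both files.
comm -12 "$tmp/file1.sorted" "$tmp/file2.sorted" > "$tmp/common"

# Count the common lines; tr strips the padding some wc implementations add.
count=$(wc -l < "$tmp/common" | tr -d ' ')
common=$(cat "$tmp/common")
echo "$count"   # prints 2 (the lines b and c)

rm -rf "$tmp"
```

If this prints 2 on your system, the command itself works and a result of 0 on the real files points to differences in the data.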

Caveat

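This looks for lines that match exactly. Things like inconsistent line breaks (for example, Windows CRLF line endings in one file and Unix LF line endings in the other) or trailing whitespace will prevent the lines from matching.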
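
As an illustration of the caveat, the sketch below assumes one file was saved with Windows (CRLF) line endings and the other with Unix (LF) line endings. That is only one possible cause, not a confirmed diagnosis, and the file names and addresses are made up. It strips carriage returns with tr before sorting; LC_ALL=C is an optional extra that makes sort and comm use the same plain byte ordering.

```shell
# Hypothetical sample data: file1 uses CRLF line endings, file2 uses LF.
tmp=$(mktemp -d)
printf 'a@example.com\r\nb@example.com\r\n' > "$tmp/file1"
printf 'b@example.com\nc@example.com\n' > "$tmp/file2"

# Without normalisation, the trailing carriage returns prevent any match.
LC_ALL=C sort "$tmp/file1" > "$tmp/raw1"
LC_ALL=C sort "$tmp/file2" > "$tmp/raw2"
raw=$(LC_ALL=C comm -12 "$tmp/raw1" "$tmp/raw2" | wc -l | tr -d ' ')

# Delete carriage returns from both files before sorting and comparing.
tr -d '\r' < "$tmp/file1" | LC_ALL=C sort > "$tmp/clean1"
tr -d '\r' < "$tmp/file2" | LC_ALL=C sort > "$tmp/clean2"
fixed=$(LC_ALL=C comm -12 "$tmp/clean1" "$tmp/clean2" | wc -l | tr -d ' ')

echo "$raw $fixed"   # prints "0 1"

rm -rf "$tmp"
```

Other invisible differences, such as trailing spaces or a byte order mark, would need their own clean-up step.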

  • It ran for quite a while, and the output file contains 2 lines, as shown here: i.imgur.com/i0eYpxu.png Command used: comm -12 <(sort file1.txt) <(sort file2.txt) | wc -l > youwantme.txt Commented Feb 14, 2019 at 2:39
  • Upvoted for explaining the command you use. I wish more people would do that! Commented Feb 14, 2019 at 2:56
  • @StackStackAndStack That file contains the number of common lines, which is 0! You can verify that the command works on your system by creating two dummy files with some matching lines, and seeing if the output of the command is correct. If you are getting 0 for your real files, then (as per my caveat) there are likely minor differences between the supposedly matching lines. You'll have to post an example of lines that you expect to match so I can troubleshoot further. Please edit your question if you do so. Commented Feb 14, 2019 at 3:06
  • @Sparhawk The issue is that there are thousands of common lines (guaranteed). I'm sorry if I'm being useless; these are just separate files, each containing emails and ONLY emails. Commented Feb 14, 2019 at 4:05
  • @StackStackAndStack I don't disbelieve you! So either the script doesn't work (you need to test it with, say, a\nb\nc and b\nc\nd), and/or there are minor (potentially invisible) differences between the files (you need to post example lines from the files that you expect to match). Commented Feb 14, 2019 at 4:54
