
Let's say I have a file DATA with 10,000,000 lines, and another file IDS with 100,000 strings. I want to extract all lines from DATA that contain one of the strings from IDS. An additional condition is that there is a 1:1 relationship between the files: every ID matches exactly one line of DATA, and every line of DATA contains exactly one ID.

What is the most efficient, least complicated way to do this using standard linux command-line utilities?

My ideas so far:

  1. Build a huge regex and use grep (easy, but may exceed some limit within grep)
  2. Go through IDS line by line and grep DATA for each string separately, then merge the results (easy, probably very inefficient)
  3. Build a hashmap of IDS in Python, loop through DATA, extract the ID from each line and check it against the hashmap (a bit harder)
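
For idea 3, a minimal sketch of the same hashmap (set-lookup) approach, written here with awk so everything stays within standard command-line utilities; a Python version would follow the same two-pass structure. It assumes the ID appears as the first whitespace-separated field of each DATA line, and the output file name results is just a placeholder; neither is specified in the question:

# pass 1 (NR==FNR): load every line of IDS as a key of an associative array
# pass 2: print each DATA line whose first field is one of those keys
awk 'NR==FNR { ids[$0]; next } $1 in ids' IDS DATA > results

This reads each file only once and keeps only the 100,000 IDs in memory.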
  • 4. man join; 5. Use a real database. Commented Feb 27, 2013 at 18:55
  • I am trying to use join now (which should be exactly what I need), but I am running into some trouble with sorting my data: stackoverflow.com/questions/15133894/… Commented Feb 28, 2013 at 11:06
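
Picking up the join suggestion from the comments, a minimal sketch, again assuming the ID is the first whitespace-separated field of DATA and that IDS holds one ID per line. join requires both inputs to be sorted on the join field in the same order, which is the usual source of sorting trouble:

sort -k1,1 IDS  > IDS.sorted      # sort the IDs lexicographically
sort -k1,1 DATA > DATA.sorted     # sort DATA on its first field, same order
join -1 1 -2 1 IDS.sorted DATA.sorted > results

If join still complains that the input is not sorted, running both sort and join with LC_ALL=C keeps the collation order consistent between the two tools.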

2 Answers

grep -F -f IDS DATA

Don't miss -F: it prevents grep from interpreting the lines of IDS as regular expressions, and it enables the much more efficient Aho-Corasick matching algorithm.
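
One hedged addition: if an ID could occur as a substring of another ID or of unrelated text, -w restricts matches to whole words and combines with -F and -f (this assumes the IDs are delimited by non-word characters in DATA):

grep -F -w -f IDS DATA > results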




If IDS contains the exact strings you need to find in DATA, one string per line, try using

grep --file=IDS DATA > results

