Let's say I have a file DATA with 10,000,000 lines. I have another file IDS with 100,000 strings. I want to extract all lines from DATA that contain one of the strings from IDS. An additional condition is that there is a 1:1 relationship between the files: every ID matches exactly one line of DATA, and every line of DATA contains exactly one ID.
What is the most efficient, least complicated way to do this using standard linux command-line utilities?
My ideas so far:
- Build a huge regex and use grep (easy, but may exceed grep's pattern-size limits)
- Go through IDS line by line, grep DATA for each string separately, and merge the results (easy, but probably very inefficient: 100,000 full scans of DATA)
- Build a hash map of IDS in Python, loop through DATA, extract the ID from each line, and check it against the hash map (a bit harder)
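For the first idea, grep can read the patterns directly from a file, and treating them as fixed strings avoids regex limits entirely. A minimal sketch, assuming IDS holds one ID per line:

```shell
# -F treats each pattern as a literal string (no regex escaping or
# pattern-size limits), -f reads all patterns from the file IDS.
# GNU grep compiles the fixed strings into a single matcher, so this
# is one pass over DATA rather than 100,000 separate scans.
grep -F -f IDS DATA > matched
```

If one ID can be a substring of another, adding -w restricts matches to whole words (grep -Fwf IDS DATA).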
(Suggestions from the comments: see man join; use a real database.)

Update: I'm trying join now (which should be exactly what I need), but I am running into some trouble with sorting my data: stackoverflow.com/questions/15133894/…
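The sorting trouble with join usually comes down to sort and join disagreeing on collation. A sketch under the assumption (not stated in the question) that the ID is the first whitespace-delimited field of each DATA line:

```shell
# join requires both inputs sorted on the join field, and sort and
# join must use the same collation rules; forcing LC_ALL=C for all
# three commands makes them agree (byte order).
LC_ALL=C sort -k1,1 IDS  > IDS.sorted
LC_ALL=C sort -k1,1 DATA > DATA.sorted
LC_ALL=C join IDS.sorted DATA.sorted > matched
```

If the IDs are not the first field, join won't apply directly; the -1/-2 options select a different join field, but the field must still be well-delimited.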