extract all lines from text file based on a given list of IDs

Question

I have 2 text files. file1 contains a list of IDs:

11002
10995
48981
79600

file2:

10993   item    0
11002   item    6
10995   item    7
79600   item    7
439481  item    5
272557  item    7
224325  item    7
84156   item    6
572546  item    7
693661  item    7
.....

I am trying to select all lines from file2 where the ID (first column) is in file1. Currently, I loop through the first file to build a regex like:

^\b11002\b\|^\b10995\b\|^\b48981\b\|^\b79600\b

Then run:

grep '^11002\|^10995\|^48981\|^79600' file2.txt

But when the number of IDs in file1 is too large (~2000), the regular expression becomes quite long and grep becomes slow. Is there another way? I am using Perl + Awk + Unix.

I see plenty of answers already, but you might find the Perl code here adaptable, if you want: stackoverflow.com/questions/13713032/… You would need to add something to filter on the first column of file2, though.
– Jarmund, Commented Dec 5, 2012 at 21:06

Alex Reynolds · Accepted Answer · 2012-12-05 22:48:59Z

6

Use a hash table. It can be memory-intensive, but lookups take constant time on average. The script below is one efficient and correct way to do this (not the only one): it builds a hash table with the IDs in file1 as keys, then looks up the first column of each line of file2. If the key is in the hash table, the line is printed to standard output:

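#!/usr/bin/env perl

use strict;
use warnings;

# Load every ID from file1 into a hash; only the keys matter.
open my $ids_fh, '<', 'file1' or die "could not open file1: $!\n";
my %wanted;
while (my $id = <$ids_fh>) {
    chomp $id;
    $wanted{$id} = 1;
}
close $ids_fh;

# Print each line of file2 whose first field is a known ID.
# Assumes the columns of file2 are tab-separated.
open my $data_fh, '<', 'file2' or die "could not open file2: $!\n";
while (my $line = <$data_fh>) {
    chomp $line;
    my ($testKey) = split /\t/, $line;
    if (defined $testKey && exists $wanted{$testKey}) {
        print "$line\n";
    }
}
close $data_fh;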

There are lots of ways to do the same thing in Perl. That said, I value clarity and explicitness over fancy obscurity, because you never know when you will have to come back to a Perl script and make changes, and they are hard enough to manage as it is. One person's opinion.

edited Dec 5, 2012 at 22:48

answered Dec 5, 2012 at 21:03

Alex Reynolds

1 Comment

Josh Y. Over a year ago

TMTOWTDI: my %id; chomp(my @keys = do { open my $fh, '<', 'file1'; <$fh> }); @id{@keys} = (); … if (exists $id{$testKey}) { … }

Ed Morton · Accepted Answer · 2012-12-05 21:39:50Z

4

awk 'NR==FNR{tgts[$1]; next} $1 in tgts' file1 file2

Look:

$ cat file1
11002
10995
48981
79600
$ cat file2
10993   item    0
11002   item    6
10995   item    7
79600   item    7
439481  item    5
272557  item    7
224325  item    7
84156   item    6
572546  item    7
693661  item    7
$ awk 'NR==FNR{tgts[$1]; next} $1 in tgts' file1 file2
11002   item    6
10995   item    7
79600   item    7
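
For readers new to the idiom, the one-liner can be spelled out with comments. This is only an annotated restatement of the same command; the scratch directory, the tab-separated sample data, and the variable name `matches` are illustrative assumptions, not part of the answer.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
printf '%s\n' 11002 10995 48981 79600 > "$workdir/file1"
printf '%s\titem\t%s\n' 10993 0 11002 6 10995 7 79600 7 439481 5 > "$workdir/file2"

matches=$(awk '
    # NR counts lines across all inputs, FNR restarts for each file,
    # so NR == FNR is true only while the first file (file1) is read.
    NR == FNR { tgts[$1]; next }   # referencing tgts[$1] creates the key
    # For file2: a bare condition prints the line whenever it is true.
    $1 in tgts
' "$workdir/file1" "$workdir/file2")

printf '%s\n' "$matches"
rm -rf "$workdir"
```

Because the test is an exact comparison on the first field, an ID such as 10 cannot match a line whose first field is 100.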

edited Dec 5, 2012 at 21:39

answered Dec 5, 2012 at 21:13

Ed Morton

5 Comments

jjennifer Over a year ago

it only prints the last line. which is 79600 item 7

Ed Morton Over a year ago

Then your file1 only contains 79600 as a key, or your files are corrupt, or you made a mistake copy/pasting my script, or you're using an old, broken awk. What does awk --version tell you? I updated my answer to show the script working.

Chris Seymour Over a year ago

There is no problem with this.

Chris Seymour Over a year ago

I tested most of the answers here on Cygwin (Windows) and on Linux and found the results varied (only displaying 79600 item 7 on one platform) so be wary. I think grep -f f1 f2 is the most elegant solution here @EdMorton @jjennifer

Ed Morton Over a year ago

@sudo_O but won't the grep find false matches if the numbers from f1 appear in undesirable locations in f2, e.g. if f1 contains 10 and f2 contains 100 in field 1, or it contains 10 in some other field?
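
The false-match concern can be reproduced with a small made-up data set. The file contents below are hypothetical, chosen only to trigger the problem, and the variable names `loose` and `exact` are illustrative.

```shell
workdir=$(mktemp -d)
printf '%s\n' 10 > "$workdir/f1"
printf '%s\titem\t%s\n' 100 1 10 2 55 10 > "$workdir/f2"

# Plain grep -f treats each ID as an unanchored pattern, so "10" also
# matches "100" in column 1 and the "10" in column 3.
loose=$(grep -f "$workdir/f1" "$workdir/f2")

# Comparing the first field exactly avoids both kinds of false match.
exact=$(awk 'NR == FNR { ids[$1]; next } $1 in ids' "$workdir/f1" "$workdir/f2")

printf 'grep -f:\n%s\n\nawk:\n%s\n' "$loose" "$exact"
rm -rf "$workdir"
```

With this input, grep -f reports all three lines, while the field comparison keeps only the line whose first column is exactly 10.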

Arun Taylor · Accepted Answer · 2012-12-05 21:38:17Z

3

I would suggest using a tool designed to do just that. Use the join command. Do 'man join' for more info.

linux_prompt> join file1 file2
11002 item 6
10995 item 7
79600 item 7

answered Dec 5, 2012 at 21:38

Arun Taylor

2 Comments

Chris Seymour Over a year ago

However, join requires both files to be sorted on the join field.

glenn jackman Over a year ago

If it's OK for the OP that the results are in sorted order: join <(sort file1) <(sort file2)
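
A sketch of the sort-then-join approach that avoids process substitution, so it should also run in a plain POSIX sh. The scratch directory, the `.sorted` file names, and the tab-separated sample data are assumptions made for illustration.

```shell
workdir=$(mktemp -d)
printf '%s\n' 11002 10995 48981 79600 > "$workdir/file1"
printf '%s\titem\t%s\n' 10993 0 11002 6 10995 7 79600 7 439481 5 > "$workdir/file2"

# join expects both inputs sorted on the join field in the same collation,
# so sort and join are all run with LC_ALL=C here.
LC_ALL=C sort "$workdir/file1" > "$workdir/file1.sorted"
LC_ALL=C sort -k1,1 "$workdir/file2" > "$workdir/file2.sorted"
joined=$(LC_ALL=C join "$workdir/file1.sorted" "$workdir/file2.sorted")

printf '%s\n' "$joined"
rm -rf "$workdir"
```

Note that the result comes out in sorted (lexical) order rather than in the original order of file2, and that join separates output fields with a single space by default.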

Chris Seymour · Accepted Answer · 2012-12-05 21:56:58Z

2

Using grep:

$ grep -f f1 f2
11002   item    6
10995   item    7
79600   item    7

Note: I tested a lot of the suggested answers on multiple systems, and some only display the last match, 79600 item 7!?

edited Dec 5, 2012 at 21:56

answered Dec 5, 2012 at 20:59

Chris Seymour

2 Comments

Mel Nicholson Over a year ago

This will get a false positive if the ID appears in the wrong column.

Chris Seymour Over a year ago

@JoshY. On Cygwin, grep -f f1 f2 only displays the last match; tested on Linux and it's fine.
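
One plausible explanation for the "only the last match" reports, offered as an assumption since the thread never confirms it, is that f1 was saved with Windows (CRLF) line endings and no newline after the final ID. The sketch below simulates that situation on a Unix-like system; the file contents and the variable names `with_cr` and `without_cr` are hypothetical.

```shell
workdir=$(mktemp -d)
# f1 with CRLF line endings and no newline after the last ID.
printf '11002\r\n10995\r\n48981\r\n79600' > "$workdir/f1"
printf '%s\titem\t%s\n' 10993 0 11002 6 10995 7 79600 7 > "$workdir/f2"

# Every pattern except the last carries a trailing carriage return,
# which never occurs in f2, so only the last ID can match.
with_cr=$(grep -f "$workdir/f1" "$workdir/f2")

# Stripping carriage returns from the ID list restores the other matches.
tr -d '\r' < "$workdir/f1" > "$workdir/f1.unix"
without_cr=$(grep -f "$workdir/f1.unix" "$workdir/f2")

printf 'with CR:\n%s\n\nwithout CR:\n%s\n' "$with_cr" "$without_cr"
rm -rf "$workdir"
```

If this is the cause, converting the ID file once (for example with tr as above) should make the answers behave the same on both platforms.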

Mel Nicholson · Accepted Answer · 2012-12-05 21:04:53Z

1

Load all the elements of your first file into a hash. For each line of the second file, extract the number using the regex ^(\d+). If the hash contains the extracted number, print the line.
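
The procedure described above could be sketched as follows. The answer names no language, so using awk (whose arrays are hashes) is a substitution made for illustration; the sample data and the variable name `matches` are assumptions as well.

```shell
workdir=$(mktemp -d)
printf '%s\n' 11002 10995 48981 79600 > "$workdir/file1"
printf '%s\titem\t%s\n' 10993 0 11002 6 110025 9 10995 7 79600 7 > "$workdir/file2"

matches=$(awk '
    # First file: remember every ID as a key of the ids array.
    NR == FNR { ids[$0]; next }
    # Second file: take the leading run of digits from the line and
    # print the line only when that number is a known ID.
    match($0, /^[0-9]+/) && (substr($0, RSTART, RLENGTH) in ids)
' "$workdir/file1" "$workdir/file2")

printf '%s\n' "$matches"
rm -rf "$workdir"
```

The line starting with 110025 is included in the sample to show that a longer number sharing a prefix with an ID is not reported.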

answered Dec 5, 2012 at 21:04

Mel Nicholson

glenn jackman · Accepted Answer · 2012-12-05 22:07:38Z

0

Use a process substitution to transform the IDs in file1 into regular expressions:

grep -f <(sed 's/.*/^&\\b/' file1) file2

I'm assuming you're using bash or a similarly capable shell.
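
To make the transformation visible, the sketch below prints the patterns that the sed command generates. It then tries a variant that is an assumption on my part, not from the answer: since \b is a GNU extension, the variant requires a whitespace character after the ID instead, and it writes the patterns to a file so that no process substitution is needed. The sample data and variable names are hypothetical.

```shell
workdir=$(mktemp -d)
printf '%s\n' 11002 10995 > "$workdir/file1"
printf '%s\titem\t%s\n' 11002 6 110025 9 10995 7 7 11002 > "$workdir/file2"

# The sed command from the answer turns each ID into an anchored pattern.
patterns=$(sed 's/.*/^&\\b/' "$workdir/file1")
printf '%s\n' "$patterns"

# Variant (not from the answer): anchor at line start and require
# whitespace right after the ID, using only POSIX syntax.
sed 's/.*/^&[[:space:]]/' "$workdir/file1" > "$workdir/patterns"
matches=$(grep -f "$workdir/patterns" "$workdir/file2")
printf '%s\n' "$matches"
rm -rf "$workdir"
```

The variant would miss a line that consists of the ID alone with nothing after it, so it only suits data where the ID is always followed by another column.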

answered Dec 5, 2012 at 22:07

glenn jackman

TLP · Accepted Answer · 2012-12-05 22:50:20Z

0

A simple Perl solution is to use a hash and count the number of occurrences of the sought-after numbers.

perl -lanwe 'print if $a{$F[0]}++ == 1;' file1.txt file2.txt

I get the following output from your sample data:

11002   item    6
10995   item    7
79600   item    7

Note that you need to use the files in the correct order on the command line.

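This will open and read the files named on the command line, line by line (-n), autosplit each line into @F (-a), and print a line if the value in the hash for its first field is equal to 1 at that point, that is, if the number has been seen exactly once before (-l takes care of the line endings). If you want to print multiple values from file2, simply change == 1 to >= 1.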

Note that the ++ operator is applied after the equality comparison is done.
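
One property of the counting approach is worth checking against your data: the count does not distinguish between the two files, so an ID that is absent from file1 but appears twice in file2 reaches a count of 1 as well. The sketch below restates the counting logic in awk (a substitution for illustration; the answer uses Perl) and compares it with an explicit membership test. The sample data and variable names are hypothetical.

```shell
workdir=$(mktemp -d)
printf '%s\n' 11002 10995 > "$workdir/file1"
printf '%s\titem\t%s\n' 11002 6 55555 1 10995 7 55555 2 > "$workdir/file2"

# seen[$1]++ == 1 is true the second time an ID is encountered overall,
# regardless of which file the first encounter came from.
counted=$(awk 'seen[$1]++ == 1' "$workdir/file1" "$workdir/file2")

# Membership test for comparison: only IDs listed in file1 can match.
member=$(awk 'NR == FNR { ids[$1]; next } $1 in ids' "$workdir/file1" "$workdir/file2")

printf 'counting:\n%s\n\nmembership:\n%s\n' "$counted" "$member"
rm -rf "$workdir"
```

Here the counting version also prints the second 55555 line, so it is safest when IDs are unique within each file.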

answered Dec 5, 2012 at 22:50

TLP
