
I have one file (let's call it enrolled_students.txt) that I need to read in Perl. Each line of this file contains data that requires me to refer to other files for more information.

For example, the main database will have names and addresses. But depending on the nationality of each person, I have to refer to other files (sorted by country) to find the matching name, the nationality and the home address.

Let's say I have 100 name_of_country.txt files and there are 10,000 lines in my enrolled_students.txt. My questions are:

  • Do I read each line in enrolled_students.txt and parse the other 100 files one by one to find a match? That seems like an awful way to process this data. Is there a faster way to do this?
  • Can I execute this process in parallel mode (multithread)?

Thanks, Hans

Comments
  • "That seems like an awful way to process this data." Yep. "Is there a faster way to do this?" Use a database instead of flat text files. Commented Jan 14, 2015 at 22:52
  • The raw data is available as txt files only, and the requirement is to use Perl only. Commented Jan 14, 2015 at 22:56
  • Take a look at DBD::SQLite. It is self-contained (you don't have to install a separate database server) and will probably be much faster than anything you hack together yourself. Of course, you would have to load the initial set of data into your database first... do these text files change often? Commented Jan 14, 2015 at 23:13
  • If your files are in CSV format, you can use DBD::CSV. Otherwise, do the same thing and import the files into SQLite tables, or at the very least, CSV files, prior to processing, as others have suggested. Commented Jan 15, 2015 at 0:42
  • Show some sample lines from the various files; it's not perfectly clear what you are asking. Also, you talk about reading the files; what do you want done with the information? If it is output to a file, show a sample of that too. Commented Jan 15, 2015 at 2:11

2 Answers


What you are trying to do here is similar to what a database engine has to do when joining data from two tables together. A database engine will typically have a number of different join plans to choose from, and it will attempt to choose the best one based on what it knows about the data in each table.

The same applies to you. There are several ways to join the data and the best way will depend on factors such as the size of each of the input files, whether they are pre-sorted, etc.

Some possible approaches:

  1. A 'Nested Loop', where you read each line of the enrolled_students.txt file and, for each of those, iterate through the other file(s) to find a match. This is not likely to be very fast; you would probably only choose it if the files were too large to make any other solution practical.

  2. A 'Hash Join', where you read one half of the data to be joined (in your example, probably the name_of_country.txt files) into a data structure indexed by a hash. Then, for each row of the other file, you look up the corresponding row in the hash. This can give quite high performance, as long as there is enough memory to hold at least one of the two data sets at once (a minimal Perl sketch of this approach follows the list).

  3. If both files are in some sorted order, sorted according to the same key, you might be able to use a 'Merge Join'. This is where you read rows from both files at once, matching the records together like teeth in a zipper.
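
Option 2 is often the easiest win in Perl, since a hash lookup per student line is cheap. Below is a minimal sketch of that approach. It assumes whitespace-separated fields with the student's name in the first column, and a hypothetical countries/ directory holding the per-country files; adjust the glob pattern and the split to match your real layout.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Build one lookup hash, keyed by name, from every country file.
    # The 'countries/*.txt' pattern and the field layout are assumptions.
    my %country_info;
    for my $file (glob 'countries/*.txt') {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            my ($name, @rest) = split /\s+/, $line;
            $country_info{$name} = \@rest;
        }
        close $fh;
    }

    # Stream through the main file once, looking each student up in the hash.
    open my $students, '<', 'enrolled_students.txt'
        or die "Cannot open enrolled_students.txt: $!";
    while (my $line = <$students>) {
        chomp $line;
        my ($name, @fields) = split /\s+/, $line;
        if (my $extra = $country_info{$name}) {
            print join("\t", $name, @fields, @$extra), "\n";
        }
        else {
            warn "No country record found for $name\n";
        }
    }
    close $students;

Because the hash is built from every country file up front, each of the 100 files and each of the 10,000 student lines is read exactly once; this covers the multi-file case mentioned below, provided the lookup data fits in memory.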

The above assumes a simple case with two data files that have to be joined. Your question talks about 100 different name_of_country.txt files, which might complicate matters.

In regard to your second question (can you use parallel processing?): that would probably only be useful if the processing were CPU-bound. The complexity of producing a forked or threaded solution is probably not warranted unless you find that it actually is.
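
If you do find that the work is CPU-bound, a forked version does not have to be complicated. Here is a rough sketch using the CPAN module Parallel::ForkManager, assuming the student lines can be processed independently and that each child writes its own hypothetical matched.partN file, to be concatenated afterwards; the per-line lookup itself would be the same as in the hash-join sketch above.

    use strict;
    use warnings;
    use List::Util qw(min);
    use Parallel::ForkManager;

    my $workers = 4;    # number of worker processes; tune to your core count

    # Read all student lines up front so they can be split into chunks.
    open my $fh, '<', 'enrolled_students.txt' or die "Cannot open: $!";
    my @students = <$fh>;
    close $fh;

    my $per_chunk = int(@students / $workers) + 1;
    my $pm = Parallel::ForkManager->new($workers);

    for my $i (0 .. $workers - 1) {
        $pm->start and next;    # parent: fork a child, move on to the next chunk

        my $start = $i * $per_chunk;
        my $end   = min($#students, $start + $per_chunk - 1);

        # Each child writes its own output file; merge the parts when done.
        open my $out, '>', "matched.part$i" or die "Cannot write part $i: $!";
        for my $line (@students[$start .. $end]) {
            # ... per-line lookup as in the hash-join sketch, then:
            print $out $line;
        }
        close $out;

        $pm->finish;    # child exits here
    }
    $pm->wait_all_children;

Note that each child is a separate process, so the lookup hash must be built before forking (children inherit a copy) and results cannot be passed back in memory, which is why this sketch writes per-chunk files.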

Finally - if you are doing multiple analysis runs over the same data, it might be advisable to import the data into a real database and use that to run queries. That would save you a lot of coding work.
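
As one of the comments above suggests, DBD::SQLite gives you a "real database" without a separate server: it is a single file on disk. Here is a minimal, hypothetical loading sketch; the table layout, tab-separated columns and students.db file name are assumptions, and the country files would be loaded into a second table the same way so that the lookup becomes an ordinary SQL join.

    use strict;
    use warnings;
    use DBI;    # requires the DBD::SQLite driver from CPAN

    # Create (or open) a local SQLite database file; no server needed.
    my $dbh = DBI->connect('dbi:SQLite:dbname=students.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    $dbh->do('CREATE TABLE IF NOT EXISTS students (name TEXT, address TEXT)');

    # Hypothetical loader: assumes tab-separated name/address columns.
    my $insert = $dbh->prepare('INSERT INTO students (name, address) VALUES (?, ?)');
    open my $fh, '<', 'enrolled_students.txt' or die "Cannot open: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($name, $address) = split /\t/, $line;
        $insert->execute($name, $address);
    }
    close $fh;
    $dbh->commit;

Loading inside a single transaction (AutoCommit => 0 plus one commit at the end) keeps the import fast even for tens of thousands of rows.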




I will treat your question as: how to efficiently perform a "join" operation on two files. Here is the answer.

Actually there is a join command in Unix. http://linux.die.net/man/1/join

Suppose you have two files, student and student_with_country:

student: [name] [age] [...]
student_with_country: [name] [country] [...]

you can do:

join student student_with_country (by default, it joins on the first field; note that join expects both files to be sorted on that field)

Then the question is: how do you make it faster by using multiple cores?

The answer is the GNU parallel command. Basically, you can run a simple map-reduce style job with it. For example, in this case:

cat student_with_country | parallel --block 10M --pipe join student - 

It will divide the student_with_country input into 10 MB blocks and run the join command on each block in parallel. In this way, you can use the power of multiple cores.

2 Comments

Thanks for the suggestion. This will probably work best if you have only two files and they have the same number of lines. The problem is that the file student_with_country has a different format from the student file and also has additional garbage that I don't need.
join assumes sorted files.
