2

I have a file with tab-separated values such as:

"1" "12345" "abc" "def"
"2" "67890" "abc" "ghi"
"3" "13578" "jkl" "mno"

I can't figure out how to take arbitrary numbers from an input file and, whenever the first 5 digits of a number match the value in the second column of the file above, export everything on that line to another file.

Ex: input file: "67890123"

output file: "2"   "67890"   "abc"   "ghi"
5
  • 6
    What's your code so far and where is it not working. Commented Sep 14, 2017 at 19:36
  • 1
    how big is the first file? I would suggest reading each line into a dictionary where the 5 digit number is the key: {"12345": ("1", "12345", "abc", "def"), "67890": ("2", "67890", ...)...} then simply index into the dictionary with the first 5 digits of the input. Commented Sep 14, 2017 at 19:42
  • This is a basic problem that Python and you can handle very easily: read the file line by line; split each line, creating a list; if the condition is true, store the required data in a data structure; when done reading, write the data structure to a file. This example may help: interactivepython.org/runestone/static/thinkcspy/Files/… Commented Sep 14, 2017 at 19:44
  • The input file is 11 MB and contains over 150k lines. I'm not a Python programmer and don't know how to accomplish this. Commented Sep 14, 2017 at 19:44
  • @ShawnSharp, do you want to generate a single output file, or one output file per entry in the input file? Commented Sep 14, 2017 at 19:58
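
The dictionary approach suggested in the comments above could look roughly like the sketch below. It is only an illustration, not code from any of the posters: the helper names `build_index` and `lookup` are made up, the sample data is held in memory rather than read from disk, and it assumes each 5-digit number appears only once in the data file (a repeated key would overwrite the earlier row).

```python
import csv
import io


def build_index(data_lines):
    """Map each 5-digit key (second column) to its full row."""
    index = {}
    for row in csv.reader(data_lines, dialect="excel-tab"):
        if len(row) > 1:  # skip blank or short lines
            index[row[1]] = row
    return index


def lookup(index, numbers):
    """Return the rows whose key equals the first 5 digits of each number."""
    matches = []
    for number in numbers:
        # Drop whitespace and surrounding quotes, then keep 5 digits.
        key = number.strip().strip('"')[:5]
        if key in index:
            matches.append(index[key])
    return matches


# Sample data from the question (hypothetical in-memory stand-in for the file).
data = io.StringIO(
    '"1"\t"12345"\t"abc"\t"def"\n'
    '"2"\t"67890"\t"abc"\t"ghi"\n'
    '"3"\t"13578"\t"jkl"\t"mno"\n'
)
index = build_index(data)
rows = lookup(index, ['"67890123"'])
print(rows)  # [['2', '67890', 'abc', 'ghi']]
```

For a real run, `build_index` can be given an open file object instead of the `io.StringIO`, and the matched rows can be written back out with `csv.writer`; passing `quoting=csv.QUOTE_ALL` should keep the quotes around each field. With about 150k lines, building the dictionary once means each lookup avoids rescanning the whole file.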

4 Answers

2

You can use the pandas package to read and write your data file.

import pandas as pd

inputFileName = "D:/tmp/inputfile.txt"
dataFileName = "D:/tmp/data.csv"
outputFileName = "D:/tmp/outputfile.txt"

# The data file is tab-separated; read every column as text so the
# 5-digit keys keep any leading zeros.
data = pd.read_csv(dataFileName, sep='\t', header=None, dtype=str)

# Keep only the first 5 digits of each non-empty input line.
with open(inputFileName) as f:
    prefixes = [line.strip().strip('"')[:5] for line in f if line.strip()]

# Select the rows whose second column matches one of the prefixes
# (rows come out in data-file order).
output = data[data[1].isin(prefixes)]

output.to_csv(outputFileName, sep='\t', header=False, index=False)

So if your input file has

67890123
13578010

And your data is

"1" "12345" "abc" "def"
"2" "67890" "abc" "ghi"
"3" "13578" "jkl" "mno"

The output file would be:

2 67890 abc ghi
3 13578 jkl mno

1

Try this:

import re
import argparse as ap

p = ap.ArgumentParser()
p.add_argument('-i', '--input', required=True)
args = p.parse_args()

# Open the output once, so earlier matches are not overwritten.
with open('file.txt', 'r') as f, open('output.txt', 'w') as out:
    for value in f:
        fields = re.split(r'\s+', value.strip())
        # Skip blank or short lines; compare the unquoted second column.
        if len(fields) > 1 and fields[1].replace('"', '') == args.input[:5]:
            out.write(value)


1
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', required = True)
args = parser.parse_args()

with open('input.txt') as file:
    # Keep the trailing newline on each line; it is reused when writing.
    entries = file.readlines()

with open('output.txt', 'w') as file:
    for entry in entries:
        components = entry.split('\t')
        # Each field is wrapped in quotes, so the 5 digits sit at
        # positions 1 to 5, hence the slice [1:6].
        if len(components) > 1 and components[1][1:6] == args.input[:5]:
            file.write(entry)

To run this code: > python my_file.py --input='67890'

The code is self-explanatory; let me know if you need more explanation.


1

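E2A: Multiple inputs.

Assuming your data is in a TSV file, a simple comparison against the second column is enough. The simple Python way is:

import csv

inputs = ['67890231', '12345065']
# Only the first 5 digits of each input take part in the match.
prefixes = {value[:5] for value in inputs}

with open("so.tsv") as tsv:
    for line in csv.reader(tsv, dialect="excel-tab"):
        # Column 2 holds the 5-digit number; skip short or blank lines.
        if len(line) > 1 and line[1] in prefixes:
            print(line)

With the sample data from the question, this prints:

['1', '12345', 'abc', 'def']
['2', '67890', 'abc', 'ghi']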
