Removing lines from a file in Python

Question

I have a file that looks something like

geneA geneB 134
geneC geneF 395
geneH geneD 958
geneF geneC 395
geneB geneA 134
geneD geneH 958

I would like to remove the lines that contain the same pair of genes in the opposite order, so that I just get

geneA geneB 134
geneC geneF 395
geneH geneD 958

I have this so far, but I get even more duplicates when I try using replace() or an if not statement. Any ideas on how I could change this?

with open(filename, 'r') as handle, open(outfilename, 'a') as w:

    for line in handle:
        element = line.split()
        gene1 = element[0]
        gene2 = element[1]

        for line in handle:
            matchingelement = line.split()
            gene3 = matchingelement[0]
            gene4 = matchingelement[1]

            if gene3 == gene2 and gene4 == gene1:
                """Remove the line"""

By "remove this line" do you mean remove it and write the same file again, or remove it and write the results to a new file?
– Farhan.K, Commented Jun 14, 2016 at 14:53
Will lines with the same genes always have the same number at the end?
– Bernardo Meurer, Commented Jun 14, 2016 at 14:59
I'd like to write whatever is left to a new file. They will always have the same number, but I was trying to avoid using that just in case a connection between two genes happens to have the same value as a connection between two other genes.
– Katie J, Commented Jun 14, 2016 at 15:06

tdelaney · Accepted Answer · 2016-06-14 15:37:19Z

Convert each pair of genes into a hashable form that can be added to a set, and check that set as you go along. In this example, I sort the two gene names so that their order doesn't matter and then join them back into a single "normalized" string.

filename = 'a.txt'
outfilename = 'aout.txt'

seen = set()

# 'w' starts the output file fresh; 'a' would append to the results of any earlier run
with open(filename, 'r') as handle, open(outfilename, 'w') as w:
    for line in handle:
        element = line.split()
        # a hashable "normalized" view of the genes
        genes = '-'.join(sorted(element[0:2]))
        if genes not in seen:
            seen.add(genes)
            w.write(line)

# show the result, closing the file afterwards
with open(outfilename) as result:
    print(result.read())
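
A variation on the same idea, as a minimal sketch: the key here is a tuple of the two sorted gene names rather than a joined string, which avoids any ambiguity if a gene name ever contained the '-' separator. The function name `drop_reversed_pairs` and the in-memory sample lines are illustrative assumptions, not part of the original answer. The number in the third column is deliberately left out of the key, so two unrelated pairs that happen to share a value are both kept.

```python
def drop_reversed_pairs(lines):
    """Yield each line whose unordered gene pair has not been seen before."""
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # sorting makes ('geneB', 'geneA') and ('geneA', 'geneB') the same key
        key = tuple(sorted(fields[:2]))
        if key not in seen:
            seen.add(key)
            yield line


# hypothetical sample data mirroring the file in the question
sample = [
    "geneA geneB 134\n",
    "geneC geneF 395\n",
    "geneH geneD 958\n",
    "geneF geneC 395\n",
    "geneB geneA 134\n",
    "geneD geneH 958\n",
]

kept = list(drop_reversed_pairs(sample))
print("".join(kept), end="")
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly and the kept lines written out with `w.writelines(...)`.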

edited Jun 14, 2016 at 15:37

answered Jun 14, 2016 at 15:14

5 Comments

Vladislav Martin Over a year ago

I can't seem to figure this out for some reason... element will only have 3 elements ['geneX', 'geneY', '123'], right? Why element[0:3]?

tdelaney Over a year ago

@VladislavMartin - you are right that should have been [0:2]. thanks!

Farhan.K Over a year ago

Even then isn't that the same as element? Because element only has 3 items so the [0:2] won't make a difference. I'm probably missing something simple haha

tdelaney Over a year ago

@Farhan.K - slices are open-ended on the right so 0:2 selects elements 0 and 1. As an example, ['geneA', 'geneB', '134'][0:2] gives you ['geneA', 'geneB'].

Farhan.K Over a year ago

@tdelaney I can't believe I forgot that. Knew it was something simple, thanks
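
The slicing behaviour discussed in these comments can be checked in a few lines. This is only a small illustration; the list is the same example tdelaney gives above, and the variable names `pair` and `whole` are made up for the demonstration.

```python
element = ['geneA', 'geneB', '134']

pair = element[0:2]   # stop index 2 is excluded, so this keeps indices 0 and 1
whole = element[0:3]  # stop index 3 takes in all three items, including the number

print(pair)
print(whole)
```

With `[0:3]` the number would become part of the key, which is why the answer was corrected to `[0:2]`.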
