
I have a file that looks something like

geneA geneB 134
geneC geneF 395
geneH geneD 958
geneF geneC 395
geneB geneA 134
geneD geneH 958

I would like to remove the lines that have the same genes (that are in opposite order) so I just get

geneA geneB 134
geneC geneF 395
geneH geneD 958    

I have this so far, but I get even more duplicates when I try using replace() or an if not statement. Any ideas on how I could change this?

with open(filename, 'r') as handle, open(outfilename, 'a') as w:

    for line in handle:
        element = line.split()
        gene1 = element[0]
        gene2 = element[1]

        for line in handle:
            matchingelement = line.split()
            gene3 = matchingelement[0]
            gene4 = matchingelement[1]

            if gene3 == gene2 and gene4 == gene1:
                """Remove the line"""
  • By "remove this line" do you mean remove it and write the same file again or remove it and write the results in a new file? Commented Jun 14, 2016 at 14:53
  • Will lines with the same genes always have the same number in the end? Commented Jun 14, 2016 at 14:59
  • I'd like to write whatever is left to a new file. They will always have the same number, but I was trying to avoid using that just in case a connection between two genes happens to have the same value as a connection between two other genes. Commented Jun 14, 2016 at 15:06

1 Answer

Convert each gene pair into a hashable form that can be added to a set, and check that set as you go along. In this example, I sort the two genes so that their order doesn't matter and then join them back into a single "normalized" string.

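filename = 'a.txt'
outfilename = 'aout.txt'

# normalized gene pairs that have already been written
seen = set()

# 'w' starts the output fresh, so a rerun does not append duplicate lines
with open(filename, 'r') as handle, open(outfilename, 'w') as w:
    for line in handle:
        element = line.split()
        # a hashable "normalized" view of the genes
        genes = '-'.join(sorted(element[0:2]))
        if genes not in seen:
            seen.add(genes)
            w.write(line)

# show the result
with open(outfilename) as result:
    print(result.read())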
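A variation on the same idea, as a minimal sketch: if gene names could ever contain the '-' separator, two different pairs might join to the same string, so a tuple of the sorted names can serve as the set key instead. The helper name unique_pairs and the choice to skip lines with fewer than two fields are assumptions for illustration, not part of the question.

```python
def unique_pairs(lines):
    """Yield each line whose unordered gene pair has not been seen before."""
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption: skip blank or malformed lines
        # a sorted tuple is hashable and ignores the order of the two genes
        key = tuple(sorted(fields[:2]))
        if key not in seen:
            seen.add(key)
            yield line


sample = [
    "geneA geneB 134\n",
    "geneC geneF 395\n",
    "geneH geneD 958\n",
    "geneF geneC 395\n",
    "geneB geneA 134\n",
    "geneD geneH 958\n",
]
result = list(unique_pairs(sample))
print("".join(result), end="")
```

Since an open file is iterable line by line, the same generator could be fed the input handle directly, with each yielded line passed to w.write().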

5 Comments

I can't seem to figure this out for some reason... element will only have 3 elements ['geneX', 'geneY', '123'], right? Why element[0:3]?
@VladislavMartin - you are right that should have been [0:2]. thanks!
Even then isn't that the same as element? Because element only has 3 items so the [0:2] won't make a difference. I'm probably missing something simple haha
@Farhan.K - slices are open-ended on the right so 0:2 selects elements 0 and 1. As an example, ['geneA', 'geneB', '134'][0:2] gives you ['geneA', 'geneB'].
@tdelaney I can't believe I forgot that. Knew it was something simple, thanks
