
I have an input file (input.txt) which contains some data that follows a standard format similar to the following lines:

<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Politische Inklusion"@de .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Political inclusion"@en .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Radiologische Kampfmittel"@de . 

I want to extract the English strings that lie between the " "@en into outputfile-en.txt, and the German strings that lie between the " "@de into outputfile-de.txt.

In this example outputfile-en.txt should contain:

Political inclusion 

and outputfile-de.txt should contain:

Politische Inklusion
Radiologische Kampfmittel 

Which regex is suitable here?

2 Answers

3

With such a simple pattern there's no need for regex at all, especially not to re-iterate over the same data to pick up different languages - you can stream parse and write your results on the fly:

with open("input.txt", "r") as f:  # open the input file
    file_handles = {}  # a map of our individual output file handles
    for line in f:  # read it line by line
        rindex = line.rfind("@")  # find the last `@` character
        if rindex != -1:  # char found, consider the line...
            language = line[rindex+1:rindex+3]  # grab the following two characters as language
            lindex = line.rfind("\"", 0, rindex-1)  # find the opening quote (the closing one sits at rindex-1)
            if lindex != -1:  # found, we have a match
                if language not in file_handles:  # add a file handle for this language:
                    file_handles[language] = open("outputfile-{}.txt".format(language), "w")
                # write the found slice between `lindex` and `rindex` + a new line
                file_handles[language].write(line[lindex+1:rindex-1] + "\n")
    for handle in file_handles.values():  # lets close our output file handles
        handle.close()

This should be noticeably faster than a regex, since it reads the input only once. As a bonus, it works with any two-letter language tag, so if you have ...@it lines it will write outputfile-it.txt as well.
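To make the slicing easier to check in isolation, here is a minimal sketch of the same rfind logic pulled into a helper. The function name extract_label and the Italian sample line are made up for illustration; they are not part of the original data.

```python
def extract_label(line):
    """Return (language, text) for a line ending in "text"@xx, or None."""
    rindex = line.rfind("@")  # position of the last `@`
    if rindex == -1:
        return None
    # Search stops before rindex - 1, so the closing quote is skipped
    # and the opening quote is found instead.
    lindex = line.rfind('"', 0, rindex - 1)
    if lindex == -1:
        return None
    language = line[rindex + 1:rindex + 3]  # two-character language tag
    return language, line[lindex + 1:rindex - 1]


# Hypothetical sample line in the same format as input.txt
sample = '<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Inclusione politica"@it .'
print(extract_label(sample))
```

Note that this assumes the label itself contains no `@` after the closing quote and that the language tag is exactly two characters long.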

1

You could do something like this:

import re

# named `data` rather than `str` to avoid shadowing the built-in type
data = """<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Politische Inklusion"@de .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Political inclusion"@en .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Radiologische Kampfmittel"@de . """

# `.` does not match newlines, so each match stays within one line
german = re.compile(r'"(.*)"@de')
english = re.compile(r'"(.*)"@en')

print(german.findall(data))
print(english.findall(data))

This would give you ['Politische Inklusion', 'Radiologische Kampfmittel'] and ['Political inclusion']. Now you only have to iterate over those results and write them to the appropriate file.
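One possible way to do that last step is sketched below. It is only an illustration: it swaps the two per-language patterns for a single pattern that also captures the language tag, and the helper name write_labels, the out_dir parameter, and the UTF-8 encoding are assumptions, not something the question specifies.

```python
import os
import re
from collections import defaultdict

# Captures the quoted text and the two-letter language tag in one pass.
LABEL = re.compile(r'"([^"]*)"@([a-z]{2})')


def write_labels(text, out_dir="."):
    """Group labels by language and write one outputfile-<lang>.txt each."""
    labels = defaultdict(list)
    for label, language in LABEL.findall(text):
        labels[language].append(label)
    for language, items in labels.items():
        path = os.path.join(out_dir, "outputfile-{}.txt".format(language))
        # UTF-8 is an assumption; adjust to match your input file.
        with open(path, "w", encoding="utf-8") as out:
            out.write("\n".join(items) + "\n")
    return dict(labels)


data = """<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Politische Inklusion"@de .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Political inclusion"@en .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Radiologische Kampfmittel"@de . """

result = write_labels(data)
```

For a real file you would pass the contents of input.txt instead of the inline string. The pattern assumes labels contain no embedded double quotes.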
