
I am trying to remove the arrow characters, blank lines and headers from this text file and write the result to a new file, MICnew.txt. My code doesn't do that: nothing changes in the new file. Please help. I have attached a sample file as well. My attempt is below:

import re
with open('MIC.txt') as oldfile, open('MICnew.txt', 'w') as newfile:
    for line in oldfile:
        newfile.write(re.sub(r'[^\x00-\x7f]',r' ',line))

with open('MICnew.txt','r+') as file:
    for line in file:
        if not line.isspace():
            file.write(line)
2 Comments

Please extend your error description beyond "it doesn't work"! Commented Nov 17, 2021 at 4:54
@KlausD. Done, thanks. Commented Nov 17, 2021 at 4:57

1 Answer

You can't reliably read from and write to the same file handle inside one loop. When you open a file with mode r+, the I/O pointer starts at the beginning, but reading pushes it toward the end (as explained in this answer). Python reads ahead in chunks, so for a short file the very first read consumes everything. In your case, reading the first line therefore moves the pointer to the end of the file. That line is then written out (unless it is all whitespace), but crucially the pointer stays at the end. On the next iteration the loop has already reached the end of the file, so your program stops.
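The pointer movement described above can be observed directly with tell(). The following is only a minimal sketch: the helper name pointer_positions and the temporary file are made up for illustration, and it uses a binary-mode handle with a single read() instead of line iteration, so it shows the pointer concept rather than the exact buffering of the loop in the question.

```python
import os
import tempfile


def pointer_positions(data: bytes, extra: bytes):
    """Report where the file pointer sits before a read, after it, and after a write.

    Hypothetical helper for illustration only; it works on a throwaway temp file.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Create the file we are going to experiment on.
        with open(path, 'wb') as fh:
            fh.write(data)

        with open(path, 'r+b') as fh:
            start = fh.tell()        # pointer at the beginning
            fh.read()                # reading everything moves it to the end
            after_read = fh.tell()
            fh.write(extra)          # so this write lands after the old content
            after_write = fh.tell()

        # Re-read the file to see what the write actually did.
        with open(path, 'rb') as fh:
            final = fh.read()
    finally:
        os.remove(path)
    return start, after_read, after_write, final


print(pointer_positions(b"a\nb\n", b"X"))
```

The write does not replace anything: it is appended after the existing content, because that is where the read left the pointer.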

To avoid this, read in all the contents of the file first, then loop over them and write out what you want:

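from pathlib import Path

file_data = Path('MICnew.txt').read_text()

with open('MICnew.txt', 'w') as out_handle:  # THIS WILL OVERWRITE THE FILE!
    for line in file_data.splitlines():
        # splitlines() drops the line endings, so blank lines arrive as ''
        # or as whitespace-only strings; strip() catches both cases
        if line.strip():
            out_handle.write(line + '\n')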

But that double loop is a bit clumsy and you can instead combine the two steps into one:

import re

with open('MIC.txt', errors='ignore') as oldfile, \
     open('MICnew.txt', 'w') as newfile:

    for line in oldfile:
        # drop form feeds at the line edges, then blank out non-ASCII characters
        clean_line = re.sub(r'[^\x00-\x7f]', ' ', line.strip('\x0c'))
        if not clean_line.isspace():
            newfile.write(clean_line)

The file is opened with errors='ignore' so that bytes which cannot be decoded are silently dropped instead of raising an error. Since the sample file contains a number of rogue form feed characters (ASCII code 12, or \x0c in hex), the code also strips them explicitly. Note that line.strip('\x0c') only removes form feeds at the start or end of a line.
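If you want to reuse or test the cleaning step, one option is to wrap it in small functions. This is a sketch under a few assumptions, not the code from the answer above: the names clean_lines and clean_file are hypothetical, the encoding parameter (defaulting to UTF-8) is an addition, and it uses str.replace instead of str.strip so that form feeds in the middle of a line are removed too.

```python
import re

# Matches any character outside the 7-bit ASCII range (e.g. an arrow).
NON_ASCII = re.compile(r'[^\x00-\x7f]')


def clean_lines(lines):
    """Yield cleaned lines, skipping those that end up blank.

    Hypothetical helper: removes every form feed (\\x0c), replaces each
    non-ASCII character with a space, and drops whitespace-only lines.
    """
    for line in lines:
        cleaned = NON_ASCII.sub(' ', line.replace('\x0c', ''))
        if cleaned.strip():
            yield cleaned


def clean_file(src, dst, encoding='utf-8'):
    """Read src, clean it line by line, and write the result to dst.

    The encoding default is an assumption; pass whatever matches your file.
    """
    with open(src, encoding=encoding, errors='ignore') as oldfile, \
         open(dst, 'w', encoding=encoding) as newfile:
        newfile.writelines(clean_lines(oldfile))
```

With these in place, clean_file('MIC.txt', 'MICnew.txt') would perform the same single pass as the combined loop, and clean_lines can be checked on a plain list of strings without touching any file.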

3 Comments

Thanks Jan. I tried the second part of your answer (the merged loops), but it did not remove the special characters or the blank lines. Could you try your code on the sample file I attached to the question (hyperlink)? Thanks.
@CodeBot I updated my answer and tested it against your sample file. Since it has a number of form feed characters, I remove them explicitly.
Superb! This works now. Thanks @Jan, I really appreciate it. I have marked it as the answer.
