Replace an arrow character, repeating headers and blank lines in text file and paste the data cleanly in Excel sheet

Question

I am trying to remove an arrow character, blank lines and repeating headers from this text file and write the result to a new file, MICnew.txt. My code doesn't do it: nothing changes in the new file. I have attached a sample file as well. Please help, thanks so much.

My attempt is below:

import re
with open('MIC.txt') as oldfile, open('MICnew.txt', 'w') as newfile:
    for line in oldfile:
        newfile.write(re.sub(r'[^\x00-\x7f]',r' ',line))

with open('MICnew.txt','r+') as file:
    for line in file:
        if not line.isspace():
            file.write(line)

Please extend your error description beyond "it doesn't work"!
– Klaus D., Commented Nov 17, 2021 at 4:54

Jan Wilamowski · Accepted Answer · 2021-11-19 03:27:05Z


You can't read from and write to the same file simultaneously. When you open a file with mode r+, the I/O pointer is initially at the beginning but reading will push it to the end (as explained in this answer). So in your case, you read the first line of the file, which moves the pointer to the end of the file. Then you write out that line (unless it's all whitespace) but crucially, the pointer stays at the end. That means on the next iteration of the loop you will have reached the end of the file and your program stops.
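The effect of the pointer position can be seen in a small experiment. This is a simplified sketch, not the code from the question: it uses a throwaway temporary file and reads the whole file in one call instead of line by line, so it only shows that a write issued after the read has reached the end of the file is appended rather than placed back at the start.

```python
import os
import tempfile

# Create a small throwaway file to experiment on.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as f:
    f.write('first\nsecond\n')

with open(path, 'r+') as f:
    f.read()            # reading to the end leaves the pointer at end of file
    f.write('third\n')  # so this write is appended; nothing is overwritten

with open(path) as f:
    result = f.read()
os.remove(path)

print(repr(result))
```

The existing lines are still there and the new line sits after them, which is why filtering a file "in place" this way cannot remove anything.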

To avoid this, read in all the contents of the file first, then loop over them and write out what you want:

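from pathlib import Path

file_data = Path('MICnew.txt').read_text()

with open('MICnew.txt', 'w') as out_handle:  # THIS WILL OVERWRITE THE FILE!
    for line in file_data.splitlines():
        # splitlines() drops the newline, so a blank line arrives as ''
        # and ''.isspace() is False; test with strip() instead
        if line.strip():
            out_handle.write(line + '\n')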

But making two passes over the data is a bit clumsy and you can instead combine the two steps into one:

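import re

with open('MIC.txt', errors='ignore') as oldfile, \
     open('MICnew.txt', 'w') as newfile:
    for line in oldfile:
        # Strip form feeds from the line's ends, then replace non-ASCII characters with spaces
        clean_line = re.sub(r'[^\x00-\x7f]', ' ', line.strip('\x0c'))
        # Skip lines that contain only whitespace
        if not clean_line.isspace():
            newfile.write(clean_line)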

The file is opened with errors='ignore', which silently drops any bytes that cannot be decoded in the file's encoding; the regular expression then replaces the remaining non-ASCII characters with spaces. Since the sample file contains a number of rogue form feed characters (ASCII code 12, or \x0c in hex), the code explicitly strips them from the start and end of each line.
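The same cleanup can be factored into a small function that works on any iterable of lines, which makes it easy to try without touching the files. This is only a sketch under stated assumptions: `clean_lines` and its `header` parameter are hypothetical names that do not appear in the answer, and the header text in the sample data is made up. Unlike `strip('\x0c')`, it removes form feeds anywhere in a line, it also trims trailing whitespace, and it can optionally drop repeats of a header line, which the question title asks for.

```python
import re

NON_ASCII = re.compile(r'[^\x00-\x7f]')

def clean_lines(lines, header=None):
    """Yield cleaned, non-blank lines; keep only the first copy of `header`."""
    seen_header = False
    for line in lines:
        # Drop form feeds wherever they occur, replace non-ASCII characters
        # with spaces, and trim trailing whitespace (including the newline).
        text = NON_ASCII.sub(' ', line.replace('\x0c', '')).rstrip()
        if not text.strip():
            continue  # skip lines that are empty or whitespace only
        if header is not None and text.strip() == header:
            if seen_header:
                continue  # a repeated header line
            seen_header = True
        yield text + '\n'

# Made-up sample data: a header, an arrow character, a blank line,
# and a repeated header preceded by a form feed.
sample = ['ID  NAME\n', '1 \u2192 foo\n', '\n', '\x0cID  NAME\n', '2 \u2192 bar\n']
cleaned = list(clean_lines(sample, header='ID  NAME'))
print(cleaned)
```

To apply it to the files from the question, pass the open input file to the function and write the result with `newfile.writelines(clean_lines(oldfile))`.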

edited Nov 19, 2021 at 3:27

answered Nov 17, 2021 at 9:20


3 Comments

Shri Over a year ago

Thanks Jan. I tried the second part of your answer (the merged loops), but it did not remove the special character or the blank lines. Could you try your code on the sample file I have attached to the question (hyperlink)? Thanks

Jan Wilamowski Over a year ago

@CodeBot I updated my answer and tested against your sample file. Since it has a number of formfeed characters, I remove them explicitly.

Shri Over a year ago

Superb! This works now. Thanks @Jan, I really appreciate it. I have marked it as the answer.
