
RESOLVED: The problem was caused by the Python version; see stackoverflow.com/a/5513856/2540382

I am fiddling with htm -> txt file conversion and am having a little trouble. My project is essentially to convert the messages.htm file I downloaded of my Facebook chat history into a messages.txt file with all the <> brackets removed and formatting preserved.

The file messages.htm is parsed into variable text.

I then run:

target = open('output.txt', 'w')
target.write(text)
target.close

This seems to work except when I hit an invalid character, as seen in the error below. Is there a way to either:

  1. Skip the line with the invalid character while writing?

  2. Figure out where the invalid characters are and remove the corresponding character or line?

The desired outcome is to avoid having strange characters altogether, if possible.

return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U000fe333' in position 37524: character maps to <undefined>
  • Tip (but not the problem): file.close is a method, so you have to write target.close() to actually call it. Commented Oct 31, 2015 at 5:07
  • Can you give an example of invalid characters? Commented Oct 31, 2015 at 5:16
  • @flamenco, they give the example '\U000fe333'. Commented Oct 31, 2015 at 5:18

1 Answer

target = open('output.txt', 'wb')
target.write(text.encode('ascii', 'ignore'))
target.close()

For the "errors" argument to .encode(..), 'ignore' will strip out those characters, and 'replace' will replace them with '?'.
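As a rough illustration of how the error handlers differ, here is a minimal sketch in Python 3 syntax. The sample string reuses the code point from the question's traceback; 'backslashreplace' is a third standard handler that was not mentioned above.

```python
# Compare error handlers when encoding to ASCII (Python 3 syntax).
sample = "foo\U000fe333bar"  # same code point as in the question's traceback

ignored = sample.encode("ascii", "ignore")            # drops the character
replaced = sample.encode("ascii", "replace")          # substitutes '?'
escaped = sample.encode("ascii", "backslashreplace")  # keeps an escape sequence

print(ignored)   # b'foobar'
print(replaced)  # b'foo?bar'
print(escaped)   # b'foo\\U000fe333bar'
```

'ignore' gives the cleanest output, while 'replace' and 'backslashreplace' leave a visible trace of where something was removed.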

To test this, I replaced the write line with

target.write(u"foo\U000fe333bar".encode("ascii", "ignore"))

and confirmed that output.txt contained only "foobar".
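For the second option in the question (finding where the offending characters are), one possible approach is to try encoding each character on its own. This is only a sketch: find_unencodable is a hypothetical helper, and "cp1252" is an assumption about which Windows code page the 'charmap' codec in the traceback refers to.

```python
def find_unencodable(text, encoding="ascii"):
    """Return (index, character) pairs that cannot be encoded."""
    hits = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            hits.append((index, char))
    return hits


sample = "first line\nfoo\U000fe333bar\nlast line"  # stand-in for the parsed text

# Locate the offending characters (assumed target encoding: cp1252).
bad = find_unencodable(sample, "cp1252")
print(bad)

# Option 1 from the question: skip every line that contains such a character.
clean_lines = [line for line in sample.splitlines()
               if not find_unencodable(line, "cp1252")]
print(clean_lines)
```

Checking character by character is slow for very large files, but it reports every problem position rather than only the first one.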

UPDATE: I edited the open(.., 'w') to open(.., 'wb') to make sure this would work in Python 3 as well.
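In Python 3, an alternative to encoding by hand is to let open() do the encoding, so that write() still receives a str. The sketch below assumes Python 3; the temporary directory and the sample text are placeholders for illustration, not part of the original script.

```python
import os
import tempfile

text = "foo\U000fe333bar"  # stand-in for the parsed messages.htm contents

# Hypothetical output location, used so the example does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")

# Text mode: open() encodes on write and silently drops non-ASCII characters.
with open(out_path, "w", encoding="ascii", errors="ignore") as target:
    target.write(text)

with open(out_path, "r", encoding="ascii") as check:
    result = check.read()

print(result)  # foobar
```

Passing encoding="utf-8" instead would keep the character rather than drop it, and the with block closes the file automatically.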


5 Comments

Hmm, I get this error: File "html2text.py", line 693, in wrapwrite target.write(text.encode('ascii', 'ignore')) TypeError: write() argument must be str, not bytes
What type is "text"? I tested it with a string and Python 2.7.10.
Sorry for the delayed response, I added print(type(text)) to the code. Cmd is telling me that the type is string. C:\Users\kevin\Desktop\workspace>python html2text.py part1.htm <class 'str'>
It appears to be a change between Python 2 and Python 3. stackoverflow.com/a/5513856/2540382
This does not work for me: f.write(towrite.encode("ascii", "ignore")) TypeError: write() argument must be str, not bytes
