PYTHON Tensorflow, Text analyzing: Non-ASCII character '\xc3' in file

Question

I have only basic knowledge of Python and am working on a tweet-analysis API. I found an NLP tutorial that uses t-SNE and word2vec. My setup is described in an earlier Stack Overflow post.

I followed the tutorial step-by-step, but upon running the code, I encountered an error:

Non-ASCII character '\xc3' in file

What causes this? The code snippet is below.

def process_raw_data(input_file):
  valid = u"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#@.:/ äöåÄÖÅ"
  url_match = "(https?:\/\/[0-9a-zA-Z\-\_]+\.[\-\_0-9a-zA-Z]+\.?[0-9a-zA-Z\-\_]*\/?.*)"
  name_match = "\@[\_0-9a-zA-Z]+\:?"
  lines = []
  print("Loading raw data from: " + input_file)
  if os.path.exists(input_file):
    with io.open(input_file, 'r', encoding="utf-8") as f:
      lines = f.readlines()
  num_lines = len(lines)
  ret = []
  for count, text in enumerate(lines):
    if count % 50 == 0:
      print_progress(count, num_lines)
    text = re.sub(url_match, u"", text)
    text = re.sub(name_match, u"", text)
    text = re.sub("\&amp\;?", u"", text)
    text = re.sub("[\:\.]{1,}$", u"", text)
    text = re.sub("^RT\:?", u"", text)
    text = u''.join(x for x in text if x in valid)
    text = text.strip()
    if len(text.split()) > 5:
      if text not in ret:
        ret.append(text)
  return ret

We can't know the encoding of a file whose contents you are not showing. See also meta.stackoverflow.com/questions/379403/…
– tripleee, Commented Dec 14, 2020 at 7:29

Andrey · Accepted Answer · 2020-12-14 07:18:50Z


Your input_file probably has a different encoding (not UTF-8).
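A minimal sketch of how both likely causes could be checked, under stated assumptions. First, if the full message ends with "but no encoding declared", it is the SyntaxError Python 2 raises when the source file itself contains non-ASCII bytes (such as the äöåÄÖÅ literal in `valid`) without a PEP 263 encoding declaration; the first line below addresses that case. Second, if the data file is not UTF-8, a reader that tries a few candidate encodings can help narrow it down. The helper name `read_lines_with_fallback` and the candidate list are hypothetical and not part of the tutorial code.

```python
# -*- coding: utf-8 -*-
# The line above is an encoding declaration (PEP 263). Python 2 needs it when
# the source file contains non-ASCII literals; Python 3 assumes UTF-8.
import io

# Non-ASCII literal similar to the one in the question's `valid` string.
EXTRA_LETTERS = u"äöåÄÖÅ"


def read_lines_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Hypothetical helper: decode `path` with the first encoding that works.

    Returns a (lines, encoding) tuple.
    """
    last_error = None
    for enc in encodings:
        try:
            with io.open(path, "r", encoding=enc) as f:
                return f.readlines(), enc
        except UnicodeDecodeError as err:
            last_error = err  # wrong guess; try the next candidate
    if last_error is None:
        raise ValueError("no candidate encodings given")
    raise last_error
```

Calling `lines, enc = read_lines_with_fallback(input_file)` in place of the `io.open(...)` block in `process_raw_data` would report which encoding was actually used. Note that latin-1 maps every byte to some character, so it never fails and can silently yield wrong characters; a latin-1 result is best treated as a hint to inspect the file rather than as confirmation.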

