I'm reading a CSV file with Python pandas and trying to change the encoding because the file contains some German letters, but Azure seems to always keep the same encoding (the default, I assume).

Whatever I do, I always get the same error in the Azure portal: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte

The same error appears even if I set utf-16, latin1, cp1252, etc.

with pysftp.Connection(host, username=username, password=password, cnopts=cnopts) as sftp:
    for i in sftp.listdir_attr():
        with sftp.open(i.filename) as f:
            df = pd.read_csv(f, delimiter=';', encoding='cp1252')

By the way, when I test this locally on a Windows machine, it works fine.

Full error:

Result: Failure
Exception: UnicodeDecodeError: 'utf-8' codec cant decode byte 0xc4 in position 0: invalid continuation byte
Stack:
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", line 355, in _handle__invocation_request
    call_result = await self._loop.run_in_executor(
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", line 542, in __run_sync_func
    return func(**params)
  File "/home/site/wwwroot/ce_etl/etl_main.py", line 141, in main
    df = pd.read_csv(f, delimiter=';', encoding=r"utf-8-sig", error_bad_lines=False)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 488, in _read
    return parser.read(nrows)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 223, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 801, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 880, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1026, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1080, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1204, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1217, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1396, in pandas._libs.parsers._string_box_utf8
  • First, you could check which encoding uses the byte 0xc4. Maybe it is a different encoding than you expect. Commented Nov 7, 2021 at 20:02
  • When I test b'\xc4'.decode('cp1252') or b'\xc4'.decode('latin1'), I get 'Ä'. Maybe the problem is somewhere else. It would be better to show the FULL error message in the question (not in comments), along with the original code that generates this error. Maybe you set the encoding on the wrong line or in the wrong file, and Azure keeps running the wrong code. Commented Nov 7, 2021 at 20:04
  • Yes, Ä is what needs to be decoded. The file has ANSI encoding. I think cp1252 should work in that case, and it does, but only locally on a Windows machine. On Azure, the same code doesn't work. Commented Nov 8, 2021 at 8:47
  • I've updated the question with more details. Commented Nov 8, 2021 at 8:54
  • Look at your error message again. It shows main df = pd.read_csv(..., encoding=r"utf-8-sig"), so you are running different code and it still uses utf-8-sig. Commented Nov 8, 2021 at 12:25
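
To make the point in the comments concrete, here is a minimal sketch of how the byte 0xc4 behaves under the encodings discussed. The sample bytes are made up for illustration; they only assume a cp1252 ("ANSI") file that starts with 'Ä', as described above.

```python
# Hypothetical sample: 'Äpfel;1' as it would be stored in a cp1252 file.
raw = b"\xc4pfel;1\r\n"

# Single-byte encodings map 0xC4 straight to 'Ä'.
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin1")

# UTF-8 treats 0xC4 as the lead byte of a two-byte sequence, so the next
# byte would have to be a continuation byte (0x80-0xBF). 'p' is not one.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

print(as_cp1252.strip())
print(utf8_error)
```

If the message names the 'utf-8' codec, then some layer decoded the bytes as UTF-8, whatever encoding was requested elsewhere. That is a hint to check which code and which encoding argument are actually running.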

2 Answers

You can specify the encoding like this:

pd.read_csv('file', encoding="ISO-8859-1")

If you would rather detect the file's own encoding and pass it to read_csv, you can use chardet:

import chardet
with open(r'C:\test.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or f.readline() if the file is large
df = pd.read_csv(r'C:\test.csv', encoding=result['encoding'])

See the read_csv page in the pandas documentation.
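
Detection is a guess: a detector that only sees an ASCII-only sample can report ascii even when a later line contains 'Ä'. As an alternative, here is a small standard-library sketch that tries a list of candidate encodings in order. This is a different technique from chardet, and the helper name decode_with_fallback and the candidate list are assumptions made for illustration.

```python
def decode_with_fallback(data, encodings=("utf-8-sig", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: put strict encodings first, because single-byte
    encodings such as cp1252 accept almost any byte sequence.
    """
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# A cp1252 payload fails as UTF-8 and falls through to cp1252.
text, used = decode_with_fallback("Ä;Ö;Ü\n".encode("cp1252"))
print(used, text.strip())
```

The order matters: valid UTF-8 input would also decode "successfully" as cp1252, just into the wrong characters, so the stricter encoding has to be tried first.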

1 Comment

This was a really good point. It returns ascii. I've added the encoding dynamically as you suggested, but it still returns the same error: 'UnicodeDecodeError: 'utf-8' codec cant decode byte 0xc4...'. I also tried encoding = "ISO-8859-1" and got the same result.

I found a solution. Basically, sftp.open keeps UTF-8 by default. Why the encoding passed to read_csv has no effect on Azure Linux remains an open question.

Reading the file into an in-memory object with sftp.getfo and then parsing it into a DataFrame works fine:

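ba = io.BytesIO()
sftp.getfo(i.filename, ba)  # download the raw bytes without decoding them
ba.seek(0)                  # rewind so the wrapper reads from the start

# Decode explicitly as cp1252 before pandas sees the data.
f = io.TextIOWrapper(ba, encoding='cp1252')
# Note: error_bad_lines is deprecated since pandas 1.3 (on_bad_lines='skip' replaces it).
df = pd.read_csv(f, delimiter=';', encoding='cp1252', dtype=str,
                 error_bad_lines=False)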
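
The same buffering pattern can be tried without a server. The sketch below is only an illustration under stated assumptions: FakeSftp is a hypothetical stand-in that mimics nothing but getfo(remotepath, flo), the file name and contents are invented, and the standard-library csv module is swapped in for pd.read_csv to keep the example dependency-free.

```python
import csv
import io


class FakeSftp:
    """Hypothetical stand-in for a pysftp connection; only getfo is mimicked."""

    def __init__(self, files):
        self._files = files

    def getfo(self, remotepath, flo):
        # Like pysftp's getfo, write the remote file's raw bytes into flo.
        flo.write(self._files[remotepath])


def read_rows(sftp, filename, encoding="cp1252"):
    buffer = io.BytesIO()
    sftp.getfo(filename, buffer)  # raw bytes, nothing decoded yet
    buffer.seek(0)                # rewind before reading
    # Decoding happens here, with the encoding chosen explicitly.
    text = io.TextIOWrapper(buffer, encoding=encoding, newline="")
    return list(csv.reader(text, delimiter=";"))


payload = "Stadt;Land\r\nMünchen;Bayern\r\n".encode("cp1252")
sftp = FakeSftp({"staedte.csv": payload})
rows = read_rows(sftp, "staedte.csv")
print(rows)
```

Because the bytes are decoded by the TextIOWrapper before any parser touches them, the parser only ever receives text, so the result does not depend on how the SFTP file object or the platform would have decoded the data.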
