Encoding does not change when using Python pandas.read_csv on Azure

Question

I am reading a CSV file with Python pandas and trying to change the encoding, because the file contains some German letters. Azure seems to always keep the same encoding (presumably the default).

Whatever I do, I always get the same error in the Azure portal: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte

The same error appears even if I set utf-16, latin1, cp1252, etc.

with pysftp.Connection(host, username=username, password=password, cnopts=cnopts) as sftp:
    for i in sftp.listdir_attr():
        with sftp.open(i.filename) as f:
            df = pd.read_csv(f, delimiter=';', encoding='cp1252')

By the way, when I test this locally on a Windows machine, it works fine.

Full error:

Result: Failure
Exception: UnicodeDecodeError: 'utf-8' codec cant decode byte 0xc4 in position 0: invalid continuation byte
Stack:
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", line 355, in _handle__invocation_request
    call_result = await self._loop.run_in_executor(
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", line 542, in __run_sync_func
    return func(**params)
  File "/home/site/wwwroot/ce_etl/etl_main.py", line 141, in main
    df = pd.read_csv(f, delimiter=';', encoding=r"utf-8-sig", error_bad_lines=False)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 488, in _read
    return parser.read(nrows)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 223, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 801, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 880, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1026, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1080, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1204, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1217, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1396, in pandas._libs.parsers._string_box_utf8

First, you could check which encoding uses the byte 0xc4; maybe it is a different encoding than you expect.
– furas, Commented Nov 7, 2021 at 20:02
When I test b'\xc4'.decode('cp1252') or b'\xc4'.decode('latin1'), I get 'Ä'. Maybe the problem is in a different place. It is better to show the FULL error message in the question (not in comments), along with the original code that generates this error. Maybe you set the encoding in the wrong line or in the wrong file, and Azure keeps running the wrong code.
– furas, Commented Nov 7, 2021 at 20:04
Yes, Ä is what needs to be decoded. The file has ANSI encoding. I think cp1252 should work in that case, and it does, but only locally on a Windows machine. On Azure, the same code doesn't work.
– choka, Commented Nov 8, 2021 at 8:47
Look at your error message again: it shows main df = pd.read_csv(..., encoding=r"utf-8-sig"), so you are running different code and it still uses utf-8-sig.
– furas, Commented Nov 8, 2021 at 12:25
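
To illustrate the point made in the comments above, here is a minimal sketch using only the standard library. The sample bytes are an assumption (a cp1252-encoded German word), not data from the original file. It shows that 0xc4 decodes to Ä under cp1252 and latin1, while a strict utf-8 decode fails for the same reason reported in the traceback.

```python
# Hypothetical sample: the German word "Ärger" encoded as cp1252 (Ä is byte 0xc4).
raw = "Ärger".encode("cp1252")

# Both single-byte codecs map 0xc4 to "Ä".
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin1")

# UTF-8 treats 0xc4 as the first byte of a two-byte sequence; the next byte
# ("r") is not a continuation byte, so decoding fails at position 0.
utf8_reason = None
utf8_position = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    utf8_reason = exc.reason
    utf8_position = exc.start

print(as_cp1252, as_latin1, utf8_reason, utf8_position)
```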

SaiKarri-MT · Accepted Answer · 2021-11-09 09:19:00Z

0

You can use encoding as below:

pd.read_csv('file', encoding="ISO-8859-1")

If you would rather detect the file's actual encoding and pass it to read_csv, you can do it as below:

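import chardet
with open(r'C:\test.csv', 'rb') as f:  # chardet needs raw bytes, so open in binary mode
    result = chardet.detect(f.read())  # or f.readline() if the file is large
df = pd.read_csv(r'C:\test.csv', encoding=result['encoding'])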

Refer to read_csv in the pandas documentation.
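
If chardet is not available, a rough stand-in (not what this answer uses, and not real detection) is to try a short list of candidate encodings in order and keep the first one that decodes cleanly. The function name pick_encoding and the candidate list are hypothetical, chosen only for illustration.

```python
from typing import Optional, Sequence


def pick_encoding(raw: bytes,
                  candidates: Sequence[str] = ("utf-8", "cp1252")) -> Optional[str]:
    """Return the first candidate encoding that decodes raw without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec rejects the bytes, so try the next one
        return name
    return None  # no candidate could decode the data


# Hypothetical samples: the same text stored in two different encodings.
print(pick_encoding("Ärger".encode("cp1252")))
print(pick_encoding("Ärger".encode("utf-8")))
```

The order of the candidates matters: strict codecs such as utf-8 should come first, because single-byte codecs such as cp1252 accept almost any input and would otherwise hide a better match.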


1 Comment

choka Over a year ago

This was a really good point. It returns ascii. I've added the encoding dynamically as you suggested, but it still returns the same error: 'UnicodeDecodeError: 'utf-8' codec cant decode byte 0xc4...'. I also tried encoding = "ISO-8859-1" and got the same result.
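
One detail worth noting about the comment above: ISO-8859-1 assigns a character to every possible byte value, so decoding with it cannot raise UnicodeDecodeError. A short check of that, using only the standard library:

```python
# Every byte value from 0x00 to 0xff, decoded as ISO-8859-1.
all_bytes = bytes(range(256))
text = all_bytes.decode("iso-8859-1")  # this codec accepts every byte value

# The mapping is one byte to one character, and it round-trips exactly.
round_trip = text.encode("iso-8859-1")
print(len(text), round_trip == all_bytes)
```

So if an error naming the 'utf-8' codec still appears after passing encoding="ISO-8859-1", the bytes are presumably being decoded somewhere that does not use that argument.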

choka · Accepted Answer · 2021-11-16 14:03:45Z

0

I found a solution. Basically, sftp.open keeps utf-8 by default. Why read_csv cannot change the encoding on Azure Linux remains an open question.

Reading the file into an in-memory object with sftp.getfo and then parsing it into a DataFrame works fine:

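import io

# Download the remote file as raw bytes into an in-memory buffer
ba = io.BytesIO()
sftp.getfo(i.filename, ba)
ba.seek(0)

# Decode the bytes explicitly, so read_csv receives text rather than bytes
f = io.TextIOWrapper(ba, encoding='cp1252')
# Note: error_bad_lines is deprecated since pandas 1.3 in favor of on_bad_lines='skip'
df = pd.read_csv(f, delimiter=';', encoding='cp1252', dtype=str,
                 error_bad_lines=False)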
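
The same pattern can be tried without an SFTP server. The sketch below is an illustration under assumptions: the sample bytes are made up, and the standard library's csv.reader stands in for pd.read_csv (which also accepts a text handle like this one) so that the example has no third-party dependencies.

```python
import csv
import io

# Hypothetical file content: semicolon-delimited, cp1252-encoded.
raw = "Name;Stadt\r\nÄrger;Köln\r\n".encode("cp1252")

# Stand-in for sftp.getfo(...): the buffer holds raw bytes.
ba = io.BytesIO(raw)

# Decode explicitly; newline="" leaves line-ending handling to the csv module.
f = io.TextIOWrapper(ba, encoding="cp1252", newline="")
rows = list(csv.reader(f, delimiter=";"))

print(rows)
```

Because the wrapper does the decoding, the parser only ever sees text, and the result no longer depends on how the parser treats a byte-oriented handle.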

