3

Here is a data set I am interested in:

http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_E_All_Data.zip

It consists of 3 files:

[Screenshot: contents of the zip archive]

I want to download the zip with pandas and create a DataFrame from one file, Production_Crops_E_All_Data.csv:

import pandas as pd
url="http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_E_All_Data.zip"
df=pd.read_csv(url)

Pandas can download files, it can read zip archives, and of course it can read CSV files. But how can I read one specific file from an archive that contains several files?

Currently I get this error:

ValueError: ('Multiple files found in compressed zip file %s)

This post doesn't answer my question because I have multiple files in one zip: Read a zipped file as a pandas DataFrame

2
  • Does this answer your question? Read a zipped file as a pandas DataFrame Commented Jul 6, 2020 at 8:49
  • @Pasindu Gamarachchi no, the link you pointed to works well when the zip file contains only a single file, but the OP is talking about multiple files contained in a single zip file. Commented Aug 30, 2021 at 13:20

2 Answers

4

Adapted from this link, try this:

from zipfile import ZipFile
import io
from urllib.request import urlopen
import pandas as pd

# Download the whole archive into memory as bytes
r = urlopen("http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_E_All_Data.zip").read()
# Wrap the bytes in a file-like object so ZipFile can read them
file = ZipFile(io.BytesIO(r))
# Open each member by name and pass the handle to read_csv
data_df = pd.read_csv(file.open("Production_Crops_E_All_Data.csv"), encoding='latin1')
data_df_noflags = pd.read_csv(file.open("Production_Crops_E_All_Data_NOFLAG.csv"), encoding='latin1')
data_df_flags = pd.read_csv(file.open("Production_Crops_E_Flags.csv"), encoding='latin1')

Hope this helps!

EDIT: updated for Python 3 (StringIO became io.StringIO).

EDIT: updated the urllib import and replaced StringIO with BytesIO. Also, your CSV files are not UTF-8 encoded; I tried latin1 and that worked.
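If the member names are not known in advance, the archive can be inspected before reading anything. The following is a minimal sketch using only the standard library. The small in-memory archive and its contents are stand-ins for the downloaded bytes `r`, and `pick_member` is a hypothetical helper name, not part of zipfile or pandas.

```python
import io
from zipfile import ZipFile


def pick_member(zip_bytes, suffix):
    """Return the single member name ending with `suffix`, or raise ValueError."""
    with ZipFile(io.BytesIO(zip_bytes)) as archive:
        matches = [name for name in archive.namelist() if name.endswith(suffix)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match for {suffix!r}, got {matches}")
    return matches[0]


# Build a small stand-in archive in memory (assumption: the real bytes would
# come from urlopen(...).read(), as in the answer above).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("Production_Crops_E_All_Data.csv", "Area,Value\nX,1\n")
    archive.writestr("Production_Crops_E_All_Data_NOFLAG.csv", "Area,Value\nX,1\n")
    archive.writestr("Production_Crops_E_Flags.csv", "Flag,Description\nA,Official\n")

demo_bytes = buffer.getvalue()
chosen = pick_member(demo_bytes, "_All_Data.csv")
```

The returned name can then be passed to `file.open(...)`. Raising on zero or several matches keeps an ambiguous suffix from silently selecting the wrong file.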


4 Comments

import urllib should be edited to import urllib.request.
file = ZipFile(io.StringIO(r)) raises: TypeError: initial_value must be str or None, not bytes
Thank you for updating the post, but the error 'TypeError: initial_value must be str or None, not bytes' still occurs. Adding .read().decode('utf8') doesn't help.
Hey @IgorK. Updated the answer to fix that; please use BytesIO instead of StringIO. Cheers!
1

You could use Python's datatable, a reimplementation of R's data.table in Python.

Read in the data:

from datatable import fread

# The exact file to extract is known, so append its name to the zip path.
# The path below refers to a local copy of the zip.
url = "Production_Crops_E_All_Data.zip/Production_Crops_E_All_Data.csv"

df = fread(url)

# Convert to pandas
pandas_df = df.to_pandas()

You can equally stay within datatable. Note, however, that it is not as feature-rich as pandas, though it is a powerful and very fast tool.

Update: You can use the zipfile module as well:

import pandas as pd
from zipfile import ZipFile
from io import BytesIO

# Path to the zip archive itself (not the member path used with fread above)
zip_path = "Production_Crops_E_All_Data.zip"

with ZipFile(zip_path) as myzip:
    with myzip.open("Production_Crops_E_All_Data.csv") as myfile:
        data = myfile.read()

# Read the data into pandas.
# I had to toy a bit with the encoding;
# thankfully it is a known issue on SO:
# https://stackoverflow.com/a/51843284/7175713
df = pd.read_csv(BytesIO(data), encoding="iso-8859-1", low_memory=False)
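The open member can also be handed to pd.read_csv directly, which skips the intermediate read() and BytesIO copy. Here is a minimal sketch of that variant. The two-file in-memory archive, the member names `first.csv` and `second.csv`, and their contents are made up for illustration and stand in for the real FAO zip.

```python
import io
from zipfile import ZipFile

import pandas as pd

# Stand-in archive with two members (assumption: replaces the downloaded zip).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    # Encode one member as ISO-8859-1 to mimic the non-UTF-8 CSV files
    archive.writestr("first.csv", "name,value\nCôte,1\n".encode("iso-8859-1"))
    archive.writestr("second.csv", "name,value\nb,2\n")

buffer.seek(0)
with ZipFile(buffer) as archive:
    # Select exactly one member by name; read_csv consumes the binary handle
    with archive.open("first.csv") as member:
        df_small = pd.read_csv(member, encoding="iso-8859-1")
```

Parsing must finish inside the `with` block, because the member handle is closed on exit.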

1 Comment

Unfortunately, I cannot install this library with pip install: "SystemExit: Suitable C++ compiler cannot be determined. Please specify a compiler executable in the CXX environment variable."
