This should be easy but I've got errors that I can't work out. I've got some air pollution stats for the UK that I want to parse.

https://uk-air.defra.gov.uk/data/DAQI-regional-data?regionIds%5B%5D=999&aggRegionId%5B%5D=999&datePreset=6&startDay=01&startMonth=01&startYear=2022&endDay=01&endMonth=01&endYear=2023&queryId=&action=step2&go=Next+

But using read_csv results in the error:

ParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 2

df = pd.read_html("https://uk-air.defra.gov.uk/data/DAQI-regional-data?regionIds%5B%5D=999&aggRegionId%5B%5D=999&datePreset=6&startDay=01&startMonth=01&startYear=2022&endDay=01&endMonth=01&endYear=2023&queryId=&action=step2&go=Next+")
df

This returns the data as a list. But I want to turn that list into a dataframe.

Which is the best way to solve the problem?

  • This is HTML, not CSV. Commented Apr 19, 2023 at 13:21
  • If I click the link it downloads as CSV. But thank you for taking the time to point that out. At least the community solved the problem. Commented Apr 19, 2023 at 13:34

3 Answers

read_html always returns a list of DataFrames even if there is only one. You need to index it.

From the documentation for pandas.read_html:

Read HTML tables into a list of DataFrame objects.

Returns: dfs, a list of DataFrames.

df = pd.read_html("https://uk-air.defra.gov.uk/...")[0] # <-- add [0] at the end

Output:

print(df)

           Date  ...  West Yorkshire Urban Area
0    01/01/2022  ...                          2
1    02/01/2022  ...                          3
2    03/01/2022  ...                          3
..          ...  ...                        ...
362  29/12/2022  ...                          3
363  30/12/2022  ...                          3
364  31/12/2022  ...                          3

[365 rows x 33 columns]
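To see the list behaviour without hitting the network, here is a minimal sketch that parses a small inline table. The HTML string is a made-up stand-in for the DEFRA page, not its real markup, and it assumes an HTML parser that read_html can use (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the page: a single small HTML table.
html = """
<table>
  <tr><th>Date</th><th>Greater London</th></tr>
  <tr><td>01/01/2022</td><td>2</td></tr>
  <tr><td>02/01/2022</td><td>3</td></tr>
</table>
"""

# read_html returns a list, even though the document holds only one table.
tables = pd.read_html(StringIO(html))

# Indexing with [0] pulls the first (here, the only) DataFrame out of the list.
df = tables[0]

print(type(tables).__name__, len(tables))
print(df.shape)
```

The same indexing applies when a URL is passed instead of a StringIO object.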

2 Comments

Thank you sir. What is the purpose of the [0] by the way?
You're welcome. It is used to access the first item (which is a DataFrame) of the list.

pandas read_html handles such cases directly:

import pandas as pd

# Specify the URL of the HTML page containing the table
url = "..."

# Use the pandas read_html() method to read the table data into a list of dataframes
tables = pd.read_html(url)

# If there are multiple tables on the page, you can select the one you want by index
table = tables[0]
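When a page has several tables, picking by position can be fragile. As an alternative, read_html accepts a match argument that keeps only tables containing text matching a string or regex. The sketch below uses an invented two-table HTML string (not the real page) and assumes a parser such as lxml is available.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second has a "Date" column.
html = """
<table>
  <tr><th>Region</th><th>Code</th></tr>
  <tr><td>Greater London</td><td>A</td></tr>
</table>
<table>
  <tr><th>Date</th><th>Greater London</th></tr>
  <tr><td>01/01/2022</td><td>2</td></tr>
</table>
"""

# Without match, every table on the page comes back.
all_tables = pd.read_html(StringIO(html))

# With match, only tables containing the text "Date" are returned.
# The result is still a list, so it still needs indexing.
daily_tables = pd.read_html(StringIO(html), match="Date")
daily = daily_tables[0]

print(len(all_tables), len(daily_tables))
print(list(daily.columns))
```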

Comments

My code

import pandas as pd
url = "https://uk-air.defra.gov.uk/data/DAQI-regional-data?regionIds%5B%5D=999&aggRegionId%5B%5D=999&datePreset=6&startDay=01&startMonth=01&startYear=2022&endDay=01&endMonth=01&endYear=2023&queryId=&action=step2&go=Next+"
dfs = pd.read_html(url)
type(dfs)  # Output: list
len(dfs)  # Output: 1
df = dfs[0]  # index the list; wrapping it in pd.DataFrame(dfs) does not unpack the table
type(df)  # Output: pandas.core.frame.DataFrame

df.columns
""" Output:
Index(['Date', 'Central Scotland', 'East Midlands', 'Eastern',
   'Greater London', 'Highland', 'North East', 'North East Scotland',
   'North Wales', 'North West & Merseyside', 'Northern Ireland',
   'Scottish Borders', 'South East', 'South Wales', 'South West',
   'West Midlands', 'Yorkshire & Humberside',
   'Belfast Metropolitan Urban Area', 'Brighton/Worthing/Littlehampton',
   'Bristol Urban Area', 'Cardiff Urban Area', 'Edinburgh Urban Area',
   'Glasgow Urban Area', 'Greater Manchester Urban Area',
   'Leicester Urban Area', 'Liverpool Urban Area', 'Nottingham Urban Area',
   'Portsmouth Urban Area', 'Sheffield Urban Area', 'Swansea Urban Area',
   'Tyneside', 'West Midlands Urban Area', 'West Yorkshire Urban Area'],
  dtype='object')
"""
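If a list ever holds more than one DataFrame with the same columns and you want a single frame, pd.concat is one way to combine them. This is a rough sketch with made-up frames standing in for parsed tables; the page in the question only yields one table, so indexing is enough there.

```python
import pandas as pd

# Hypothetical frames standing in for two tables parsed from one page.
first_part = pd.DataFrame(
    {"Date": ["01/01/2022", "02/01/2022"], "Greater London": [2, 3]}
)
second_part = pd.DataFrame({"Date": ["03/01/2022"], "Greater London": [3]})
dfs = [first_part, second_part]

# Stack the frames row-wise and renumber the index from 0.
df = pd.concat(dfs, ignore_index=True)

print(df.shape)
print(list(df.columns))
```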

Comments
