
I have a URL that I am having difficulty reading. It is unusual in that the data is self-generated, i.e. the page is built from my own inputs. The following pattern has worked fine for me with other queries, but not in this case:

bst = pd.read_csv('https://psl.noaa.gov/data/correlation/censo.data',
                  skiprows=1, skipfooter=2, index_col=[0], header=None,
                  engine='python',  # c engine doesn't have skipfooter
                  delim_whitespace=True)

Here is the code + URL that is providing the challenge:

zwnd = pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries',
                   skiprows=1, skipfooter=2, index_col=[0], header=None,
                   engine='python',  # c engine doesn't have skipfooter
                   delim_whitespace=True)

Thank you for any help that you can provide.

Here is the full error message:

pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries', skiprows=1, skipfooter=2,index_col=[0], header=None,
                 engine='python', # c engine doesn't have skipfooter
                 delim_whitespace=True)
Traceback (most recent call last):

  Cell In[240], line 1
    pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries', skiprows=1, skipfooter=2,index_col=[0], header=None,

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\util\_decorators.py:211 in wrapper
    return func(*args, **kwargs)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\util\_decorators.py:331 in wrapper
    return func(*args, **kwargs)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:950 in read_csv
    return _read(filepath_or_buffer, kwds)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:611 in _read
    return parser.read(nrows)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:1778 in read
    ) = self._engine.read(  # type: ignore[attr-defined]

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:282 in read
    alldata = self._rows_to_cols(content)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:1045 in _rows_to_cols
    self._alert_malformed(msg, row_num + 1)

  File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:765 in _alert_malformed
    raise ParserError(msg)

ParserError: Expected 2 fields in line 133, saw 3. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
  • I think you need to use BeautifulSoup or something similar to first extract the structured data from the page; the URL you've provided is not a raw CSV file, but an HTML file/page, and pd.read_csv is not able to directly extract the data from the appropriate HTML tags/elements. Commented Feb 2, 2023 at 19:03
  • Which one is line 133? Commented Feb 2, 2023 at 19:13

2 Answers


pd.read_csv does not parse HTML. You might try pd.read_html, but would find that it works on <table> tags, not <pre> tags.
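To illustrate that limitation, pd.read_html only discovers <table> markup and silently skips everything else; a minimal sketch with a made-up inline HTML snippet (wrapped in StringIO, since newer pandas versions deprecate passing literal HTML strings directly):

```python
from io import StringIO
import pandas as pd

# made-up HTML: one <table> and one <pre> holding similar-looking data
html = """
<table><tr><th>year</th><th>jan</th></tr>
<tr><td>1948</td><td>0.878</td></tr></table>
<pre>1948  0.878</pre>
"""

tables = pd.read_html(StringIO(html))  # returns a list of DataFrames
# only the <table> is found; the <pre> block is invisible to read_html
```

Here `tables` holds exactly one DataFrame, from the `<table>`; the `<pre>` content never appears.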

On inspecting the HTML content of the given URL, it is evident that the data is contained in a <pre> tag.

Use something like requests to get the page content, and BeautifulSoup4 to parse the HTML page contents (with an appropriate parsing engine, either lxml or html5lib). Then pull out the content of the <pre> tag, splitting on newlines, slicing to ignore unwanted lines, and then splitting on whitespace.


Minimal working code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries'
res = requests.get(url)

# get the text from the 'pre' tag, split it on newlines
# slice off 1 head and 5 tail rows
# (inspect the contents of 'soup.find('pre').text' to determine correct values)
soup = BeautifulSoup(res.content, "html5lib")
data = soup.find('pre').text.split("\n")[1:-5]

df = pd.DataFrame([row.split() for row in data]).apply(pd.to_numeric)
df = df.set_index(df.iloc[:,0])

results in

>>> print(df.head(5))
        0      1      2      3      4      5      6      7      8      9      10     11     12
0
1948  1948  0.878  0.779  0.851  0.393  0.461  0.747  0.867  0.539 -0.106  0.045  0.819  1.506
1949  1949  0.386  1.197  1.154  1.054  0.358  0.645  0.643  0.477  0.128 -0.091  1.500  0.390
1950  1950  0.674  0.973  1.640  0.821  0.572  1.002  0.635  0.196 -0.020  0.268  0.844  1.045
1951  1951  1.524  0.698  0.971  0.790  0.789  0.587  0.682  0.238  0.256  0.035  0.906  1.268
1952  1952  1.524  1.510  1.353  0.705  0.710  1.188  0.412  0.432 -0.091  0.415  0.443  1.509

and

>>> print(df.dtypes)
0       int64
1     float64
2     float64
...
12    float64
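As a follow-up, the year ends up both in the index and in column 0. One way to tidy that is sketched below; to keep the example self-contained it rebuilds a small frame from the rows shown above rather than re-fetching the page (the month labels are an assumption about what the columns represent):

```python
import pandas as pd

# rebuild a small frame in the same shape as 'df' above
# (year in column 0, twelve monthly values in columns 1-12)
rows = [
    [1948, 0.878, 0.779, 0.851, 0.393, 0.461, 0.747, 0.867, 0.539, -0.106, 0.045, 0.819, 1.506],
    [1949, 0.386, 1.197, 1.154, 1.054, 0.358, 0.645, 0.643, 0.477, 0.128, -0.091, 1.500, 0.390],
]
df = pd.DataFrame(rows).set_index(0, drop=False)

# drop the duplicated year column and label the remaining twelve by month
tidy = df.drop(columns=0)
tidy.columns = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
tidy.index.name = "Year"
```

After this, `tidy.loc[1948, "Jan"]` gives the January 1948 value directly.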

This answer is a good starting point for what you're trying to accomplish.


1 Comment

Thank you, your answer above gave me the starting point. I used s = requests.get(url).content for the page content and soup = BeautifulSoup(s, "html.parser"), and it works.
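For reference, the commenter's variant with the stdlib html.parser looks roughly like this; an inline HTML snippet stands in here for the live requests.get(url).content, and the slice values are illustrative:

```python
from bs4 import BeautifulSoup

# stand-in for s = requests.get(url).content
s = b"<html><body><pre>header line\n1948  0.878\n1949  0.386\n</pre></body></html>"

soup = BeautifulSoup(s, "html.parser")  # stdlib parser, no lxml/html5lib needed
# drop the header line and the trailing empty string from the final newline
data = soup.find("pre").text.split("\n")[1:-1]
```

The only change from the answer is the parser name; html.parser ships with Python, at the cost of being somewhat less lenient than html5lib on malformed markup.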

It's because the first URL points directly to a dataset stored as a .data file, while the second URL points to a web page (which is made up of HTML, CSS, JSON, etc.). pd.read_csv only works on raw delimited text, such as a .csv file, or evidently a .data file, since that worked for you.


If you can find a link to the actual .data or .csv file on that website, you will be able to parse it without a problem. Since it's a .gov website, a clean file format is probably available.


If you cannot, and you still need this data, you will have to scrape the website (for example with selenium), store the results as DataFrames, and possibly preprocess them into the expected shape.
