1

I have csv data that looks like this and I'm trying to read it into a pandas df and I've tired all sorts of combinations given the ample documentation online - I've tried things like:

pd.read_csv("https://www.nwrfc.noaa.gov/natural/nat_norm_text.cgi?id=TDAO3.csv", delimiter=',', skiprows=0, low_memory=False)

and I get this error -

ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 989

Or, like this but get an empty dataframe:

pd.read_csv('https://www.nwrfc.noaa.gov/natural/nat_norm_text.cgi?id=TDAO3.csv', skiprows=2, 
skipfooter=3,index_col=[0], header=None,
             engine='python', # c engine doesn't have skipfooter
             sep='delimiter')
Out[31]: 
Empty DataFrame
Columns: []
Index: []

The first 10 lines of the csv file look like this:

# Water Supply Monthly Volumes for COLUMBIA - THE DALLES DAM (TDAO3) 
# Volumes are in KAF 
ID,Calendar Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
TDAO3,1948,,,,,,,,,,6866.8,4307.04,4379.38
TDAO3,1949,3546.71,4615.1,8513.31,15020.45,35251.67,21985.99,11226.06,6966.73,4727.37,4406.29,5266.74,5595.91
TDAO3,1950,4353.86,5540.21,9696.27,12854.81,23359.51,39246.78,23393.23,9676.77,5729.74,6990.31,8300.03,8779.57
TDAO3,1951,8032.32,10295.98,7948.59,16144.8,36000.88,28334.09,19735.49,9308.15,6546.95,8907.1,6461.14,6425.76
TDAO3,1952,4671,6222.25,6551.62,18678.3,34866.91,27120.65,15994.18,7907.55,4810.39,3954.32,3259.29,3231.49
TDAO3,1953,7839.72,7870.96,6527.74,9474.66,23384.47,32668.32,17422.63,8655.16,5220.04,5130.46,5183.5,5915.14
TDAO3,1954,5197.51,5967.07,6718.36,10813.69,29190.37,32673.26,29624.38,13456.13,9165.78,5440.92,5732.22,4973.53

thank you,

6
  • always put full error message (starting at word "Traceback") in question (not in comments) as text (not screenshot, not link to external portal). There are other useful information. Commented Feb 26, 2022 at 0:58
  • code works for me if I use first version with skiprows=2 Commented Feb 26, 2022 at 1:00
  • when I run second version without sep='delimiter' then I see some data but they different. Commented Feb 26, 2022 at 1:02
  • When I check again first version then I see <br> in my data - it seems it is not pure CSV but HTML displaying data. And this can make problem. maybe it needs sep='<br>' Commented Feb 26, 2022 at 1:02
  • yeah, i end up with a pandas of shape (0, 989) and I'm not sure what to do with that. Commented Feb 26, 2022 at 1:05

1 Answer 1

1

It is not link directly to file CSV but to page which displays it as HTML using tags <pre>, <br>, etc. and this makes problem.

But you can use requests to download it as text.

Later you can use standard string-functions to get text between <pre> and </pre> and replace <br> with '\n' - and you will have text with correct CSV.

And later you can use io.StringIO to create file in memory - to load it with pd.read_csv() without saving on disk.

import pandas as pd
import requests
import io

url = "https://www.nwrfc.noaa.gov/natural/nat_norm_text.cgi?id=TDAO3.csv"

response = requests.get(url)

start = response.text.find('<pre>') + len('<pre>')
end   = response.text.find('</pre>')

pre = response.text[start:end]

text = pre.replace('<br>', '\n')

buf = io.StringIO(text)  # file-like object in memory

df = pd.read_csv(buf, skiprows=2, low_memory=False)

print(df.to_string())

Result

      ID  Calendar Year       Jan       Feb       Mar       Apr       May       Jun       Jul       Aug      Sep       Oct      Nov       Dec
0   TDAO3           1948       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN      NaN   6866.80  4307.04   4379.38
1   TDAO3           1949   3546.71   4615.10   8513.31  15020.45  35251.67  21985.99  11226.06   6966.73  4727.37   4406.29  5266.74   5595.91
2   TDAO3           1950   4353.86   5540.21   9696.27  12854.81  23359.51  39246.78  23393.23   9676.77  5729.74   6990.31  8300.03   8779.57
3   TDAO3           1951   8032.32  10295.98   7948.59  16144.80  36000.88  28334.09  19735.49   9308.15  6546.95   8907.10  6461.14   6425.76
4   TDAO3           1952   4671.00   6222.25   6551.62  18678.30  34866.91  27120.65  15994.18   7907.55  4810.39   3954.32  3259.29   3231.49
5   TDAO3           1953   7839.72   7870.96   6527.74   9474.66  23384.47  32668.32  17422.63   8655.16  5220.04   5130.46  5183.50   5915.14
6   TDAO3           1954   5197.51   5967.07   6718.36  10813.69  29190.37  32673.26  29624.38  13456.13  9165.78   5440.92  5732.22   4973.53
7   TDAO3           1955   4124.26   3570.41   3843.46   7993.82  18505.47  31619.54  20408.54   8922.94  4983.31   5842.70  6982.45   9076.44
8   TDAO3           1956   8079.70   5366.62   8818.69  19754.46  40600.06  40447.34  19846.89   9726.93  5503.69   5446.20  4988.98   6006.80
9   TDAO3           1957   3940.08   4411.33   9155.00  12271.77  40111.86  27864.70  11585.75   6795.70  4613.31   4767.38  4087.55   4789.04
10  TDAO3           1958   4838.12   8246.89   7303.03  13902.66  33958.88  26239.62  12516.52   6898.78  4968.03   5198.19  6662.24   7616.43

... rest ...
Sign up to request clarification or add additional context in comments.

2 Comments

if they would use <table> to display data then you could read it with pd.read_html(url) and it would convert all <table> to DataFrame
this is amazing- thank you very much!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.