0

I'm trying to read data from https://download.bls.gov/pub/time.series/bp/bp.measure using pandas, like this:

import pandas as pd

url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t')

However, I just need to get the dataset with the two columns: measure_code and measure_text. As this dataset as a title BP measure I also tried to read it like:

url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t', skiprows=1)

But in this case it returns a dataset with just one column and I'm not being able to slipt it:

>>> df.columns
Index([' measure_code         measure_text'], dtype='object')

Any suggestion/idea on a better approach to get this dataset?

1 Answer 1

1

It's definitely possible, but the format has a few quirks.

  1. As you noted, the column headers start on line 2, so you need skiprows=1.
  2. The file is space-separated, not tab-separated.
  3. Column values are continued across multiple lines.

Issues 1 and 2 can be fixed using skiprows and sep. Problem 3 is harder, and requires you to preprocess the file a little. For that reason, I used a slightly more flexible way of fetching the file, using the requests library. Once I have the file, I use regular expressions to fix problem 3, and give the file back to pandas.

Here's the code:

import requests
import re
import io
import pandas as pd

url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'

# Get the URL, convert the document from DOS to Unix linebreaks
measure_codes = requests.get(url) \
    .text \
    .replace("\r\n", "\n")

# If there's a linebreak, followed by at least 7 spaces, combine it with
# previous line
measure_codes = re.sub("\n {7,}", " ", measure_codes)

# Convert the string to a file-like object
measure_codes = io.BytesIO(measure_codes.encode('utf-8'))

# Read in file, interpreting 4 spaces or more as a delimiter.
# Using a regex like this requires using the slower Python engine.
# Use skiprows=1 to skip the header
# Use dtype="str" to avoid converting measure code to integer.
df = pd.read_csv(measure_codes, engine="python", sep=" {4,}", skiprows=1, dtype="str")

print(df)
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.