-1

I have a poorly structured dataframe that was generated by reading tables in directly from a pdf.

I am trying to manipulate some of the data before putting it into visualization tools.

A key transformation I am trying to make is to extract a column header and use it as a row label. Here is an example of the kind of dataframe I am working with:

data = {'Col1': ['Alabama', 'nan', 'nan', 'nan', 'Wyoming', 'nan', 'nan', 'nan'],
        'Col2': ['nan', 1, 2, 3, 'nan', 1, 2, 3]}

df = pd.DataFrame(data)

The resulting dataframe looks a bit like this:

    Col1    Col2
0   AL  nan
1   nan 1
2   nan 2
3   nan 3
4   WY  nan
5   nan 1
6   nan 2
7   nan 3

Whereby the entries in Col 1 are mostly nan except for those on row 0 (AL) and row 4 (WY). These were effectively subheaders in the table in the pdf.

I am trying to write a code that takes the last valid value in Col1 (e.g., AL) and then fills the remaining rows below it until it encounters the next valid value (e.g., WY).

Correct output would look something like this:

    Col1    Col2
0   AL  nan
1   AL  1
2   AL  2
3   AL  3
4   WY  nan
5   WY  1
6   WY  2
7   WY  3

I am somewhat at a loss for how to proceed here and welcome any advise on where to start out.

2

1 Answer 1

2

Do this:

df.Col1.fillna(method='ffill', inplace=True)
Sign up to request clarification or add additional context in comments.

3 Comments

Probably need to replace "nan" with np.nan before.
If the answer is "use this function that's already part of your library exactly the way the docs show you how to", then the question is generally a duplicate. In such situations, please look for the duplicate and flag/vote to close instead of adding an answer
This works great. I actually replaced the 'nan' with None and worked great. Thank you!

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.