Test data file

I am working with a dataframe that I created from a CSV file. The data has header rows scattered throughout, each of which describes the rows below it, up to the next header row.

The data looks something like this.

2001|     |colour |Price | Quantity sold
Shoes|
Blank  | High heal Shoes| red |£22|44
Blank  | Low heal Shoes|red |£22|44
Slippers|
Blank  | High heal Slippers| red |£22|44
Blank  | High heal Slippers| blue |£22|44
Blank  | Low heal Slippers| red |£22|44
2002|   |colour |Price | Quantity sold
Shoes|
Blank  | High heal Shoes| red |£22|44
Blank  | Low heal Shoes|red |£22|44
Slippers|
Blank  | High heal Slippers| red |£22|44
Blank  | High heal Slippers| blue |£22|44
Blank  | Low heal Slippers| red |£22|44

What type of structure is this?

I need to read through this dataframe and get all the data on a particular item (say Slippers) for each year taken from the header rows (so 2001, 2002 and so on). Even adding a column with the corresponding year next to each data row would help.

I would appreciate some help on how to do that.

1 Answer

Use:

import pandas as pd

df = pd.read_csv('test.csv')

# name of the first column (here '2001', the first year header)
col = df.columns[0]

# forward fill the category names down onto the data rows below them
df[col] = df[col].ffill()
# convert the first column to numeric; non-year values become NaN
num = pd.to_numeric(df[col], errors='coerce')
# forward fill the years; rows above the first in-data year header get the first column name
df['Year'] = num.ffill().fillna(col)
# rename columns
df = df.rename(columns={col:'Shoes', 'Unnamed: 1':'Names'})
# keep only data rows: drop year header rows and category header rows
df = df[num.isnull() & df['colour'].notnull()].reset_index(drop=True)
print(df)
           Shoes       Names  colour price Quantity sold  Year
0   Type A shoes  Sub type A     red    22             5  2001
1   Type A shoes  Sub type A   green    11             5  2001
2   Type A shoes  Sub type A  yellow    44             5  2001
3   Type A shoes  Sub type B     red    33             5  2001
4   Type A shoes  Sub type B   green    66             5  2001
5   Type A shoes  Sub type B  yellow    22             5  2001
6   Type B shoes  Sub type A     red    11             5  2001
7   Type B shoes  Sub type A   green    44             5  2001
8   Type B shoes  Sub type A  yellow    33             5  2001
9   Type B shoes  Sub type B     red    66             5  2001
10  Type B shoes  Sub type B   green    21             5  2001
11  Type B shoes  Sub type B  yellow    22             5  2001
12  Type A shoes  Sub type A     red    22             5  2002
13  Type A shoes  Sub type A   green    11             5  2002
14  Type A shoes  Sub type A  yellow    44             5  2002
15  Type A shoes  Sub type B     red    33             5  2002
16  Type A shoes  Sub type B   green    66             5  2002
17  Type A shoes  Sub type B  yellow    22             5  2002
18  Type B shoes  Sub type A     red    11             5  2002
19  Type B shoes  Sub type A   green    44             5  2002
20  Type B shoes  Sub type A  yellow    33             5  2002
21  Type B shoes  Sub type B     red    66             5  2002
22  Type B shoes  Sub type B   green    21             5  2002
23  Type B shoes  Sub type B  yellow    22             5  2002
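
To see why the two forward fills are needed, here is a minimal sketch on a hand-made Series. The values are hypothetical stand-ins for the first column of the file, and the fallback year 2001 is assumed to come from the column name, as in the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical first column: category headers, blanks on data rows, one year header
s = pd.Series(["Shoes", np.nan, np.nan, "2002", "Slippers", np.nan])

# Step 1: carry each header value down onto the blank rows below it
filled = s.ffill()
# Shoes, Shoes, Shoes, 2002, Slippers, Slippers

# Step 2: only the year header survives the numeric conversion
num = pd.to_numeric(filled, errors="coerce")
# NaN, NaN, NaN, 2002.0, NaN, NaN

# Step 3: carry the year down; rows above the first year header take the fallback
year = num.ffill().fillna(2001)
# 2001.0, 2001.0, 2001.0, 2002.0, 2002.0, 2002.0
print(pd.DataFrame({"filled": filled, "num": num, "year": year}))
```

The mask num.isnull() then separates the year header rows (where num is a number) from everything else.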

EDIT:

df = pd.read_csv('testV2.csv', sep='\t')
# print(df)

# name of the first column (here '2001', the first year header)
col = df.columns[0]

# forward fill the category names down onto the data rows below them
df[col] = df[col].ffill()
# convert the first column to numeric; non-year values become NaN
num = pd.to_numeric(df[col], errors='coerce')
# forward fill the years; rows above the first in-data year header get the first column name
df['Year'] = num.ffill().fillna(col)
# rename columns
df = df.rename(columns={col:'Top Category', 'Unnamed: 1':'Names'})
# keep only data rows: drop year header rows and the repeated 'Top Category' header rows
df = df[num.isnull() & (df['Top Category'] != 'Top Category')].reset_index(drop=True)

print(df)

   Top Category   Names Colour Price Sold  Year
0        Item 1  Type 1      -     2  NaN  2001
1        Item 2  Type 1      -     2  NaN  2001
2        Item 3  Type 1    red     2    5  2001
3        Item 3  Type 2   blue     2    5  2001
4        Item 3  Type 3  green     2    5  2001
5        item 4  Type 1    red     2    5  2001
6        item 4  Type 2   blue     3  NaN  2001
7        item 4  Type 3  green     3  NaN  2001
8        Item 1  Type 1      -     3  NaN  2002
9        Item 2  Type 1      -     3  NaN  2002
10       Item 3  Type 1    red     3    5  2002
11       Item 3  Type 2   blue     3    5  2002
12       Item 3  Type 3  green     3    5  2002
13        Item4  Type 1    red     3  NaN  2002
14        Item4  Type 2   blue     3  NaN  2002
15        Item4  Type 3  green     3  NaN  2002
16       Item 1  Type 1      -     3  NaN  2003
17       Item 2  Type 1      -     3  NaN  2003
18       Item 3  Type 1    red     3    5  2003
19       Item 3  Type 2   blue     3    5  2003
20       Item 3  Type 3  green     3    5  2003
21        Item4  Type 1    red     3  NaN  2003
22        Item4  Type 2   blue     3  NaN  2003
23        Item4  Type 3  green     3  NaN  2003
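
Once every data row carries its category and year, pulling out one item per year is a plain boolean filter. The sketch below is self-contained and runs on a small made-up comma-separated sample, not the real file. The column name Category and the cast of Year to int are my own choices for illustration, not part of the code above.

```python
import io
import pandas as pd

# Made-up sample in the same shape as the question's data (assumed comma-separated)
raw = """2001,,colour,Price,Quantity sold
Shoes,,,,
,High heel Shoes,red,22,44
,Low heel Shoes,red,22,44
Slippers,,,,
,High heel Slippers,red,22,44
,High heel Slippers,blue,22,44
2002,,colour,Price,Quantity sold
Shoes,,,,
,High heel Shoes,red,23,40
Slippers,,,,
,Low heel Slippers,red,24,41
"""

df = pd.read_csv(io.StringIO(raw))
col = df.columns[0]                      # '2001', the first year header

df[col] = df[col].ffill()                # categories filled down onto data rows
num = pd.to_numeric(df[col], errors="coerce")
# Assumption: the first column name is a year, so it can seed the first group
df["Year"] = num.ffill().fillna(int(col)).astype(int)
df = df.rename(columns={col: "Category", "Unnamed: 1": "Names"})

# Drop year header rows (num is a number) and category header rows (no colour)
df = df[num.isna() & df["colour"].notna()].reset_index(drop=True)

# All Slippers rows, each labelled with its year
slippers = df[df["Category"] == "Slippers"]
print(slippers)
```

From here, something like slippers.groupby("Year") would give one group per year if per-year processing is needed.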

15 Comments

Thanks for the reply. I don't understand what is happening on some lines. I hope you don't mind me asking some questions. What does this line do? df[col] = df[col].str.strip().replace('Blank', np.nan).ffill() And what does forward fill do in particular?
No problem. But if my solution does not work, maybe the problem is with the real format of your file, so is it possible to share your sample file with the real separators and real blank values?
ffill() replaces each NaN with the last known non-NaN value, so 1,2,NaN,NaN,4,7,NaN becomes 1,2,2,2,4,7,7
Thank you. Here is a link to a demo file with formatted data.
Thanks for all your help. I will check the code later.