
I am scraping multiple tables from multiple pages of a website. The problem is that a row is missing from the first table. This is how the combined dataframe looks:


              mar2018  feb2018  jan2018  dec2017  nov2017                  oct2017  sep2017  aug2017
balls faced       345      561      295        0      645    balls faced       200       58        0
runs scored       156      281      183        0      389    runs scored        50       20        0
strike rate      52.3     42.6     61.1        0     52.2    strike rate        25       34        0
dot balls         223      387      173        0      476    dot balls         125       34        0
fours               8       12       19        0       22    sixes               2        0        0
doubles            20       38       16        0       36    fours               4        2        0
notout              2        0        0        0        4    doubles             2        0        0
                                                              notout              4        2        0

The row 'sixes' is missing from the first table but present in the subsequent ones. So I am trying to move the rows from 'fours' down to 'notout' one position lower, leaving NaNs in row 4 for the first five columns, mar2018 through nov2017.

I tried the following code, but it isn't working; it moves the values horizontally, not vertically downward.

df.iloc[4][0:6] = df.iloc[4][0:6].shift(1)

and also

df2 =  pd.DataFrame(index = 4)
df = pd.concat([df.iloc[:], df2, df.iloc[4:]]).reset_index(drop=True)

did not work.

df['mar2018'] = df['mar2018'].shift(1)

But this moves all the values of that column down by 1 row.

So, I was wondering: is it possible to shift the rows of specific columns down, starting from a specific index?

1 Answer

I think you need to reindex both DataFrames by the union of all index values, built with numpy.union1d:

import numpy as np

idx = np.union1d(df1.index, df2.index)   # union of both sets of row labels

df1 = df1.reindex(idx)   # labels missing from df1 ('sixes') become NaN rows
df2 = df2.reindex(idx)

print(df1)
             mar2018  feb2018  jan2018  dec2017  nov2017
balls faced    345.0    561.0    295.0      0.0    645.0
dot balls      223.0    387.0    173.0      0.0    476.0
doubles         20.0     38.0     16.0      0.0     36.0
fours            8.0     12.0     19.0      0.0     22.0
notout           2.0      0.0      0.0      0.0      4.0
runs scored    156.0    281.0    183.0      0.0    389.0
sixes            NaN      NaN      NaN      NaN      NaN
strike rate     52.3     42.6     61.1      0.0     52.2

print(df2)
             oct2017  sep2017  aug2017
balls faced      200       58        0
dot balls        125       34        0
doubles            2        0        0
fours              4        2        0
notout             4        2        0
runs scored       50       20        0
sixes              2        0        0
strike rate       25       34        0
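
Once both frames share the same index, they can be joined side by side, and the missing 'sixes' values in df1 simply stay NaN. A minimal sketch using pd.concat (assuming the single combined table from the question is the goal):

combined = pd.concat([df1, df2], axis=1)   # 8 rows x 8 columns, aligned on the stat labels
print(combined)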

If you have multiple DataFrames in a list, use functools.reduce to build the union and a list comprehension to reindex:

from functools import reduce

dfs = [df1, df2]
idx = reduce(np.union1d, [x.index for x in dfs])   # union across all frames

dfs1 = [df.reindex(idx) for df in dfs]

print(dfs1)
[             mar2018  feb2018  jan2018  dec2017  nov2017
balls faced    345.0    561.0    295.0      0.0    645.0
dot balls      223.0    387.0    173.0      0.0    476.0
doubles         20.0     38.0     16.0      0.0     36.0
fours            8.0     12.0     19.0      0.0     22.0
notout           2.0      0.0      0.0      0.0      4.0
runs scored    156.0    281.0    183.0      0.0    389.0
sixes            NaN      NaN      NaN      NaN      NaN
strike rate     52.3     42.6     61.1      0.0     52.2,      oct2017  sep2017  aug2017
balls faced      200       58        0
dot balls        125       34        0
doubles            2        0        0
fours              4        2        0
notout             4        2        0
runs scored       50       20        0
sixes              2        0        0
strike rate       25       34        0]

Comments

Thanks a lot @jezrael. The np.union1d method works when I print the tables from each page to separate CSVs and then join them. My approach while scraping was to concat all the tables into one dataframe and then clean that single dataframe. I just added a picture of the CSV file and edited the question so that it's clear. Is there any way to shift the cells within the same dataframe? (see the sketch after the comments)
@Johny I am confused. Your structure is weird; do you create it by concat? Is the data confidential? Is the webpage confidential? In my opinion it is best to create a list of all DataFrames first and then apply my solution.
@Jhonny - OK, thank you. But one thing: what is your code for extracting the tables? Do you use pd.read_html, BeautifulSoup, or something else?
I wasn't able to use pd.read_html as the URL is static across multiple pages. I am using BeautifulSoup - table = BeautifulSoup(url, 'html5lib').find_all('table')[4], then for tr in table: to extract each tr, and for td in tr: to extract each td.
@Jhonny - thank you. I create one DataFrame with table = soup.find_all('table')[4] and df = pd.read_html(str(table), header=0, index_col=0)[0], but how is it possible to extract more tables if the URL is the same?
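
As a follow-up to the comment above about shifting cells within the same DataFrame: this is a minimal sketch, assuming the scraped tables were concatenated side by side on a default RangeIndex, so the combined frame df has 8 rows, the five columns mar2018 to nov2017 come first, and their last row is already NaN padding:

cols = df.columns[:5]                            # mar2018 .. nov2017, the columns missing 'sixes'
df.loc[4:, cols] = df.loc[4:, cols].shift(1)     # rows 4+ move down one; row 4 becomes NaN

After the shift, the old 'fours' through 'notout' values occupy rows 5 to 7 and row 4 is free for the missing 'sixes' entries. The label-based reindex approach above is still more robust, since it does not depend on row positions.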