I am using a table detection module to detect the table and extract the content from it. I am using a pandas data frame to order the data in the table structure.

Scenario - 1.

I need to merge column 4 (Amount) with column 5 (empty header).

enter image description here

The expected output looks like this:

enter image description here

Scenario - 2

In this case the price and amount values were extracted into other columns; I need to move them back to their original columns.

enter image description here

The expected result is: enter image description here

NOTE: All values are dynamic; they will change for other types of images.

  • Are the column names necessarily 0, 1, 2, 3, and not 'Article no.', 'Description', ...? Commented Dec 16, 2019 at 8:09
  • I am facing this issue with more images. For other images, the number of columns will be different, and I run into the same DataFrame issue with them. Commented Dec 16, 2019 at 8:32
  • Answer was edited, can you check? Commented Dec 16, 2019 at 8:50
  • Sorry, but I didn't get the expected result. The output stays the same as the original format. Commented Dec 16, 2019 at 9:52

2 Answers

One idea is to combine the two columns in every row except the first, converting the values to strings, and to remove the merged column with DataFrame.pop:

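import numpy as np

#append column 4 to column 5 (and 7 to 8) in all rows after the header row
df.loc[df.index[1:], 5] = df.loc[df.index[1:], 5].astype(str) + df.pop(4).iloc[1:]
df.loc[df.index[1:], 8] = df.loc[df.index[1:], 8].astype(str) + df.pop(7).iloc[1:]
#renumber the remaining columns
df.columns = np.arange(len(df.columns))
print (df)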
             0                   1         2         3       4      5  \
0  Article no.         Description   Content  Quantity   Price    VAT   
1        18001  Thai Mineral water  28X0,33L       400  6,160E  O 0/0   

          6  
0     Total  
1  2464,00E  

Or, if the first row holds an empty string in the columns being merged, use:

df[5] = df[5].astype(str) + df.pop(4)
df[8] = df[8].astype(str) + df.pop(7)
df.columns = np.arange(len(df.columns))
print (df)
             0                   1         2         3       4      5  \
0  Article no.         Description   Content  Quantity   Price    VAT   
1        18001  Thai Mineral water  28X0,33L       400  6,160E  O 0/0   

          6  
0     Total  
1  2464,00E  

Finally, if necessary, convert the first row to column names:

df.columns = df.iloc[0]
df = df.rename_axis(None, axis=1).iloc[1:].reset_index(drop=True)
print (df)
  Article no.         Description   Content Quantity   Price    VAT     Total
0       18001  Thai Mineral water  28X0,33L      400  6,160E  O 0/0  2464,00E

A more general solution is to create duplicated column names and use groupby with sum:

#convert missing values to empty strings
df.iloc[0] = df.iloc[0].fillna('')

#convert column names to a Series
s =  df.columns.to_series()

#if the first row holds an empty string, replace the column name by the next one
df.columns = s.where(df.iloc[0].ne('')).bfill()
#sum concatenates the strings of columns that share a name
#(note: groupby(..., axis=1) is deprecated since pandas 2.1)
df = df.groupby(df.columns, axis=1).sum()
#set default column names
df.columns = np.arange(len(df.columns))
print (df)
             0                   1         2         3       4      5  \
0  Article no.         Description   Content  Quantity   Price    VAT   
1        18001  Thai Mineral water  28X0,33L       400  E6,160  O 0/0   

          6  
0     Total  
1  E2464,00  
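
The same grouping idea can be sketched without `axis=1` by transposing, grouping the rows, and transposing back. This is only an illustration under assumptions: the sample frame is made up to mirror the first scenario, every cell is a string, and the names `labels` and `merged` are hypothetical.

```python
import pandas as pd

# Assumed sample data mirroring scenario 1: columns 4 and 7 hold only the
# currency symbol and have an empty header cell in row 0.
df = pd.DataFrame(
    [['Article no.', 'Description', 'Content', 'Quantity', '', 'Price', 'VAT', '', 'Total'],
     ['18001', 'Thai Mineral water', '28X0,33L', '400', '€', '6,160', 'O °/o', '€', '2464,00']]
)

# One label per column: a column whose header cell is empty takes the label
# of the next column that has a header.
labels = df.columns.to_series().where(df.iloc[0].ne('')).bfill()

# Transpose so the original columns become rows, group rows that share a
# label, join their strings, then transpose back.
merged = df.T.groupby(labels.to_numpy()).agg(''.join).T

# Reset to default column names.
merged.columns = range(merged.shape[1])
print(merged)
```

Joining with `''.join` instead of `sum` makes the string concatenation explicit, but it requires every cell to be a string, so numeric cells would need `astype(str)` first.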
6 Comments

The currency symbol and header name are dynamic. It will change for other types of images.
Yes, but the result does not change.
@BHARATH - Is it not working for any data? Can you be more specific?
Your 3rd answer works for the first scenario but in the second scenario, it does not work. Other answers do not work for both scenarios.
@BHARATH - What is the second scenario?

Another possible solution:

import numpy as np
import pandas as pd
import unicodedata

# Unicode names of the currency symbols to look for
currencies = ['DOLLAR SIGN','EURO SIGN','POUND SIGN','RUPEE SIGN']
# list of a few currencies: https://www.fileformat.info/info/unicode/category/Sc/list.htm

pos = []
bag = []
for val in df.values:  # val is an ndarray holding one row

    s = np.array_split(val, len(df.columns))  # one single-element array per cell
    bag.append(s)

for cur in currencies:

    symbol = np.where(np.array(bag, dtype=object) == np.array([unicodedata.lookup(cur)]))

    if symbol[0].size:  # the symbol occurs at least once
        pos.append(symbol)

In each entry of pos, the first array holds the row indices and the second holds the column indices:

for p in pos:

    for r,c in zip(p[0],p[1]):
        ncol = c+1
        bag[r][ncol] = bag[r][c]+bag[r][ncol].astype(str)  # prepend the symbol to the cell on its right

# convert bag to a DataFrame
df2 = pd.DataFrame(bag)

to_drop = []
for cur in currencies:

    d = unicodedata.lookup(cur)

    for col in df2.columns:

        # a column that still holds a bare currency symbol is no longer needed
        if d in df2[col].tolist():
            if col not in to_drop:
                to_drop.append(col)

# drop undesired columns
df2 = df2.drop(columns=to_drop)

This is the output for the table in your first screenshot:

    0                     1  ...        6           8
0  [Article no.]         [Description]  ...    [VAT]     [Total]
1        [18001]  [Thai Mineral water]  ...  [O °/o]  [€2464,00]

[2 rows x 7 columns]

The DataFrame used as input:

df = pd.DataFrame([['Article no.','Description','Content','Quantity','','Price','VAT','','Total'],
                [18001,'Thai Mineral water','28X0,33L','400','€','6,160','O °/o','€','2464,00']]
                ,columns=[0,1,2,3,4,5,6,7,8])   

And this is the output for the table in your second screenshot:

     0       1           2        3           5
0  [Description]      []  [Quantity]  [Price]          []
1      [Gourmet]  [AXML]       [781]   [9,00]  [$7029,00]
2        [Taste]  [BXML]       [398]   [8,90]  [$3542,20]

The DataFrame used as input:

df = pd.DataFrame([['Description','','Quantity','Price','Amount/GBP',''],
                ['Gourmet','AXML','781','9,00','$','7029,00'],
                ['Taste','BXML','398','8,90','$','3542,20']]
                ,columns=[0,1,2,3,4,5])
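
A fixed list of currency names has to be extended for every new symbol. As a rough sketch, the check could instead rely on the Unicode category `Sc` (currency symbols) through `unicodedata.category`. The helper names `is_currency_symbol` and `merge_currency_columns` are hypothetical. The sketch assumes the first row is the header row, and that a column should be merged into its right-hand neighbour when every cell below the header is a single currency symbol.

```python
import unicodedata
import pandas as pd

def is_currency_symbol(value):
    """True if value is a single character in Unicode category Sc."""
    return (isinstance(value, str) and len(value) == 1
            and unicodedata.category(value) == 'Sc')

def merge_currency_columns(df):
    """Merge each symbol-only column into the column on its right.

    Assumption: row 0 is the header row. Header cells are concatenated too,
    so a header sitting above the symbol column moves to the merged column.
    """
    out = df.fillna('').astype(str)
    cols = list(out.columns)
    drop = []
    for left, right in zip(cols, cols[1:]):
        body = out[left].iloc[1:]  # cells below the header row
        if len(body) and body.map(is_currency_symbol).all():
            out[right] = out[left] + out[right]
            drop.append(left)
    out = out.drop(columns=drop)
    out.columns = range(out.shape[1])  # default column names
    return out

# Sample frame taken from the second scenario above.
scenario2 = pd.DataFrame(
    [['Description', '', 'Quantity', 'Price', 'Amount/GBP', ''],
     ['Gourmet', 'AXML', '781', '9,00', '$', '7029,00'],
     ['Taste', 'BXML', '398', '8,90', '$', '3542,20']]
)
result = merge_currency_columns(scenario2)
print(result)
```

Because header cells are concatenated along with the values, an empty header above the symbol column (first scenario) and a header sitting above the symbol column (second scenario) are handled by the same rule. Two adjacent symbol-only columns are not handled by this sketch.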

7 Comments

The currency symbol and header name are dynamic. It will change for other types of images.
With this euro = np.where(s == np.array(['€'])) you will find euros positions. And you will get headers name with this: header = list(df.loc[0])
Sorry for the confusion. The extracted data changes from image to image, so the currency can change from euro to another country's currency and should not be hard-coded. The same goes for the table header names. All the fields are dynamic.
Hi, Please check my updated question. All the values are changing.
@BHARATH check out. I've changed the code based on your comments.