
I am using tabula to read tables from PDF files.

tables = tabula.read_pdf(file, pages="all")

This works fine. Now tables is a list of dataframes, where each dataframe is a table from the PDF file.

The rows of each dataframe are indexed 0, 1, 2, 3, and so on. However, the first row of each table is taken as the column names (header) of its dataframe.

Current dataframe:

  Component manufacturer               DMNS
0         Component name               KL32/OOH8
1         Component type               LTE-M/NB-IoT
2       Package markings               <pin 1 marker>\ ksdc 99cdjh
3              Date code               Not discerned
4           Package type               127-pin land grid array (LGA)
5           Package size               26.00 mm × 10.11 mm × 3.05 mm

Desired Dataframe:

        0                                1
0       Component manufacturer           DMNS
1       Component name                   KL32/OOH8
2       Component type                   LTE-M/NB-IoT
3       Package markings                 <pin 1 marker>\ ksdc 99cdjh
4       Date code                        Not discerned
5       Package type                     127-pin land grid array (LGA)
6       Package size                     26.00 mm × 10.11 mm × 3.05 mm

How can I do this transformation?

2 Answers


As the tabula docs on read_pdf state, you can pass pandas_options, and they even give the one you need as an example: {'header': None}. So something like this should do the trick:

tabula.read_pdf(file, pages="all", pandas_options={'header': None})

Edit: Apparently this only works if you set multiple_tables to False, which is not the default. I'd experiment with the options a bit, and if that doesn't give the desired result, here is a post on how to turn the column names into the first row.
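
If the read_pdf options don't help, the header can be moved into the first row afterwards with plain pandas. The following is only a sketch under some assumptions: header_to_first_row is a made-up helper name, and sample is a small hand-built stand-in for one of the dataframes that tabula returns, so no PDF is needed to try it.

```python
import pandas as pd


def header_to_first_row(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame in which df's column labels become row 0.

    Hypothetical helper for illustration; not part of tabula or pandas.
    """
    # One-row frame holding the old column labels under integer column names.
    header = pd.DataFrame([df.columns.tolist()])
    # Copy the body and give it the same integer column names so the two align.
    body = df.copy()
    body.columns = range(body.shape[1])
    # Stack the header on top of the body and renumber the rows from 0.
    return pd.concat([header, body], ignore_index=True)


# Stand-in for one element of the list returned by tabula.read_pdf(...).
sample = pd.DataFrame(
    {
        "Component manufacturer": ["Component name", "Component type"],
        "DMNS": ["KL32/OOH8", "LTE-M/NB-IoT"],
    }
)
tables = [sample]

# Apply the fix to every table in the list.
fixed_tables = [header_to_first_row(t) for t in tables]
print(fixed_tables[0])
```

Since tables is a list, the list comprehension at the end applies the same fix to every table in one go.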



Here's a way to do what your question asks:

df = df.T.reset_index().T.reset_index(drop=True)

Output:

                        0                              1
0  Component manufacturer                           DMNS
1          Component name                      KL32/OOH8
2          Component type                   LTE-M/NB-IoT
3        Package markings    <pin 1 marker>\ ksdc 99cdjh
4               Date code                  Not discerned
5            Package type  127-pin land grid array (LGA)
6            Package size  26.00 mm × 10.11 mm × 3.05 mm

Explanation:

  • Transpose the dataframe so we can use reset_index() to convert the index (i.e., the original column labels) to a new initial column
  • Transpose it again so the new initial column becomes an initial row, and use reset_index() to get a fresh integer index.
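
To make the two steps easier to follow, here is the same one-liner split into its intermediate results. It is a sketch on a small hand-built dataframe (a shortened stand-in for one of the tables from the question); the names step1, result and fixed_tables are only for illustration.

```python
import pandas as pd

# Shortened stand-in for one table: the real first row ended up as the header.
df = pd.DataFrame(
    {
        "Component manufacturer": ["Component name", "Component type"],
        "DMNS": ["KL32/OOH8", "LTE-M/NB-IoT"],
    }
)

# Step 1: transpose, so the column labels become the index, then turn that
# index into an ordinary first column (pandas names it "index").
step1 = df.T.reset_index()

# Step 2: transpose back, so that column becomes the first row, then replace
# the leftover row labels with a fresh 0..n integer index.
result = step1.T.reset_index(drop=True)
print(result)

# The same expression applied to every table in a list, as in the question.
tables = [df]
fixed_tables = [t.T.reset_index().T.reset_index(drop=True) for t in tables]
```

After the second transpose the columns are labelled 0 and 1, because they come from the integer index that reset_index() created in step 1.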
