How to convert pdf into dataframe pandas python and extract values?

Question

I downloaded a PDF file and want to load it into a pandas DataFrame. The next step is to extract the CAS and REACH numbers from that DataFrame.

Could anyone help me with that?

Here is the PDF link (updated): https://msdspds.castrol.com/ussds/amersdsf.nsf/Files/109BFD5F3F227AE58025859100538A55/$File/2620961.pdf

I want the CAS number and REACH number from section 3 in the pdf.

Many thanks Joan

Which numbers do you want? Section 3 does not contain any numbers.
– Joooeey, Commented Jul 10, 2020 at 18:01
tabula is a fairly good Python library for converting tables in a PDF to pandas DataFrames. You could give it a shot.
– Joooeey, Commented Jul 10, 2020 at 18:03
@Joooeey sorry for the late reply, I gave you the wrong PDF link. Here is the new link [msdspds.castrol.com/ussds/amersdsf.nsf/Files/…. Many thanks
– Joan Mok, Commented Jul 10, 2020 at 23:58
I can't access this link: "General Error This safety data sheet is currently not available in , Please contact your local supplier for further information"
– Joooeey, Commented Jul 11, 2020 at 17:46
Actually, one of the gotchas with tabula is that pages=1 is the default, so it only reads the first page. You'll need tabula.read_pdf('A320.pdf', stream=True, pages=2) or pages='all'.
– Joooeey, Commented Jul 12, 2020 at 19:26

Chuk Robertson · Accepted Answer · 2020-09-15 13:47:23Z

I have had this issue with tabula as well. I have found a solution using PyPDF2 along with tabula.

FWIW, I ran this in a Jupyter Notebook on Ubuntu.

The first cell imports everything we need.

# Import modules needed for this project
import tabula as tb
from PyPDF2 import PdfFileReader
import pandas as pd
import glob

This is where we use PyPDF2 to read how many pages the PDF contains. tabula cannot do this, and we need an accurate count for the loop in the next cell, which feeds the PDF to tabula page by page and converts each page to CSV.

# This cell gets a list of pages in the pdf. We cannot rely on reading the file as a whole :(
# We will pass this list into the next cell.

infile = '../PDFs/2620961.pdf'

# Get number of pages from pdf infile
# (PdfFileReader/getNumPages is the PyPDF2 < 3.0 API; newer releases use PdfReader and len(reader.pages))
with open(infile, 'rb') as f:
    pdf = PdfFileReader(f)
    numPages = pdf.getNumPages()

# Get a list of 1-based page numbers to pass into the reader loop
tmpPages = list(range(1, numPages + 1))

print("There are", len(tmpPages), "pages.")

This cell loops over tabula.convert_into, passing each page number (i) to the 'pages=' argument. I could not get tabula.read_pdf to do the same, so this seemed to be my only option.

# This loops over the main pdf file page by page, saving each page as a csv in the /pages directory
# THIS MIGHT TAKE SOME TIME IF THE FILE IS LARGE
print(len(tmpPages)," pages to be converted.") # Here is our list of pages.

# This for loop takes the list of pages in the PDF from the previous cell.
# This loop also converts the PDF into individual CSVs and saves them to /pages
for i in tmpPages:
    print("Converting page: "+str(i))
    tb.convert_into(infile,
                    "../pages/page-"+str(i)+".csv",
                    guess=True,
                    output_format="CSV",
                    stream=True,
                    pages=i,
                    silent=True)
        
print("Done!")
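
As an alternative to the per-page loop, the comment under the question suggests that tabula.read_pdf also takes a pages argument. The following is only a minimal sketch, assuming a tabula-py version in which pages="all" is accepted and the call returns a list of DataFrames, one per detected table. The helper name read_all_pages and its read_pdf parameter are hypothetical; the parameter exists only so the call can be swapped out without Java installed.

```python
def read_all_pages(pdf_path, read_pdf=None):
    """Read every page of a PDF with a single tabula call.

    read_pdf is a hypothetical injection point. By default the real
    tabula.read_pdf is used, which needs tabula-py and a working Java.
    """
    if read_pdf is None:
        import tabula  # imported lazily so the helper can be exercised without Java
        read_pdf = tabula.read_pdf
    # pages="all" asks tabula to scan the whole document instead of only page 1
    return read_pdf(pdf_path, pages="all", stream=True, guess=True)
```

If this works on your file, read_all_pages(infile) would replace both the page count and the CSV round trip, and the returned tables could be combined with pd.concat.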

Finally, we use pandas to read all of the CSVs created in the previous cell and combine the converted PDF pages into one DataFrame.

# This cell takes the CSVs from the previous cell and converts them into one DataFrame
path = r'../pages/' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, names=[0,1,2,3,4,5], index_col=0, header=None)
    li.append(df)

frame = pd.concat(li, ignore_index=False)
frame
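
One thing to watch: glob.glob returns files in arbitrary order, and a plain sorted() would put page-10.csv before page-2.csv. A small sketch of sorting by the page number in the file name, assuming the page-N.csv naming used above; the helper page_number is a hypothetical name, not part of any library.

```python
import re

def page_number(path):
    """Return the integer page number embedded in a name like 'page-12.csv'."""
    match = re.search(r"page-(\d+)\.csv$", path)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return int(match.group(1))

# Example file names in an order glob might plausibly return them
all_files = ["../pages/page-10.csv", "../pages/page-2.csv", "../pages/page-1.csv"]

# Sort numerically by page so the concatenated rows follow the PDF's page order
ordered = sorted(all_files, key=page_number)
print(ordered)
```

In the cell above, this would mean looping over sorted(all_files, key=page_number) instead of all_files.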

From here you can clean up your dataframe.

Here are a few lines of the dataframe. It is very dirty, but I believe the numbers you were looking for are here.

    1   2   3   4   5
0                   
Product/ingredient name     Oral (mg/   Dermal  Inhalation  Inhalation  Inhalation
NaN     kg)     (mg/kg)     (gases)     (vapours)   (dusts
NaN     NaN     NaN     (ppm)   (mg/l)  and mists)
NaN     NaN     NaN     NaN     NaN     (mg/l)
maleic anhydride    500     NaN     NaN     NaN     NaN
Phosphorodithioic acid, mixed O,O-bis   REACH #: 01-2119493628-22   ≤2.4    Skin Irrit. 2, H315     [1] [2]     NaN
(iso-bu and pentyl) esters, zinc salts  EC: 270-608-0   NaN     Eye Dam. 1, H318    NaN     NaN
NaN     CAS: 68457-79-4     NaN     Aquatic Chronic 2, H411     NaN     NaN
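
To pull the CAS and REACH numbers out of cells like these, a regular expression over every cell is one option. This is a sketch only: the two patterns are assumptions based on the "CAS:" and "REACH #:" labels visible in the output above, and extract_ids is a hypothetical helper name.

```python
import re

# Patterns are assumptions derived from the labels in the sample output
CAS_RE = re.compile(r"CAS:\s*(\d{2,7}-\d{2}-\d)\b")
REACH_RE = re.compile(r"REACH\s*#:\s*(\d{2}-\d{10}-\d{2}(?:-\d{4})?)")

def extract_ids(cells):
    """Collect CAS and REACH numbers from an iterable of cell values."""
    found = {"cas": [], "reach": []}
    for cell in cells:
        text = str(cell)  # NaN and numeric cells become harmless strings
        found["cas"].extend(CAS_RE.findall(text))
        found["reach"].extend(REACH_RE.findall(text))
    return found

# Cells copied from the rows shown above, plus a NaN like the ones pandas produces
sample_cells = [
    "Phosphorodithioic acid, mixed O,O-bis",
    "REACH #: 01-2119493628-22",
    "EC: 270-608-0",
    "CAS: 68457-79-4",
    float("nan"),
]
ids = extract_ids(sample_cells)
print(ids)
```

With the frame built earlier, the cells could be supplied as frame.reset_index().to_numpy().ravel(), which flattens the index and all columns into one sequence.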

I cannot use the tb.convert_into function. It gives me the error 'Command '['java', '-Djava.awt.headless=true', '-jar', '/Applications/anaconda3/lib/python3.7/site-packages/tabula/tabula-1.0.3-jar-with-dependencies.jar', '--pages', '1', '--stream', '--guess', '--outfile', '../pages/page-1.csv', '/Users/chueckingmok/Downloads/MSDS_743681 (3) copy.pdf']' returned non-zero exit status 1.', even though I installed tabula again.
The Python packages require Java. Check your Java installation. I have also verified that this works in Jupyter Notebook via Anaconda on Windows 10.
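
A quick way to check the prerequisites from Python before calling tabula is sketched below. It tests whether a java executable is on the PATH and, as a second possible cause that is only an assumption and not confirmed in this thread, whether the output directory exists. find_problems is a hypothetical helper name.

```python
import shutil
from pathlib import Path

def find_problems(output_dir):
    """Return a list of human-readable problems that could make tabula fail."""
    problems = []
    # tabula-py shells out to Java, so the executable must be discoverable
    if shutil.which("java") is None:
        problems.append("no 'java' executable found on PATH")
    # Assumption: writing CSVs into a directory that does not exist may also fail
    if not Path(output_dir).is_dir():
        problems.append(f"output directory does not exist: {output_dir}")
    return problems

print(find_problems("../pages"))
```

An empty list means both checks passed; otherwise the messages point at what to fix first.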
