I have had this issue with tabula as well. I have found a solution using PyPDF2 along with tabula.
Jupyter Notebook on Ubuntu FWIW
First cell imports all the stuff.
# Import modules needed for this project
import tabula as tb
from PyPDF2 import PdfFileReader
import pandas as pd
import glob
This is where we use PyPDF2 to read how many pages the PDF contains. tabula cannot do this on its own, and we need an accurate count to pass to the next loop, which reads the PDF page by page into tabula and converts each page to a CSV.
# This cell gets a list of pages in the pdf. We cannot rely on reading the file as a whole :(
# We will pass this list into the next cell.
infile = '../PDFs/2620961.pdf'
# Get number of pages from pdf infile
pdf = PdfFileReader(open(infile,'rb'))
numPages = pdf.getNumPages()
# Get a list of pages to pass into the reader loop
tmpPages = []
for i in range(numPages):
    tmpPages.append(i + 1)  # tabula page numbers are 1-based
print("There are", len(tmpPages), "pages.")
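The append loop can also be collapsed into a single `range` call. A minimal sketch with a hypothetical 4-page file (note that PyPDF2 >= 3.0 renamed `PdfFileReader` to `PdfReader`, and the page count there is `len(reader.pages)`):

```python
# Hypothetical page count, e.g. from pdf.getNumPages()
# (or len(PdfReader(infile).pages) on PyPDF2 >= 3.0)
numPages = 4

# tabula's pages= argument is 1-based, so shift the 0-based range up by one
tmpPages = list(range(1, numPages + 1))
print(tmpPages)  # [1, 2, 3, 4]
```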
This cell loops over tabula.convert_into, passing each page number (i) into the 'pages=' argument and writing one CSV per page. tabula.read_pdf would not handle this for me, so this seems to be my only option.
# This loops over the main pdf file page by page, saving each page as a csv in the /pages directory
# THIS MIGHT TAKE SOME TIME IF THE FILE IS LARGE
print(len(tmpPages)," pages to be converted.") # Here is our list of pages.
# This for loop takes the list of pages in the PDF from the previous cell.
# This loop also converts the PDF into individual CSVs and saves them to /pages
for i in tmpPages:
    print("Converting page: " + str(i))
    tb.convert_into(infile,
                    "../pages/page-" + str(i) + ".csv",
                    guess=True,
                    output_format="CSV",
                    stream=True,
                    pages=i,
                    silent=True)
print("Done!")
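One thing to watch: convert_into just writes to the path you give it, so make sure the /pages directory exists before the loop runs. A small sketch (it uses a temp directory so it runs anywhere; swap in '../pages' for the notebook):

```python
import os
import tempfile

# Stand-in for '../pages' so this sketch runs anywhere
out_dir = os.path.join(tempfile.gettempdir(), "pages")
os.makedirs(out_dir, exist_ok=True)  # no-op if the directory already exists
print(os.path.isdir(out_dir))
```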
Finally, we use pandas to read in all of the CSVs created in the previous cell and concatenate them into one dataframe covering every converted PDF page.
# This cell takes the CSVs from the previous cell and converts them into one DataFrame
path = r'../pages/' # use your path
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, names=[0, 1, 2, 3, 4, 5], index_col=0, header=None)
    li.append(df)
frame = pd.concat(li, ignore_index=False)
frame
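One caveat: glob returns filenames in arbitrary order, so 'page-10.csv' can land before 'page-2.csv' and scramble the page order in the concatenated frame. A stdlib sketch for sorting the filenames numerically (the regex and helper function are mine, not part of the original):

```python
import re

# Unordered filenames, as glob might return them
all_files = ["../pages/page-10.csv", "../pages/page-2.csv", "../pages/page-1.csv"]

def page_num(path):
    # Pull the page number out of 'page-<n>.csv'
    m = re.search(r"page-(\d+)\.csv$", path)
    return int(m.group(1)) if m else -1

all_files.sort(key=page_num)
print(all_files)  # pages 1, 2, 10 in numeric order
```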
From here you can clean up your dataframe.
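As a starting point for that cleanup, a hedged sketch: drop rows that are entirely NaN (a common artifact of PDF line wrapping) and give the columns real names. The toy frame and the column names here are illustrative guesses, not taken from the actual file:

```python
import pandas as pd

# Toy frame mimicking the dirty output: a NaN-padded row from the PDF layout
frame = pd.DataFrame({1: ["maleic anhydride", None], 2: ["500", None]})

# Drop rows where every cell is NaN, then rename the numeric columns
clean = frame.dropna(how="all").rename(columns={1: "ingredient", 2: "oral_mg_kg"})
print(clean)
```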
Here are a few lines of the dataframe. It is very dirty, but I believe the numbers you were looking for are here.
1 2 3 4 5
0
Product/ingredient name Oral (mg/ Dermal Inhalation Inhalation Inhalation
NaN kg) (mg/kg) (gases) (vapours) (dusts
NaN NaN NaN (ppm) (mg/l) and mists)
NaN NaN NaN NaN NaN (mg/l)
maleic anhydride 500 NaN NaN NaN NaN
Phosphorodithioic acid, mixed O,O-bis REACH #: 01-2119493628-22 ≤2.4 Skin Irrit. 2, H315 [1] [2] NaN
(iso-bu and pentyl) esters, zinc salts EC: 270-608-0 NaN Eye Dam. 1, H318 NaN NaN
NaN CAS: 68457-79-4 NaN Aquatic Chronic 2, H411 NaN NaN
tabula is a fairly good python library for converting tables in a PDF to pandas DataFrames, so you could give it a shot. The catch with tabula is that pages=1 is the default, so it only reads the first page; you'll need tabula.read_pdf('A320.pdf', stream=True, pages=2) or pages='all'.