How to extract data from multiple PDFs in the same directory using python-camelot?

Question

I'm trying to extract data from multiple tables in multiple PDFs and save it in CSV format. I did my research and found that python-camelot is a good tool for this. I tried it and it works perfectly fine on a single PDF. However, I have over 50 PDFs in the same format, so I decided to iterate over all the files using a for loop, but it did not work: I get an error saying the files are not found in the directory. Can you please help? Here is the code:

import tkinter 
import camelot
import os

directory = 'C:\\Users\\Alr\\Desktop\\test\\'
files = [ filename for filename in os.listdir(directory)]
for i in range (len(files)):
    tables = camelot.read_pdf(files[i], pages='5,6,7')
    tables.export(files[i], f='csv', compress=True) # json, excel, html, sqlite
    tables.to_csv(files[i]+'.csv')

files never gets set to any value. Did you forget to read the folder?
– Jongware, Commented Mar 11, 2020 at 21:40
@usr2564301 thank you for your reply. I forgot to include it, just updated the code.
– Ahmad B, Commented Mar 11, 2020 at 22:30
Now the issue is clear (a common mistake, alas). os.listdir returns only the names of the files, which means the path is not included. Just prepend directory to the file name in read_pdf and you're set.
– Jongware, Commented Mar 12, 2020 at 0:02
@usr2564301 Thank you so much, I think you are right; now it's working after I added the path to the name. However, I have a problem with exporting: I use the filename as the name for the CSV file, but it includes ".pdf" and now the code is throwing an error. Is there any way to take the .pdf out of the name and just use the file name?
– Ahmad B, Commented Mar 12, 2020 at 11:17

Stefano Fiorucci - anakin87 · Accepted Answer · 2020-03-12 08:34:45Z

3

As suggested in the comments, the problem is that os.listdir returns only filenames and not complete paths.

You can try this:

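import glob
import os

import camelot

# glob.glob returns complete paths, so each one can be passed to camelot as-is
pattern = 'C:\\Users\\Alr\\Desktop\\test\\*.pdf'

for pdf_filepath in glob.glob(pattern):
    # swap only the final extension: ...\report.pdf -> ...\report.csv
    csv_filepath = os.path.splitext(pdf_filepath)[0] + '.csv'
    tables = camelot.read_pdf(pdf_filepath, pages='5,6,7')

    # export writes one file per table (zipped here because compress=True);
    # f can also be json, excel, html or sqlite
    tables.export(csv_filepath, f='csv', compress=True)
    # to_csv is a method of a single table, e.g. tables[0].to_csv(csv_filepath),
    # so the extra tables.to_csv(...) call from the question is not needed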
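
If you would rather keep os.listdir, as in the question, the fix from the comments is to join the directory back onto each name. Here is a minimal sketch of that idea. The helper name list_pdf_paths is made up for illustration, and the filter on the .pdf extension is my own assumption about what the folder contains:

```python
import os


def list_pdf_paths(directory):
    """Return full paths of the PDF files directly inside `directory`.

    Hypothetical helper for illustration, not part of camelot.
    """
    paths = []
    for name in sorted(os.listdir(directory)):
        # os.listdir yields bare names such as 'a.pdf', so the directory
        # has to be joined back on before the file can be opened
        if name.lower().endswith('.pdf'):
            paths.append(os.path.join(directory, name))
    return paths
```

Each returned path can then be passed to camelot.read_pdf in place of files[i].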


2 Comments

Ahmad B Over a year ago

Thank you, yes, I think adding the path works now. However, I have a problem with 'tables.to_csv(files[i]+'.csv')'. I'm using files[i] to name the CSV file every time I extract a table from a PDF file. As you might know, files[i] includes the file name plus .pdf, so is there a way to remove .pdf from the name before I export it? Right now it gives an error because it exports with .pdf.csv together.

Stefano Fiorucci - anakin87 Over a year ago

You can replace '.pdf' with '.csv' ---> tables.to_csv(files[i].replace('.pdf', '.csv'))
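
One caveat with str.replace is that it swaps every '.pdf' it finds in the path, including one that happens to sit in a folder name. A small sketch of an alternative using os.path.splitext, which only touches the final extension, might look like this. The helper name csv_path_for and the sample paths are invented for illustration:

```python
import os


def csv_path_for(pdf_path):
    """Return pdf_path with its final extension swapped for '.csv'.

    Hypothetical helper for illustration.
    """
    root, _ext = os.path.splitext(pdf_path)
    return root + '.csv'


print(csv_path_for('report.pdf'))                    # report.csv
print(csv_path_for('scans.pdf_archive/report.pdf'))  # scans.pdf_archive/report.csv
```

With replace('.pdf', '.csv'), the second path would come out as scans.csv_archive/report.csv, which points at a folder that does not exist.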
