I'm trying to extract data from multiple tables in multiple PDFs and save it in CSV format. I did some research and found that python-camelot is a good tool for this. I tried it and it works perfectly on a single PDF. However, I have over 50 PDFs in the same format, so I decided to iterate over all the files with a for loop. That did not work: I get an error saying the files are not found in the directory. Can you please help? Here is the code:

import tkinter 
import camelot
import os

directory = 'C:\\Users\\Alr\\Desktop\\test\\'
files = [ filename for filename in os.listdir(directory)]
for i in range (len(files)):
    tables = camelot.read_pdf(files[i], pages='5,6,7')
    tables.export(files[i], f='csv', compress=True) # json, excel, html, sqlite
    tables.to_csv(files[i]+'.csv')

  • files never gets set to any value. Did you forget to read the folder? Commented Mar 11, 2020 at 21:40
  • @usr2564301 Thank you for your reply. I forgot to include that part; I just updated the code. Commented Mar 11, 2020 at 22:30
  • Now the issue is clear, and it is a common mistake, alas. os.listdir returns only the names of the files, which means the path is not included. Just prepend directory to the file name in read_pdf and you're set. Commented Mar 12, 2020 at 0:02
  • @usr2564301 Thank you so much. I think you are right: it is working now that I added the path to the name. However, I have a problem with exporting. I use the file name as the name for the CSV file, but it includes the ".pdf" in the name, and now the code is throwing an error. Is there any method to take out the .pdf from the name and just use the file name? Commented Mar 12, 2020 at 11:17

1 Answer

As suggested in the comments, the problem is that os.listdir returns only filenames and not complete paths.

You can try this:

import tkinter 
import camelot
import glob

directory = 'C:\\Users\\Alr\\Desktop\\test\\*.pdf'
files = glob.glob(directory)  # glob returns full paths, so no directory prefix is needed

for pdf_filepath in files:
    csv_filepath=pdf_filepath.replace('.pdf','.csv')
    tables = camelot.read_pdf(pdf_filepath, pages='5,6,7')

    # the following two lines appear to do the same job; one of them is probably enough
    tables.export(csv_filepath, f='csv', compress=True) # json, excel, html, sqlite
    tables.to_csv(csv_filepath)
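
If you would rather keep os.listdir, the same idea can be sketched as below. This is only a minimal illustration, not part of the original code: the helper name pdf_csv_pairs is made up for this example, and it handles just the path logic (joining the directory back on and swapping the extension), so it runs without camelot installed.

```python
import os


def pdf_csv_pairs(directory):
    """Return (pdf_path, csv_path) pairs for every PDF in `directory`.

    Hypothetical helper for illustration; the name is not from camelot.
    """
    pairs = []
    for name in sorted(os.listdir(directory)):
        # os.listdir yields bare file names, so split off the extension here
        stem, ext = os.path.splitext(name)
        if ext.lower() != '.pdf':
            continue  # skip anything that is not a PDF
        # join the directory back on, since listdir dropped it
        pdf_path = os.path.join(directory, name)
        csv_path = os.path.join(directory, stem + '.csv')
        pairs.append((pdf_path, csv_path))
    return pairs
```

Each pair could then be fed to the loop above, with pdf_path going to camelot.read_pdf and csv_path going to the export call.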

2 Comments

Thank you. Yes, I think adding the path made it work. However, I have a problem with 'tables.to_csv(files[i]+'.csv')'. I'm using files[i] to name the CSV file every time I extract a table from a PDF file. As you might know, files[i] includes the file name plus .pdf, so is there a way to remove .pdf from the name before I export? Right now it gives an error because the exported name ends in .pdf.csv.
You can replace '.pdf' with '.csv' ---> tables.to_csv(files[i].replace('.pdf', '.csv'))
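
One caveat with str.replace is that it rewrites every '.pdf' in the string, not only the extension. A possible alternative is pathlib's with_suffix, sketched below. The function name csv_name_for and the folder name my.pdfs are assumptions made up for this example; PureWindowsPath is used only so the snippet behaves the same on any operating system, and on a real Windows machine pathlib.Path should work the same way.

```python
from pathlib import PureWindowsPath


def csv_name_for(pdf_path):
    """Swap only the final extension for '.csv' (hypothetical helper)."""
    return str(PureWindowsPath(pdf_path).with_suffix('.csv'))


# Hypothetical path whose folder name happens to contain '.pdf'
example = r'C:\Users\Alr\Desktop\my.pdfs\report.pdf'

# str.replace also changes the folder name, which is not wanted
replaced = example.replace('.pdf', '.csv')

# with_suffix touches only the extension of the last path component
swapped = csv_name_for(example)
```

Here replaced ends up pointing into a folder called my.csvs, while swapped keeps the folder name and changes only report.pdf to report.csv.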
