I want to read the coordinates of a particular line on a particular page of a PDF using Python. However, I have not been able to find a suitable library for this, so I am currently using the C# code below. Can anyone help me find a Python wrapper that would let me run this code from Python?

Code:

using System;
using System.Drawing;
using Bytescout.PDFExtractor;

namespace FindText
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create Bytescout.PDFExtractor.TextExtractor instance
            TextExtractor extractor = new TextExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";

            // Load sample PDF document
            extractor.LoadDocumentFromFile("sample1.pdf");

            int pageCount = extractor.GetPageCount();
            RectangleF location;

            for (int i = 0; i < pageCount; i++)
            {
                // Search each page for "ipsum" string
                if (extractor.Find(i, "ipsum", false, out location))
                {
                    do
                    {
                        Console.WriteLine("Found on page " + i + " at location " + location.ToString());

                    }
                    while (extractor.FindNext(out location));
                }
            }

            Console.WriteLine();
            Console.WriteLine("Press any key to continue...");
            Console.ReadLine();
        }
    }
}

1 Answer

I see three options for you to run this code from a python program (assuming you are on Windows):

Preferable: If you can use the IronPython interpreter (see ironpython.net), you can use the PDFExtractor class directly from Python code:

import clr
clr.AddReferenceToFileAndPath('c:\\path\\to\\pdfextractor.dll')
from Bytescout.PDFExtractor import TextExtractor
extractor = TextExtractor()
extractor.RegistrationName = 'demo'
# etc

Alternatively: Use the C# compiler csc.exe to compile your C# program before you run it (save your C# program as Extract.cs and make sure that it accepts the path to the PDF file as a command-line argument):

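import os, shutil, subprocess, tempfile

csc = 'c:\\WINDOWS\\Microsoft.Net\\Framework64\\v4.0.30319\\csc.exe'  # Or somewhere else, see below
filename = 'c:\\path\\to\\pdffile.pdf'
tempdir = tempfile.mkdtemp(prefix='Extract-temp-')
exe = os.path.join(tempdir, 'Extract.exe')
try:
    # Compile Extract.cs against the PDFExtractor assembly (argument lists survive paths with spaces)
    subprocess.check_call([csc, '/t:exe', '/out:' + exe,
                           '/r:c:\\path\\to\\PDFExtractor.dll', 'c:\\path\\to\\Extract.cs'])
    # Run the compiled program and capture everything it prints
    extractResult = subprocess.check_output([exe, filename], universal_newlines=True)
finally:
    shutil.rmtree(tempdir)  # Clean up even if compiling or running fails
print(extractResult)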

Up to .NET Framework version 4.5 / C# 5, csc.exe was included in the framework install. To get a version of csc.exe that supports C# 6.0, consult e.g. stackoverflow.com/questions/39089426.
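If you would rather not hard-code the csc.exe path, a small helper can look for it. This is only a sketch: find_csc is a hypothetical name, and it assumes the usual layout of Framework and Framework64 folders containing versioned vX.Y.Z subfolders under the .NET directory. It picks the highest version and prefers the 64-bit folder on a tie.

```python
import glob
import os
import re


def find_csc(dotnet_root):
    """Return the newest csc.exe found under dotnet_root, or None if there is none."""
    def sort_key(path):
        version_dir = os.path.basename(os.path.dirname(path))  # e.g. 'v4.0.30319'
        framework_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        version = [int(n) for n in re.findall(r'\d+', version_dir)]
        # Highest version wins; on a tie, prefer the 64-bit framework folder.
        return (version, framework_dir.lower() == 'framework64')

    candidates = glob.glob(os.path.join(dotnet_root, 'Framework*', 'v*', 'csc.exe'))
    return max(candidates, key=sort_key) if candidates else None


# Possible usage on Windows (the fallback directory is an assumption):
# csc = find_csc(os.path.join(os.environ.get('WINDIR', 'c:\\WINDOWS'), 'Microsoft.NET'))
```

If it returns None, no compiler was found in that layout and you will need to point csc at the right location yourself.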

Finally, you can use ctypes and the "Unmanaged Exports (DllExport for .Net)" NuGet package to call a C# assembly directly from CPython, as outlined in stackoverflow.com/questions/7367976.

EDIT based on denfromufa's comment: The best way to script PDFExtractor from Python is to use pythonnet in CPython (you can install it on Windows with python -m pip install pythonnet). With this approach, your C# program above can be replaced with this script (tested with Python 2.7, win32):

import clr
# 'import System'  will work here (must be after 'import clr')
# You can also import System.Drawing and other .NET namespaces
clr.AddReference(r'c:\path\to\Bytescout.PDFExtractor.dll')
from Bytescout.PDFExtractor import TextExtractor
extractor = TextExtractor()
extractor.RegistrationName = 'demo'
extractor.RegistrationKey = 'demo'
extractor.LoadDocumentFromFile(r'c:\path\to\mydoc.pdf')
pageCount = extractor.GetPageCount()
for i in range(pageCount):
    result = extractor.Find(i, "somestring", False)
    while result:
        print('Found on page ' + str(i) + ' on location ' + str(extractor.FoundText.Bounds))
        result = extractor.FindNext()
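
If you want the coordinates as plain Python numbers rather than a printed string, the same loop can collect them into a list. This is a rough sketch: rect_to_tuple and collect_hits are hypothetical helper names, and it assumes that FoundText.Bounds exposes X, Y, Width and Height the way System.Drawing.RectangleF does. It uses only the extractor calls from the script above.

```python
def rect_to_tuple(rect):
    """Convert a RectangleF-like object to (left, top, right, bottom) floats."""
    left, top = float(rect.X), float(rect.Y)
    return (left, top, left + float(rect.Width), top + float(rect.Height))


def collect_hits(extractor, needle):
    """Return a list of (page_index, (left, top, right, bottom)) for every match."""
    hits = []
    for page in range(extractor.GetPageCount()):
        found = extractor.Find(page, needle, False)
        while found:
            # Assumption: Bounds behaves like System.Drawing.RectangleF.
            hits.append((page, rect_to_tuple(extractor.FoundText.Bounds)))
            found = extractor.FindNext()
    return hits
```

With the extractor from the script above, collect_hits(extractor, "somestring") would give one entry per occurrence.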

5 Comments

Hey @sveinbr! Thank you for the help. The last method can surely work out. However, can you please show me how I can use the first three C# libraries (Bytescout.PDFExtractor, System.Drawing, System) in Python?
@Prabal: Hi, as denfromufa pointed out in the comment, pythonnet can do the same thing as the last method (marked Finally) in a better way. I will update the answer with a recipe.
@sveinbr: The method suggested by you and denfromufa should work. However, notice the line: clr.AddReference(r'c:\path\to\Bytescout.PDFExtractor.dll'). The problem is that the Bytescout library is a paid one. Can you suggest an alternative to this library?
@Prabal: I have no experience with PDFExtractor or similar libraries myself, but after a quick Google search, I think the open-source, pure-Python pdfminer could be what you need. (By the way: please accept my answer if you find that it solved the question in the original post.)
@Prabal: Accepting correct answers is considered polite on Stack Overflow. You can use the gray check mark under the voting buttons to mark an answer as accepted.
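
The pdfminer suggestion from the comments could look roughly like the sketch below. It assumes the maintained fork pdfminer.six (python -m pip install pdfminer.six) and its extract_pages function. The helper names page_text_lines and matching_lines are made up for illustration, and nobody in this thread has tested this against the original PDF.

```python
def matching_lines(lines, needle):
    """Yield (text, bbox) for each line whose text contains needle.

    `lines` is any iterable of objects with get_text() and a bbox attribute,
    which is what pdfminer.six text-line objects provide.
    """
    for line in lines:
        text = line.get_text().strip()
        if needle in text:
            yield text, tuple(line.bbox)


def page_text_lines(pdf_path, page_index):
    """Yield the text lines of one page (zero-based index) using pdfminer.six."""
    # Imported lazily so matching_lines works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path, page_numbers=[page_index]):
        for element in page:
            if isinstance(element, LTTextLine):
                yield element
            elif isinstance(element, LTTextContainer):
                # Text boxes contain the individual lines.
                for line in element:
                    if isinstance(line, LTTextLine):
                        yield line


# Possible usage (the path is a placeholder):
# for text, (x0, y0, x1, y1) in matching_lines(page_text_lines(r'c:\path\to\mydoc.pdf', 0), 'ipsum'):
#     print(text, x0, y0, x1, y1)
```

Note that pdfminer reports each bbox as (x0, y0, x1, y1) in PDF points with the origin at the bottom-left corner of the page, so the y values may need flipping if you compare them with coordinates from other tools.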
