I'm using PDFBox in Java and have successfully extracted the text of a PDF. Now I want to search for a specific word and retrieve only the number that follows it. Concretely, I want to search for "Tax" and retrieve the tax amount that comes after it. The two strings seem to be separated by a tab.

My code currently looks like this:

File file = new File("yes.pdf");
// try-with-resources closes the document even if reading fails
try (PDDocument document = PDDocument.load(file)) {
    PDFTextStripper pdfStripper = new PDFTextStripper();
    String text = pdfStripper.getText(document);

    System.out.println(text);

    // search for the word "Tax"
    // retrieve the number after the word "Tax"
} catch (IOException e) {
    e.printStackTrace();
}
  • what is after the tax number? a space? a tab? something else? Commented Dec 4, 2019 at 10:36
  • Yes, the word tax is followed by a space and then the number Commented Apr 30, 2020 at 22:13

2 Answers

I have used a similar approach in my project. I hope it helps.

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractNumber {

    public static void main(String[] args) throws IOException {
        List<String> digitList = new ArrayList<>();
        String text;

        // Read the text from the PDF; try-with-resources closes the document
        try (PDDocument doc = PDDocument.load(new File("yourFile location"))) {
            text = new PDFTextStripper().getText(doc);
        }

        // A letter immediately followed by one or more digits
        Pattern mainPattern = Pattern.compile("[a-zA-Z]\\d+");
        // Only the digits within such a match
        Pattern subPattern = Pattern.compile("\\d+");

        Matcher mainMatcher = mainPattern.matcher(text);
        while (mainMatcher.find()) {
            Matcher subMatcher = subPattern.matcher(mainMatcher.group());
            if (subMatcher.find()) {
                digitList.add(subMatcher.group());
            }
        }

        // Print every number that was found
        for (String digit : digitList) {
            System.out.println(digit);
        }
    }
}

The regular expression [a-zA-Z]\d+ matches a letter immediately followed by one or more digits in the PDF text.

The \d+ expression then extracts just the digits from each of those matches.

You can also use a different regular expression to find a specific number of digits.

You can get more ideas from this tutorial.
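
To illustrate matching a specific number of digits, here is a minimal sketch. The class and method names (FixedLengthDigits, findDigitRuns) are made up for this example, and it works on an already extracted string rather than on a PDF. It assumes you want runs of exactly that many digits, so longer numbers are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class FixedLengthDigits {

    // Returns every run of exactly `length` digits. The lookbehind and
    // lookahead reject runs that are part of a longer number.
    static List<String> findDigitRuns(String text, int length) {
        Pattern pattern = Pattern.compile("(?<!\\d)\\d{" + length + "}(?!\\d)");
        Matcher matcher = pattern.matcher(text);
        List<String> runs = new ArrayList<>();
        while (matcher.find()) {
            runs.add(matcher.group());
        }
        return runs;
    }
}
```

For example, findDigitRuns("A1234 B56 C789012 D0042", 4) returns the two values 1234 and 0042.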


The best way to do something like that is to use regular expressions. I often use this tool to write my regular expressions. Your regex should probably look something like: tax\s([0-9]+). You can take a look at this tutorial on how to use regex in Java.
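
A rough sketch of how that regex could be applied in Java, assuming the text has already been extracted from the PDF into a String. The class name TaxNumberFinder, the method findTaxNumber, and the sample text are invented for illustration. The pattern is a slight variation on the one above: it ignores case so that "Tax" matches, and it accepts one or more spaces or tabs before the number. It captures whole digits only, so a decimal amount would need a wider pattern.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class TaxNumberFinder {

    // The word "tax", then spaces or tabs, then digits; group 1 holds the digits.
    // Every regex backslash is doubled because this is a Java string literal.
    private static final Pattern TAX =
            Pattern.compile("\\btax[ \\t]+(\\d+)", Pattern.CASE_INSENSITIVE);

    // Returns the first number that follows the word "tax", if there is one.
    static Optional<String> findTaxNumber(String text) {
        Matcher matcher = TAX.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the PDFTextStripper output
        String sample = "Subtotal 100\nTax\t25\nTotal 125\n";
        System.out.println(findTaxNumber(sample).orElse("not found"));
    }
}
```

Returning an Optional makes the "no tax line found" case explicit instead of relying on null.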

1 Comment

I use and love that website. Peter: a \ that you use on that website must be written as \\ in a Java string literal, or it won't work. If you post an excerpt of your file (anonymize any real numbers), somebody here could write some code that extracts what you need. You'll need Matcher, find(), and group(). Add an appropriate tag; this isn't really a PDFBox question.
