I am trying to loop through each page of a PDF to look for specific keywords. The code works fine on other PDFs, but it fails on this one.

My code

Using oReader As New pdf.PdfReader(pdfFilename)

    For pCurrent = oReader.NumberOfPages To 1 Step -1
        Dim strategy As pdf.parser.ITextExtractionStrategy = New pdf.parser.SimpleTextExtractionStrategy()
        Dim pageText As String = pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, pCurrent, strategy)

        '
        'search for keywords
        '
        'FindVOI

    Next 'proceed next page

End Using

Here is the line of code that causes the exception:

Dim pageText As String = pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, pCurrent, strategy)

It throws a "Stack empty" exception at page 98 of this PDF. Any ideas what is wrong?

Full Exception:

Exception thrown: 'System.InvalidOperationException' in System.dll
System.Transactions Critical: 0 : <TraceRecord xmlns="http://schemas.microsoft.com/2004/10/E2ETraceEvent/TraceRecord" Severity="Critical"><TraceIdentifier>http://msdn.microsoft.com/TraceCodes/System/ActivityTracing/2004/07/Reliability/Exception/Unhandled</TraceIdentifier><Description>Unhandled exception</Description><AppDomain>VipMonitorService.vshost.exe</AppDomain><Exception><ExceptionType>System.InvalidOperationException, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089</ExceptionType><Message>Stack empty.</Message><StackTrace>   at System.ThrowHelper.ThrowInvalidOperationException(ExceptionResource resource)
   at System.Collections.Generic.Stack`1.Pop()
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.EndMarkedContentC.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener, IDictionary`2 additionalContentOperators)
   at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
   at WatcherApp.VipMonitorService.PDFHelper.FindVOI(List`1 voiList, String pdfFilename, Boolean searchFromLast, Int32 searchNumberOfPagesInPercent) in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\PDFHelper.vb:line 59
   at WatcherApp.VipMonitorService.Controller.ProcessAnnualReport(Announcement a) in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 456
   at WatcherApp.VipMonitorService.Controller.ProcessARInQueueThread() in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 362
   at WatcherApp.VipMonitorService.Controller._Lambda$__40-0() in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 339
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()</StackTrace><ExceptionString>System.InvalidOperationException: Stack empty.
   at System.ThrowHelper.ThrowInvalidOperationException(ExceptionResource resource)
   at System.Collections.Generic.Stack`1.Pop()
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.EndMarkedContentC.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener, IDictionary`2 additionalContentOperators)
   at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
   at WatcherApp.VipMonitorService.PDFHelper.FindVOI(List`1 voiList, String pdfFilename, Boolean searchFromLast, Int32 searchNumberOfPagesInPercent) in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\PDFHelper.vb:line 59
   at WatcherApp.VipMonitorService.Controller.ProcessAnnualReport(Announcement a) in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 456
   at WatcherApp.VipMonitorService.Controller.ProcessARInQueueThread() in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 362
   at WatcherApp.VipMonitorService.Controller._Lambda$__40-0() in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 339
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()</ExceptionString></Exception></TraceRecord>
  • The stack trace seems to indicate that you have an end-marked-content instruction without the matching begin-marked-content instruction. I'll later look into the PDF. Commented Jun 19, 2017 at 4:33

1 Answer

It throws a "Stack empty" exception at page 98 of this PDF. Any ideas what is wrong?

The stack trace shows that the "Stack empty" exception occurs in iTextSharp.text.pdf.parser.PdfContentStreamProcessor.EndMarkedContentC.Invoke, i.e. while an EMC operator is being processed. Thus, we should look at the operators that start and end marked content:

BMC (operands: tag): Begin a marked-content sequence terminated by a balancing EMC operator. tag shall be a name object indicating the role or significance of the sequence.

BDC (operands: tag properties): Begin a marked-content sequence with an associated property list, terminated by a balancing EMC operator. tag shall be a name object indicating the role or significance of the sequence. properties shall be either an inline dictionary containing the property list or a name object associated with it in the Properties subdictionary of the current resource dictionary (see 14.6.2, “Property Lists”).

EMC (operands: none): End a marked-content sequence begun by a BMC or BDC operator.

(Table 320 – Marked-content operators, ISO 32000-1)

If you look at the operators that begin (BMC/BDC) and end (EMC) marked content on the page in question, you'll see:

/Artifact <</O /Layout >>BDC
EMC 
/Artifact <</O /Layout >>BDC  
EMC  
/Artifact <</O /Layout >>BDC  
EMC 
/Artifact <</BBox [0 33.8887 407.4289 0 ]/O /Layout >>BDC  
EMC 
EMC
...

Thus, there is a surplus EMC operator: no BMC or BDC operator precedes it whose marked-content sequence it could end.

Thus, this document is not a valid PDF; in particular, its marked content structure is broken.
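To make the imbalance concrete, here is a minimal, self-contained sketch of the bookkeeping involved. It does not use iTextSharp; the module name MarkedContentCheck and the function FindSurplusEmc are made up for this illustration, and the input is simply the list of marked-content operators in the order they appear in the content stream.

```vbnet
Imports System
Imports System.Collections.Generic

Module MarkedContentCheck

    ' Hypothetical helper (not part of iTextSharp): returns the 0-based indexes
    ' of EMC operators that have no open BMC/BDC sequence to close.
    Function FindSurplusEmc(operators As IEnumerable(Of String)) As List(Of Integer)
        Dim surplus As New List(Of Integer)()
        Dim depth As Integer = 0   ' number of currently open marked-content sequences
        Dim index As Integer = 0

        For Each op As String In operators
            Select Case op
                Case "BMC", "BDC"
                    depth += 1
                Case "EMC"
                    If depth = 0 Then
                        ' Nothing is open: this is where a blind Pop() fails.
                        surplus.Add(index)
                    Else
                        depth -= 1
                    End If
            End Select
            index += 1
        Next

        Return surplus
    End Function

    Sub Main()
        ' Operators from the page in question: four BDC/EMC pairs, then one extra EMC.
        Dim ops As String() = {"BDC", "EMC", "BDC", "EMC", "BDC", "EMC", "BDC", "EMC", "EMC"}
        For Each i As Integer In FindSurplusEmc(ops)
            Console.WriteLine("Surplus EMC at operator index " & i.ToString())
        Next
    End Sub

End Module
```

For the sequence shown above, the only unmatched operator is the last EMC (index 8), which is the one that empties the parser's stack.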


That being said, it would be appropriate for iTextSharp to check the stack before the Pop and then either throw a more meaningful exception or ignore the EMC operator.
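As long as the library does not do such a check, one possible workaround on the caller's side is to catch the exception per page and carry on with the remaining pages. The following is only a sketch under assumptions: the module and function names are hypothetical, it assumes an iTextSharp 5.x version in which PdfReader can be used in a Using block (as in the question), and it loses the text of the broken page entirely.

```vbnet
Imports System
Imports System.Collections.Generic
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module TolerantExtraction

    ' Hypothetical helper: returns the extracted text keyed by page number,
    ' skipping pages whose content stream makes the parser throw.
    Function ExtractTextSkippingBrokenPages(pdfFilename As String) As Dictionary(Of Integer, String)
        Dim result As New Dictionary(Of Integer, String)()

        Using reader As New PdfReader(pdfFilename)
            For pageNumber As Integer = reader.NumberOfPages To 1 Step -1
                Try
                    Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy()
                    result(pageNumber) = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy)
                Catch ex As InvalidOperationException
                    ' "Stack empty": an EMC without a matching BMC/BDC on this page.
                    ' Assumption: skipping the page is acceptable for a keyword search.
                    Console.Error.WriteLine("Skipping page " & pageNumber.ToString() & ": " & ex.Message)
                End Try
            Next
        End Using

        Return result
    End Function

End Module
```

Catching InvalidOperationException this broadly may also hide unrelated problems, so logging the page number and message, as done here, is advisable.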

2 Comments

I think I might be running into a similar problem. How did you view the marked content on that pdf?
I used a viewer for the internal structure of PDFs, such as iText RUPS or PDFBox PDFDebugger. Adobe Acrobat Pro Preflight contains such a tool, too. Any of them can be used to inspect page content streams.
