0

I have a block of text read from a PDF document using the iTextSharp library (method: GetResultantText()).

Assume the text is laid out in paragraphs:

*"Paragraph One.

Paragraph Two. ...

Paragraph n "*

Is there a way to use the C# StringBuilder class, or perhaps an alternative approach, to store the text while retaining the formatting (carriage returns, paragraph breaks, etc.) and then save the value in a varchar field in SQL Server 2008?

Ultimately I intend to store the text in a varchar field and would like to retain the line feeds and carriage returns (basic formatting metadata); otherwise the extracted text is a single block that isn't readable when rendered.

I reckon that invoking the ToString() method on a StringBuilder object removes all intermediate formatting characters in the text except the terminating newline character.

    SimpleTextExtractionStrategy strategy;
    //StreamWriter writer = new StreamWriter("c:\\pdfOutput.txt");

    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        try
        {
            strategy = parser.ProcessContent(i, new SimpleTextExtractionStrategy());
            buffer.AppendLine(strategy.GetResultantText());
            //writer.WriteLine(strategy.GetResultantText());
        }
        catch (IndexOutOfRangeException) { } // pages that fail to parse are skipped
    }

    pdfText = buffer.ToString();
    Console.WriteLine("* End: Text Extraction Process ...");
    return pdfText;

If I uncomment the StreamWriter lines and write to a text file, the formatting is retained. However, if I save the resulting text into an entity defined as follows, all I get is a single block of text:

    [System.Data.Linq.Mapping.Table(Name = "ReportsText")]
    public class ReportsText
    {
        [Column(IsDbGenerated = true, AutoSync = AutoSync.OnInsert)]
        public int ID { get; set; }

        [Column(IsPrimaryKey = true, AutoSync = AutoSync.OnInsert)]
        public String image { get; set; }

        [Column]
        public String announcement { get; set; }
    }

So pdfText is intended to be stored in the announcement field. Cheers.

1
  • I don't think the formatting goes away... Commented May 31, 2011 at 5:06

2 Answers

2

I don't think it should remove the formatting, and if it is doing so, append "\r\n" after each paragraph and then store it.
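
To illustrate the point, here is a minimal sketch (the names NewlineDemo, BuildText and ShowControlChars are made up for this example and are not part of the question's code). It appends an explicit "\r\n" after each paragraph, calls ToString(), and then prints the result with the control characters made visible, which should show that StringBuilder leaves them untouched:

```csharp
using System;
using System.Text;

public static class NewlineDemo
{
    // Joins paragraphs, adding an explicit CR+LF after each one.
    public static string BuildText(string[] paragraphs)
    {
        var buffer = new StringBuilder();
        foreach (string paragraph in paragraphs)
        {
            buffer.Append(paragraph).Append("\r\n");
        }
        return buffer.ToString();
    }

    // Replaces CR and LF with the literal text \r and \n so they can be seen.
    public static string ShowControlChars(string text)
    {
        return text.Replace("\r", "\\r").Replace("\n", "\\n");
    }

    public static void Main()
    {
        string text = BuildText(new[] { "Paragraph One.", "Paragraph Two." });
        Console.WriteLine(ShowControlChars(text));
        // prints: Paragraph One.\r\nParagraph Two.\r\n
    }
}
```

Running the same ShowControlChars check on the value read back from the table is one way to confirm whether the line breaks survived the round trip, rather than judging by how a tool displays the column.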

1 Comment

It turns out the "\r\n" formatting is indeed retained, which I verified by fetching the value from the table and passing it to Console.WriteLine(). Initially I was copying the value directly from SQL Server Management Studio and pasting it into a text file, which isn't the right way to verify. Thanks.

1

You are correct that a StringBuilder by itself holds only plain text, so no formatting survives beyond the newline characters. If you really want to store a string with formatting information in the database, I would suggest storing it in a predefined format such as XML, RTF or even HTML, then retrieving it the same way so it can be fed to iTextSharp.

Another option I can think of is to generate the PDFs directly and store the binary stream in the database, in a binary column such as varbinary(max) rather than a text type like ntext or CLOB. This is not best practice, though.
