How to parse and preserve text formatting (Python-Docx)?

Question

I'm using Python-Docx to export all the data from a 500-page Docx file into a spreadsheet using pandas. So far so good except that the process is removing all character styles. I have written the following to preserve superscript, but I can't seem to get it working.

for para in document.paragraphs:
    content = para.text

    for run in para.runs:
        if run.font.superscript:
            r.font.superscript = True

        r = para.add_run(run.text)
        scripture += r.text

My input text might be, for example:

Genesis 1:1 ¹ In the beginning God created the heavens and the earth.

But my output into the Xlsx file is:

Genesis 1:1 1 In the beginning God created the heavens and the earth. (Still losing the superscript formatting).

How do I preserve the font.style of each run for export? Perhaps more specifically, how do I get the text formatting from each run to be encoded into the "scripture" string?

Any help is greatly appreciated!

scanny · Accepted Answer · 2021-09-10 18:41:01Z

You cannot encode font information in a str object. A str object is a sequence of characters and that's that. It cannot indicate "make these five characters bold and the following three characters italic". There's just no place to put that sort of thing, and the str data type is not made for that job.

Font (character-formatting) information must be stored in a container object of some sort. In Word, that's a run. In HTML it can be a <span> element. If you want character formatting in your spreadsheet, you'll need to know how character formatting is stored in the target format (Excel maybe) and then apply it to text in that export format on a run-by-run basis.
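To make "run-by-run" a little more concrete, here is a minimal sketch that collects (text, is_superscript) pairs from a paragraph's runs instead of flattening them into one str. The helper name runs_to_segments is made up for this example. It only assumes each run exposes .text and .font.superscript, as python-docx runs do, and it deliberately ignores formatting inherited from styles.

```python
from itertools import groupby


def runs_to_segments(runs):
    """Collapse runs into (text, is_superscript) pairs.

    Neighbouring runs with the same flag are merged and empty runs are
    skipped. `runs` can be any iterable of objects with `.text` and
    `.font.superscript`, e.g. `para.runs` from python-docx.
    """
    flagged = (
        # font.superscript is tri-state in python-docx: True, False, or
        # None for "inherited". As a simplification, None is treated as
        # "not superscript" here.
        (bool(run.font.superscript), run.text)
        for run in runs
        if run.text
    )
    return [
        ("".join(text for _, text in group), is_sup)
        for is_sup, group in groupby(flagged, key=lambda pair: pair[0])
    ]
```

With a python-docx document you would call it as runs_to_segments(para.runs) for each paragraph, then hand the resulting pairs to whatever writes your export format.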

There are some other problems with your code you should be aware of:

The r in r.font.superscript = True is used before it is defined. The r = para.add_run(run.text) line would need to appear before that line to avoid problems. I wouldn't bother fixing it, because it turns out that line isn't actually doing anything useful here, but names need to be defined before use.
You are doubling the size of the source paragraph by adding runs to it. This part contributes nothing, because you then read the run's text, which as mentioned cannot contain any character-formatting information, so the formatting gets stripped right back out.

The same result as your current code can be achieved by this:

scripture = "".join(p.text for p in document.paragraphs)

but I think you'll want an approach like:

Parse out bits that go in separate cells
Within the text that goes into a single cell, write a "rich-text" cell something like that described here for XlsxWriter: https://xlsxwriter.readthedocs.io/example_rich_strings.html
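The second step might look roughly like the sketch below. It assumes the cell's text has already been split into (text, is_superscript) pairs; the write_segments helper and that pair layout are inventions for this example, not part of either library. The XlsxWriter calls it relies on are worksheet.write_rich_string(), worksheet.write_string(), and a format created with workbook.add_format({"font_script": 1}) for superscript.

```python
def write_segments(worksheet, row, col, segments, superscript_format):
    """Write (text, is_superscript) pairs into one cell as a rich string."""
    # Excel does not allow empty fragments inside a rich string.
    segments = [(text, is_sup) for text, is_sup in segments if text]

    # write_rich_string() takes alternating formats and strings; a string
    # with no format in front of it keeps the default cell formatting.
    fragments = []
    for text, is_sup in segments:
        if is_sup:
            fragments.append(superscript_format)
        fragments.append(text)

    if len(fragments) <= 2:
        # XlsxWriter rejects rich strings with this few fragments, so fall
        # back to an ordinary string cell (formatted if it is all superscript).
        plain = "".join(text for text, _ in segments)
        all_sup = bool(segments) and all(is_sup for _, is_sup in segments)
        return worksheet.write_string(
            row, col, plain, superscript_format if all_sup else None
        )
    return worksheet.write_rich_string(row, col, *fragments)


# Possible usage, assuming XlsxWriter is installed:
#   import xlsxwriter
#   workbook = xlsxwriter.Workbook("scripture.xlsx")
#   worksheet = workbook.add_worksheet()
#   sup = workbook.add_format({"font_script": 1})  # 1 = superscript
#   write_segments(worksheet, 0, 0,
#                  [("Genesis 1:1 ", False), ("1", True),
#                   (" In the beginning God created ...", False)], sup)
#   workbook.close()
```

The format is passed in rather than created inside the helper so that one superscript format can be reused for every cell in the workbook.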
