Why is python-docx returning cells with text when they should be empty?

Question

I have a docx document converted from PDF with the pdf2docx library. The result looks good, but when I load the document with python-docx, it produces a table whose cells contain text instead of being empty. These cells are filled with the text of the cell one row above them.

The table looks like this:

The table contains three rows. The first row should contain cells with the values [Barriere, Bonuslevel, Cap, Beobachtungszeitraum, Anfangl], and the second and third rows should be empty except for the last column. But in the debugger I can see that the empty cells contain text values like this:

The text Basiswert is in the first cell and in the sixth cell. The sixth cell should be empty. I opened the XML of the docx document and everything looks fine there, so I think the problem is in the python-docx library. Has anyone had the same problem?

Edit: This article is very valuable:

https://python-docx.readthedocs.io/en/latest/dev/analysis/features/table/cell-merge.html

Basically, the copied cells are continuation cells, which indicate that cells are merged into horizontal or vertical spans, but I still don't know how to read this information from the python-docx API.

I don't think python-docx can just imagine data where none exists. Are you sure this isn't about e.g. merged columns or such?
– AKX, Commented Oct 4, 2021 at 13:38
What do you mean by 'merged columns'? How should I recognize them?
– Mário Jaroš, Commented Oct 4, 2021 at 13:53
I don't know what the lines in your command panel are, but the hex address of your 6th cell is exactly the same as that of the 1st. Since the table at the beginning of your post has 5 columns, couldn't that be the cause of your problem?
– Christophe, Commented Oct 4, 2021 at 15:11
Yes, I noticed that too, but I still have no clue what's going on there. I wanted to ignore the cells with the same id, but when I iterate over the rows, the references to the cells change and are no longer the same.
– Mário Jaroš, Commented Oct 4, 2021 at 16:25

scanny · Accepted Answer · 2021-10-04 17:24:41Z

The addressing of table cells in python-docx is based on the grid layout. Basically, the grid is all the cells as they were before any cell merging was done. In the grid layout there are n rows and m columns, giving m * n cells; each row-column intersection has a cell.

When you address a grid cell that is "merged" into some other cell, then the top-left member of the merged (rectangular) region is returned.

This means that some content is returned more than once if the table includes merged cells.
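One way to get the empty cells the question expects is to blank out every repeat of a merged cell. The following is only a minimal sketch: it assumes that every repeat of a merged cell exposes the same underlying `<w:tc>` element through the private `_tc` attribute mentioned in the comments, and `texts_with_repeats_blanked` is a hypothetical helper name, not part of python-docx.

```python
def texts_with_repeats_blanked(rows_of_cells):
    """Return the text of each cell, row by row, with "" for repeated cells.

    `rows_of_cells` is an iterable of cell sequences, one per table row.
    Assumption: each cell has a `.text` attribute and a private `._tc`
    attribute, and repeats of a merged cell share the same `_tc` object.
    """
    seen = []  # holding the elements keeps them alive, so `is` stays reliable
    result = []
    for cells in rows_of_cells:
        row_texts = []
        for cell in cells:
            tc = cell._tc
            if any(tc is earlier for earlier in seen):
                # Same underlying element as an earlier grid position:
                # this position belongs to a merged region, report it as empty.
                row_texts.append("")
            else:
                seen.append(tc)
                row_texts.append(cell.text)
        result.append(row_texts)
    return result
```

With python-docx this would be called as `texts_with_repeats_blanked(row.cells for row in table.rows)`. Because `_tc` is a private attribute, this may stop working in a future release.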


3 Comments

Mário Jaroš Over a year ago

OK, so the question is: how do I recognize cells that are merged?

scanny Over a year ago

Have a search on "python-docx merged cells", this is one discussion with some ideas you might find useful.

Mário Jaroš Over a year ago

Yep, I have found something, e.g. github.com/python-openxml/python-docx/issues/232. The _tc attribute of a cell has information about the span.
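The span information mentioned above can also be read straight from the XML. This sketch looks up the `w:gridSpan` and `w:vMerge` elements of WordprocessingML; the helper names `cell_merge_info` and `row_merge_info` are hypothetical, and the functions accept any element that supports `find`/`findall` with namespace-qualified tags.

```python
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _w(tag):
    """Build a namespace-qualified name such as '{...}tcPr'."""
    return f"{{{W_NS}}}{tag}"


def cell_merge_info(tc):
    """Return (grid_span, v_merge) for one <w:tc> element.

    grid_span: number of grid columns the cell covers (1 if not merged).
    v_merge:   None, "restart" (top cell of a vertical merge) or
               "continue" (a cell absorbed into the cell above).
    """
    grid_span, v_merge = 1, None
    tc_pr = tc.find(_w("tcPr"))
    if tc_pr is not None:
        span = tc_pr.find(_w("gridSpan"))
        if span is not None:
            grid_span = int(span.get(_w("val"), "1"))
        merge = tc_pr.find(_w("vMerge"))
        if merge is not None:
            # A <w:vMerge/> element without a val attribute means "continue".
            v_merge = merge.get(_w("val"), "continue")
    return grid_span, v_merge


def row_merge_info(tr):
    """Return the merge info of every <w:tc> directly inside a <w:tr>."""
    return [cell_merge_info(tc) for tc in tr.findall(_w("tc"))]
```

With python-docx, something like `row_merge_info(row._tr)` for each `row` in `table.rows` should list the spans, assuming the private `_tr` attribute is available. Going through the row element rather than `cell._tc` matters for vertical merges: for a continuation position python-docx hands back the top cell of the merge, so its `_tc` would describe that cell rather than the continuation.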
