4

I am running python 3.3 in Windows and I need to pull strings out of Word documents. I have been searching far and wide for about a week on the best method to do this. Originally I tried to save the .docx files as .txt and parse through using RE's, but I had some formatting problems with hidden characters - I was using a script to open a .docx and save as .txt. I am wondering if I did a proper File>SaveAs>.txt would it strip out the odd formatting and then I could properly parse through? I don't know but I gave up on this method.

I tried to use the docx module but I've been told it is not compatible with python 3.3. So I am left with using pywin32 and the COM. I have used this successfully with Excel to get the data I need but I am having trouble with Word because there is FAR less documentation and reading through the object model on Microsoft's website is over my head.

Here is what I have so far to open the document(s):

import win32com.client as win32
import glob, os

word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True

for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)

So at this point I can do something like

print(doc.Content.Text) 

And see the contents of the files, but it still looks like there is some odd formatting in there and I have no idea how to actually parse through to grab the data I need. I can create RE's that will successfully find the strings that I'm looking for, I just don't know how to implement them into the program using the COM.

The code I have so far was mostly found through Google. I don't even think this is that hard, it's just that reading through the object model on Microsoft's website is like reading a foreign language. Any help is MUCH appreciated. Thank you.

Edit: code I was using to save the files from docx to txt:

for path, dirs, files in os.walk(r'mypath'):
    for doc in [os.path.abspath(os.path.join(path, filename)) for filename in files if fnmatch.fnmatch(filename, '*.docx')]:
        print("processing %s" % doc)
        wordapp.Documents.Open(doc)
        docastxt = doc.rstrip('docx') + 'txt'
        wordapp.ActiveDocument.SaveAs(docastxt,FileFormat=win32com.client.constants.wdFormatText)
        wordapp.ActiveDocument.Close()
4
  • What about telling Word to just save each document as a plain text file (with name os.path.splitext(infile)[0] + '.txt'), which you can then operate on in Python? It's a little hacky, but it does mean you don't have to learn Word's complicated document model. If that's not enough to go on, I can show you some (untested, but probably-close-to-correct) code to do the save-as and you can probably take it from there. Commented Nov 26, 2013 at 21:16
  • @abernet Thanks for the suggestion. That seems to be the best way to do it. I have already pieced together some code to save the .docx files as plain text, and I verified that all the hidden formatting was gone. Since the Word files have a lot of tables, the plain text has a ton of \n, but nothing else. So now it's just a matter of iterating through each line and picking out the strings I need. Fantastic! Commented Nov 26, 2013 at 21:28
  • 1
    Two last thoughts: First, if you have any non-ASCII text, you might want to use wdFormatUnicodeText (that is, 7) rather than one of the encoded plain-text formats. Just remember that what Microsoft calls "Unicode" is UTF-16-LE (with BOM), so you'd need to open(txtfile, 'utf-16'). Second, Word always saves with DOS \r\n line endings (it can also do classic Mac, but not Unix, because Microsoft). So, make sure to use universal newlines when reading the files in Python. Commented Nov 26, 2013 at 21:37
  • 1
    One comment on your new version: The Documents.open method returns the opened Document. Just use that; don't rely on the fact that the newly-opened document will usually become the ActiveDocument. Commented Nov 26, 2013 at 22:57

2 Answers 2

3

If you don't want to learn the complicated way Word models documents, and then how that's exposed through the Office object model, a much simpler solution is to have Word save a plain-text copy of the file.

There are a lot of options here. Use tempfile to create temporary text files and then delete them, or store permanent ones alongside the doc files for later re-use? Use Unicode (which, in Microsoft speak, means UTF-16-LE with a BOM) or encoded text? And so on. So, I'll just pick something reasonable, and you can look at the Document.SaveAs, WdSaveFormat, etc. docs to modify it.

wdFormatUnicodeText = 7

for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
    txtpath = os.path.splitext('infile')[0] + '.txt'
    doc.SaveAs(txtpath, wdFormatUnicodeText)
    doc.Close()
    with open(txtpath, encoding='utf-16') as f:
        process_the_file(f)

As noted in your comments, what this does to complex things like tables, multi-column text, etc. may not be exactly what you want. In that case, you might want to consider saving as, e.g., wdFormatFilteredHTML, which Python has nice parsers for. (It's a lot easier to BeautifulSoup a table than to win32com-Word it.)

Sign up to request clarification or add additional context in comments.

2 Comments

@abernet, I edited my original post with the code snippet I am/was using to convert docx to txt files. Would I be okay using "FileFormat=win32com.client.constants.wdFormatUnicodeText" in my code for the same result?
@griffsterb: Yes—in fact, not just OK, but better. I just used the number because in some cases (and I don't remember the details, and don't have anything to test on) COM constants don't work properly with pywin32. If they work, use them.
0

oodocx is my fork of the python-docx module that is fully compatible with Python 3.3. You can use the replace method to do regular expression searches. Your code would look something like:

from oodocx import oodocx

d = oodocx.Docx('myfile.docx')
d.replace('searchstring', 'replacestring')
d.save('mynewfile.docx')

If you just want to remove strings, you can pass an empty string to the "replace" parameter.

1 Comment

The method that I worked out above with abernert seems to be working well for now, but I like this as well. I will keep it bookmarked for future use. Thank you.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.