I am running python 3.3 in Windows and I need to pull strings out of Word documents. I have been searching far and wide for about a week on the best method to do this. Originally I tried to save the .docx files as .txt and parse through using RE's, but I had some formatting problems with hidden characters - I was using a script to open a .docx and save as .txt. I am wondering if I did a proper File>SaveAs>.txt would it strip out the odd formatting and then I could properly parse through? I don't know but I gave up on this method.
I tried to use the docx module but I've been told it is not compatible with python 3.3. So I am left with using pywin32 and the COM. I have used this successfully with Excel to get the data I need but I am having trouble with Word because there is FAR less documentation and reading through the object model on Microsoft's website is over my head.
Here is what I have so far to open the document(s):
import win32com.client as win32
import glob, os
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True
for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
print(infile)
doc = word.Documents.Open(infile)
So at this point I can do something like
print(doc.Content.Text)
And see the contents of the files, but it still looks like there is some odd formatting in there and I have no idea how to actually parse through to grab the data I need. I can create RE's that will successfully find the strings that I'm looking for, I just don't know how to implement them into the program using the COM.
The code I have so far was mostly found through Google. I don't even think this is that hard, it's just that reading through the object model on Microsoft's website is like reading a foreign language. Any help is MUCH appreciated. Thank you.
Edit: code I was using to save the files from docx to txt:
for path, dirs, files in os.walk(r'mypath'):
for doc in [os.path.abspath(os.path.join(path, filename)) for filename in files if fnmatch.fnmatch(filename, '*.docx')]:
print("processing %s" % doc)
wordapp.Documents.Open(doc)
docastxt = doc.rstrip('docx') + 'txt'
wordapp.ActiveDocument.SaveAs(docastxt,FileFormat=win32com.client.constants.wdFormatText)
wordapp.ActiveDocument.Close()
os.path.splitext(infile)[0] + '.txt'), which you can then operate on in Python? It's a little hacky, but it does mean you don't have to learn Word's complicated document model. If that's not enough to go on, I can show you some (untested, but probably-close-to-correct) code to do the save-as and you can probably take it from there.wdFormatUnicodeText(that is,7) rather than one of the encoded plain-text formats. Just remember that what Microsoft calls "Unicode" is UTF-16-LE (with BOM), so you'd need toopen(txtfile, 'utf-16'). Second, Word always saves with DOS\r\nline endings (it can also do classic Mac, but not Unix, because Microsoft). So, make sure to use universal newlines when reading the files in Python.Documents.openmethod returns the openedDocument. Just use that; don't rely on the fact that the newly-opened document will usually become theActiveDocument.