how to read a doc file in Python? some libraries showing errors while reading the file. sometimes we need to convert the doc file into docx at first and then use some library that supports this convention.
I found lots of articles and blogs talking about the "docx" file, but only some of them are talking about "doc" file. Found a base idea from this answer and want to share the basic idea with everyone. The code below reads a "doc" file and writes the lines into excel file that is separated with "\r".
import win32com.client # to work with doc file
import os # to find the absolute path
import xlsxwriter # For write in excel file
word = win32com.client.Dispatch("Word.Application")
word.visible = False
full_path = os.path.abspath("Test.doc")
wb = word.Documents.Open(full_path)
docs = word.ActiveDocument
docs = docs.Range().Text.split("\r") # reading the text and store the lines into a list
cnt=0
workbook = xlsxwriter.Workbook('Test.xlsx')
worksheet = workbook.add_worksheet("Test")
row = 0
for i in range(0, len(docs)): # computing each line of the word file
if "test" in docs[i]: # checking if this line have any test word
column = 0
'''Writing the line to the excel file'''
worksheet.write(row, column, docs[i])
cnt+=1
row += 1
workbook.close()
print("Number of test is: ",cnt)
word.Quit()