DOCX file to text file conversion using Python

Question

I wrote the following code to convert my .docx file to a text file. The text file ends up containing only the last paragraph of the document, not the complete content. The code is as follows:

from docx import Document
import io
import shutil

def convertDocxToText(path):
    for d in os.listdir(path):
        fileExtension=d.split(".")[-1]
        if fileExtension =="docx":
            docxFilename = path + d
            print(docxFilename)
            document = Document(docxFilename)


# for printing the complete document
            print('\nThe whole content of the document:->>>\n')
            for para in document.paragraphs:
                textFilename = path + d.split(".")[0] + ".txt"
                with io.open(textFilename,"w", encoding="utf-8") as textFile:
                    #textFile.write(unicode(para.text))
                    x=unicode(para.text)
                    print(x)  # the complete content gets printed by this line
                    textFile.write((x)) #after writing the content to text file only last paragraph is copied.
                #textFile.write(para.text)

path= "/home/python/resumes/"
convertDocxToText(path)

with io.open(textFilename,"w", encoding="utf-8") as textFile: is inside your for para in document.paragraphs: loop. This means you reopen the file in write mode on every iteration, wiping any existing content. You need to open the file once before running the loop, i.e. put the for loop inside the with block, not the other way round.
– roganjosh, Commented Oct 9, 2018 at 10:51
@sharayusalunkhe is this code currently working for you? Mine is giving errors even with the corrections.
– X-Black..., Commented Jun 12, 2019 at 2:04

sharayu salunkhe · Accepted Answer · 2018-10-09 12:32:41Z

3

The following is the solution to the above problem:

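from docx import Document
import io
import os

def convertDocxToText(path):
    # Convert every .docx file in the directory to a .txt file next to it.
    for d in os.listdir(path):
        baseName, fileExtension = os.path.splitext(d)
        if fileExtension == ".docx":
            docxFilename = os.path.join(path, d)
            print(docxFilename)
            document = Document(docxFilename)
            textFilename = os.path.join(path, baseName + ".txt")
            # Open the output file once, then write every paragraph into it.
            with io.open(textFilename, "w", encoding="utf-8") as textFile:
                for para in document.paragraphs:
                    # unicode() does not exist on Python 3, where para.text is
                    # already str; the newline keeps paragraphs on separate lines.
                    textFile.write(para.text + u"\n")

path = "/home/python/resumes/"
convertDocxToText(path)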

aasmpro · Accepted Answer · 2018-10-09 11:09:33Z

Problem

As your code shows, in the last for loop:

        for para in document.paragraphs:
            textFilename = path + d.split(".")[0] + ".txt"
            with io.open(textFilename,"w", encoding="utf-8") as textFile:
                x=unicode(para.text)
                textFile.write((x))

For each paragraph in the document, you open a file named textFilename. Say you have a file named MyFile.docx in /home/python/resumes/: textFilename is then /home/python/resumes/MyFile.txt on every iteration of the for loop. The problem is that you open that same file each time in w (write) mode, which overwrites the whole file content, so only the last paragraph remains.
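
The effect can be reproduced without any .docx file. The following is a minimal sketch, assuming Python 3; the paragraphs list and the temporary demo_path are hypothetical stand-ins for document.paragraphs and textFilename:

```python
import io
import os
import tempfile

# Hypothetical stand-ins for the paragraph texts of a document.
paragraphs = ["first paragraph", "second paragraph", "last paragraph"]

# Hypothetical output path in a fresh temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "MyFile.txt")

# Same shape as the loop above: the file is reopened in "w" mode on each
# iteration, and every open truncates it before the next write.
for text in paragraphs:
    with io.open(demo_path, "w", encoding="utf-8") as text_file:
        text_file.write(text + u"\n")

with io.open(demo_path, "r", encoding="utf-8") as text_file:
    leftover = text_file.read()

print(repr(leftover))  # only the last paragraph is left in the file
```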

Solution:

You must open the file once, outside that for loop, and then write the paragraphs to it one by one.
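
A minimal sketch of that structure, assuming Python 3. The helper name write_paragraphs, the sample list and the temporary out_path are hypothetical and only illustrate the order of the with block and the loop:

```python
import io
import os
import tempfile

def write_paragraphs(paragraph_texts, text_path):
    """Write each paragraph on its own line, opening the file only once."""
    with io.open(text_path, "w", encoding="utf-8") as text_file:
        # The loop lives inside the with block, so nothing is truncated
        # between paragraphs.
        for text in paragraph_texts:
            text_file.write(text + u"\n")

# Hypothetical sample data standing in for the document's paragraph texts.
sample = ["first paragraph", "second paragraph", "last paragraph"]
out_path = os.path.join(tempfile.mkdtemp(), "MyFile.txt")
write_paragraphs(sample, out_path)

with io.open(out_path, "r", encoding="utf-8") as text_file:
    written = text_file.read()

print(written)
```

With python-docx, the same helper could presumably be fed (para.text for para in document.paragraphs). Opening in "a" (append) mode inside the loop would also keep every paragraph, but it would add to whatever an earlier run left in the file, so opening once in "w" mode is usually the cleaner choice.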
