
I am trying to modify my script to copy files using multiprocessing, as an exercise to learn more about multiprocessing in Python.

My main does this:

if __name__ == "__main__":
    #get command line arguments
    cmdlineArgs = getCmdLineArguments()
    #get all the files in folder
    listOfFiles = getFiles(cmdlineArgs.sourceDirectory)
    #create dataframe of files which needs to be copied
    filesDF = createDF(listOfFiles, cmdlineArgs.destDirectory)
    processes = []
    lstOfDates = list(set(filesDF['date'].to_list()))
    lstOfDates.sort()
    # for dt in lstOfDates:
    #     copyFilesAcross([filesDF, [dt]])
    splitListOfDatesForProc = [(lstOfDates[i:i+3]) for i in range(0, len(lstOfDates), 3)]
    for dt in splitListOfDatesForProc:
        p = Process(target=copyFilesAcross, args=([filesDF, dt],))
        processes.append(p)
        p.start()

    for p in processes:
        p.join()

copyFilesAcross does this:

def copyFilesAcross(lst):
    #keep only the date provided as parameter
    df = lst[0]
    dt = lst[1]
    for d in dt:
        df = df[df.date == d]
        print("Processing date " + d + ' for PID: ', os.getpid())
        for index,row in df.iterrows():
            try:
                #print('Making directory ' + row['destination'])
                os.makedirs(row['destination'], exist_ok=True)
                shutil.copy(row['source'], row['destination'])
            except OSError as e:
                print('Failed to copy file ' + row['source'] + ' with error {0}'.format(e) )
            except:
                print("Unexpected error: ", sys.exc_info()[0])

Output:

getFiles: Executed ...
getFiles: Creating empty list ...
getFiles: Concatenating files ...
Creating dataframe of files to be copied ...
Creating empty dataframe ...
Populating dataframe ...
Sorting data frame by date ...
Processing date 20180204 for PID:  35033 <- processed
Processing date 20180304 for PID:  35034 <- processed
Processing date 20180811 for PID:  35038 <- processed
Processing date 20180815 for PID:  35041 <- processed
Processing date 20180311 for PID:  35034 <- not processed
Processing date 20180724 for PID:  35034 <- not processed
Processing date 20180222 for PID:  35033 <- not processed
Processing date 20180303 for PID:  35033 <- not processed
Processing date 20180812 for PID:  35038 <- not processed
Processing date 20180813 for PID:  35038 <- not processed

Process finished with exit code 0

Without multiprocessing the script runs fine, so I assume the issue is in the last two for loops in main, but I am not sure what I am doing wrong.

1 Answer

This isn't exactly a multiprocessing problem; you just have a bug in your code.

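On the first loop iteration in copyFilesAcross, you overwrite df and throw away every row except those that match the first date in dt. On the next (and every subsequent) iteration of for d in dt:, you filter that already-reduced df for a different date, which is no longer in it, so df becomes an empty dataframe. When you then call for index,row in df.iterrows():, there are no rows, so the loop body never executes. This is also why the version without multiprocessing works: it passes a single date per call, so there is never a second iteration. Filtering into a separate variable instead of reassigning df avoids the problem.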
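As a minimal sketch of that fix, assuming the dataframe has the date, source and destination columns used in the question: the version below keeps df untouched and stores each day's rows in a separate variable. The names copy_files_across and day_df, and the returned list of copied paths, are illustrative additions and not part of the original script.

```python
import os
import shutil

import pandas as pd


def copy_files_across(lst):
    """Copy the files for every date in lst[1], looked up in the full frame lst[0]."""
    df, dates = lst
    copied = []  # hypothetical addition: lets the caller check what was copied
    for d in dates:
        # Filter into a new name so df still holds every date on the next pass.
        day_df = df[df["date"] == d]
        print("Processing date " + d + " for PID: ", os.getpid())
        for _, row in day_df.iterrows():
            try:
                os.makedirs(row["destination"], exist_ok=True)
                shutil.copy(row["source"], row["destination"])
                copied.append(row["source"])
            except OSError as e:
                print("Failed to copy file {0} with error {1}".format(row["source"], e))
    return copied
```

It can be passed to Process in the same way as the original function, for example Process(target=copy_files_across, args=([filesDF, dt],)).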
