Hi, I have data like this:

data = [{'name': 'root/folder1/f1/s1.csv' , 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f2/s2/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)}, 
        {'name': 'root/folder2/f_1/f_2/f_3/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f_1/f_2/f_3/f_4/f_5/file.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)}, 
        {'name': 'root/folder2/f3/s3/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder3/f3/s3/s4/file4.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name' : 'root/folder3/f3/s3/s4/s5/s6/file4.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)}
       ]

I want to get the files in each folder that have the shortest path. For example, folder1 contains only one file, so it is returned as is. In folder2, two nested paths carry a file, root/folder2/f_1/f_2/f_3 and root/folder2/f_1/f_2/f_3/f_4/f_5, so I want only the shorter one here. A third path, 'root/folder2/f3/s3/file.csv', also exists in folder2, but it is returned as is. folder3 should likewise return only the file with the shortest path, root/folder3/f3/s3/s4/file4.csv.

Expected output

data = [{'name': 'root/folder1/f1/s1.csv'},
        {'name': 'root/folder2/f2/s2/file.csv'}, 
        {'name': 'root/folder2/f_1/f_2/f_3/file.csv'},
        {'name': 'root/folder2/f3/s3/file.csv'},
        {'name': 'root/folder3/f3/s3/s4/file4.csv'}
       ]

What I have tried so far: I am trying to get the paths with the fewest slashes, but I am not sure how to check each subfolder. For example, I did this:

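import os

# Keep one entry per directory: the most recently modified file,
# plus a count of how many files were seen in that directory.
data_dict = {}
for item in data:
    folder = os.path.dirname(item['name'])  # avoid shadowing the built-in dir()
    if folder not in data_dict:
        item['count'] = 1
        data_dict[folder] = item
    else:
        count = data_dict[folder]['count'] + 1
        if item['last_modified'] > data_dict[folder]['last_modified']:
            data_dict[folder] = item
        data_dict[folder]['count'] = count

result = list(data_dict.values())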
  • Try this: [line for line in data if len(line['name'].split("/")) <= 6] Commented Aug 16, 2022 at 7:07
  • @Alexander yes, 1 for folder3 and 1 for folder1, but 3 for folder2, because as you can see the paths after folder2/ are different. Commented Aug 16, 2022 at 7:11
  • @Alexander I want to keep track of the subfolders as well, since there could be different files in those subdirectories. Commented Aug 16, 2022 at 7:14
  • @Alexander only /root/folder2/file.csv, as I got a file at the root of that folder. Commented Aug 16, 2022 at 7:23
  • @Alexander but if there is no file directly under folder2, then it should check all subdirectories and find the files there. Commented Aug 16, 2022 at 7:24

1 Answer

Something like this would probably work.

import os
import datetime
from collections import Counter

data = [{'name': 'root/folder1/f1/s1.csv' , 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f2/s2/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f_1/f_2/f_3/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f_1/f_2/f_3/f_4/f_5/file.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f3/s3/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder3/f3/s3/s4/file4.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name' : 'root/folder3/f3/s3/s4/s5/s6/file4.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)}
       ]

results = []

# this next line creates a list of all the paths minus their file name
# and counts them, which shows us how many duplicate paths there are
# so we can filter those based on the timestamp later on
paths = Counter([os.path.dirname(i['name']) for i in data])

for row in data:
    name = row["name"]
    path, filename = os.path.split(name) # split the path from filename

    # this next block is where we check if the directory holds more than
    # one file; if it does, it compares the timestamps and either
    # ignores the entry if it isn't the most recent, or it allows
    # the loop to continue through the rest of the logic
    # if you want to allow to keep 2 files instead of 1 >>>
    if paths[path] > 1:
        # this `lst` contains only the files that sit directly in this directory
        lst = [i for i in data if os.path.dirname(i['name']) == path]
        # >>> you would run this next line again after removing
        # the first result from the `lst` above, and allow the script
        # to continue for both of the collected output files.
        latest = max(lst, key=lambda x: x['last_modified'])
        if latest['name'] != name:
            continue

    # this next loop walks up through each parent directory and checks
    # whether that parent directly contains a file of its own (i.e. it
    # is a key in `paths`); if it does, the loop breaks and we move on
    # to the next item in `data` >>>
    while path:
        dirname = os.path.dirname(path)
        if dirname in paths:
            break
        path = dirname
    # >>> if no parent holds a file then this is the shallowest copy,
    # so it appends the full pathname to the results list
    else:
        results.append({'name': name})

print(results)

OUTPUT

[
  {'name': 'root/folder1/f1/s1.csv'}, 
  {'name': 'root/folder2/f2/s2/file.csv'}, 
  {'name': 'root/folder2/f_1/f_2/f_3/file.csv'}, 
  {'name': 'root/folder2/f3/s3/file.csv'}, 
  {'name': 'root/folder3/f3/s3/s4/file4.csv'}
]
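
The parent-directory check above can also be sketched as a small reusable function. This is only an illustration under some assumptions: the names `has_file_in_ancestor` and `shallowest_files` are hypothetical (not from the answer), it ignores `last_modified` entirely, and it adds a guard so that an absolute path such as `/root/...` would not loop forever once the walk reaches `/`.

```python
import os

def has_file_in_ancestor(folder, dirs_with_files):
    """Return True if any ancestor of `folder` directly contains a file."""
    parent = os.path.dirname(folder)
    # Stop at '' (top of a relative path) or when dirname stops changing ('/').
    while parent and parent != folder:
        if parent in dirs_with_files:
            return True
        folder, parent = parent, os.path.dirname(parent)
    return False

def shallowest_files(records):
    """Keep records whose folder has no file-holding ancestor (hypothetical helper)."""
    # Every directory that directly contains at least one file.
    dirs_with_files = {os.path.dirname(r["name"]) for r in records}
    return [
        {"name": r["name"]}
        for r in records
        if not has_file_in_ancestor(os.path.dirname(r["name"]), dirs_with_files)
    ]

# Same names as in the question, without the timestamps.
sample = [
    {"name": "root/folder1/f1/s1.csv"},
    {"name": "root/folder2/f2/s2/file.csv"},
    {"name": "root/folder2/f_1/f_2/f_3/file.csv"},
    {"name": "root/folder2/f_1/f_2/f_3/f_4/f_5/file.csv"},
    {"name": "root/folder2/f3/s3/file.csv"},
    {"name": "root/folder3/f3/s3/s4/file4.csv"},
    {"name": "root/folder3/f3/s3/s4/s5/s6/file4.csv"},
]
result = shallowest_files(sample)
```

Using a set of directories makes each membership test cheap, and the helper compares whole directory strings, so a sibling such as `f_10` is never mistaken for a child of `f_1`.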
11 Comments

Works perfectly. Just one case: how can I handle 'root/s1.csv'? If there is a file at the root, I want to get just that one and end the loop.
Yes, these are paths.
And there was a case where, if there are multiple files in a folder, even with different names, the one with the latest last_modified date should be picked. Can you please check this?
Yes, it's working. Only the same-path timestamp update remains.
Thanks for helping. Just to clarify one thing: where in the code could I add a count of the files ignored on the basis of last_modified? For example, if there were 2 files in a folder, root/folder1/sample.csv and root/folder1/sample2.csv, and I keep root/folder1/sample2.csv on the basis of last_modified, I want to keep count: 2 in my object.
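
One possible way to approach the last two requests (keep the latest file per folder and record how many files that folder held) is sketched below. It is not taken from the answer: the function name `latest_per_folder` is hypothetical, the timestamps in the example are made up for illustration, and ties on `last_modified` are assumed to keep the file seen first.

```python
import datetime
import os

def latest_per_folder(records):
    """For each folder keep the most recently modified file plus a file count."""
    latest = {}
    for record in records:
        folder = os.path.dirname(record["name"])
        current = latest.get(folder)
        if current is None:
            # First file seen in this folder.
            latest[folder] = {**record, "count": 1}
        else:
            count = current["count"] + 1
            # Strictly newer wins; on a tie the earlier entry is kept.
            newest = record if record["last_modified"] > current["last_modified"] else current
            latest[folder] = {**newest, "count": count}
    return list(latest.values())

# Illustrative input modelled on the comment; the dates are invented.
example = [
    {"name": "root/folder1/sample.csv",
     "last_modified": datetime.datetime(2022, 8, 4, 18, 43, 13)},
    {"name": "root/folder1/sample2.csv",
     "last_modified": datetime.datetime(2022, 8, 5, 9, 0, 0)},
]
result = latest_per_folder(example)
```

Here `result` holds a single entry for root/folder1/sample2.csv with `count` set to 2. Building new dictionaries instead of mutating the input keeps `data` unchanged, and the returned list could then be fed to the answer's shallowest-path loop in place of `data`.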