Filter objects in python list on the basis of string value

Question

I have data like this:

data = [{'name': 'root/folder1/f1/s1.csv' , 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f2/s2/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)}, 
        {'name': 'root/folder2/f_1/f_2/f_3/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f_1/f_2/f_3/f_4/f_5/file.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)}, 
        {'name': 'root/folder2/f3/s3/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder3/f3/s3/s4/file4.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name' : 'root/folder3/f3/s3/s4/s5/s6/file4.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)}
       ]

I want to get the files in each folder that have the shortest path. For example, folder1 contains only one file, so it is returned as is. In folder2, two paths carry a file, root/folder2/f_1/f_2/f_3 and root/folder2/f_1/f_2/f_3/f_4/f_5, so I want only the shorter one here. A third path also exists in folder2, 'root/folder2/f3/s3/file.csv', and it is returned as is. For folder3, I likewise want only the file with the shortest path, root/folder3/f3/s3/s4/file4.csv.

Expected output

data = [{'name': 'root/folder1/f1/s1.csv'},
        {'name': 'root/folder2/f2/s2/file.csv'}, 
        {'name': 'root/folder2/f_1/f_2/f_3/file.csv'},
        {'name': 'root/folder2/f3/s3/file.csv'},
        {'name': 'root/folder3/f3/s3/s4/file4.csv'}
       ]

What I have tried so far: I am trying to pick the paths with the fewest slashes, but I am not sure how to check each subfolder. For example, I did this:

import os

data_dict = {}
for item in data:
    folder = os.path.dirname(item['name'])
    if folder not in data_dict:
        item['count'] = 1
        data_dict[folder] = item
    else:
        count = data_dict[folder]['count'] + 1
        if item['last_modified'] > data_dict[folder]['last_modified']:
            data_dict[folder] = item
        data_dict[folder]['count'] = count

result = list(data_dict.values())

Try this: [line for line in data if len(line['name'].split("/")) <= 6]
– Cow, Commented Aug 16, 2022 at 7:07
@Alexander yes, 1 for folder3 and 1 for folder1, but 3 for folder2, because as you can see the paths after folder2/ are different.
– newbiee, Commented Aug 16, 2022 at 7:11
@Alexander I want to keep track of the subfolders as well, since there could be different files in those subdirectories.
– newbiee, Commented Aug 16, 2022 at 7:14
@Alexander only /root/folder2/file.csv, since I got a file at the root.
– newbiee, Commented Aug 16, 2022 at 7:23
@Alexander but if there is no file immediately under folder2, then it should check all the subdirectories and find the files there.
– newbiee, Commented Aug 16, 2022 at 7:24

Alexander · Accepted Answer · 2022-08-16 10:05:59Z

Something like this would probably work.

import os
import datetime
from collections import Counter

data = [{'name': 'root/folder1/f1/s1.csv' , 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f2/s2/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f_1/f_2/f_3/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f_1/f_2/f_3/f_4/f_5/file.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder2/f3/s3/file.csv', 'last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name': 'root/folder3/f3/s3/s4/file4.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)},
        {'name' : 'root/folder3/f3/s3/s4/s5/s6/file4.csv','last_modified': datetime.datetime(2022, 8, 4, 18, 43, 13)}
       ]

results = []

# this next line creates a list of all the paths minus their file name
# and counts them, which shows us how many duplicate paths there are
# so we can filter those based on the timestamp later on
paths = Counter([os.path.dirname(i['name']) for i in data])

for row in data:
    name = row["name"]
    path, filename = os.path.split(name) # split the path from filename

    # If more than one file sits in this directory, keep only the most
    # recently modified one and skip the others. To keep the two newest
    # files instead of one, remove the winner from `lst`, run `max`
    # again, and let both files continue through the rest of the loop.
    if paths[path] > 1:
        # `lst` holds only the files located directly in this directory
        lst = [i for i in data if os.path.dirname(i['name']) == path]
        latest = max(lst, key=lambda x: x['last_modified'])
        if latest['name'] != name:
            continue

    # Walk up through the parent directories of the current path. If
    # any ancestor directly contains a file (it is a key in `paths`),
    # a shallower file exists, so break and move on to the next item
    # in `data`.
    while path:
        dirname = os.path.dirname(path)
        if dirname in paths:
            break
        path = dirname
    # The `else` runs only when the loop ends without a break, which
    # means this is the shallowest file, so append it to the results.
    else:
        results.append({'name': name})

print(results)

OUTPUT

[
  {'name': 'root/folder1/f1/s1.csv'}, 
  {'name': 'root/folder2/f2/s2/file.csv'}, 
  {'name': 'root/folder2/f_1/f_2/f_3/file.csv'}, 
  {'name': 'root/folder2/f3/s3/file.csv'}, 
  {'name': 'root/folder3/f3/s3/s4/file4.csv'}
]
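
The ancestor check can also be sketched with pathlib instead of an explicit while/else loop. This is an alternative formulation, not the code from the answer: `shallowest_files` is a hypothetical helper name, it ignores `last_modified` entirely, and it assumes the names are forward-slash paths like the ones in the question.

```python
from pathlib import PurePosixPath


def shallowest_files(rows):
    """Keep rows whose folder has no ancestor folder that directly holds a file."""
    # Every folder that directly contains at least one file.
    folders = {PurePosixPath(row["name"]).parent for row in rows}

    kept = []
    for row in rows:
        folder = PurePosixPath(row["name"]).parent
        # `folder.parents` yields each ancestor of the folder, nearest first.
        if not any(ancestor in folders for ancestor in folder.parents):
            kept.append({"name": row["name"]})
    return kept


sample = [
    {"name": "root/folder2/f_1/f_2/f_3/file.csv"},
    {"name": "root/folder2/f_1/f_2/f_3/f_4/f_5/file.csv"},
    {"name": "root/folder2/f3/s3/file.csv"},
]
print(shallowest_files(sample))
# [{'name': 'root/folder2/f_1/f_2/f_3/file.csv'}, {'name': 'root/folder2/f3/s3/file.csv'}]
```

Under the same assumptions, a file placed directly in root (for example 'root/s1.csv') makes 'root' an ancestor that holds a file, so every deeper file is dropped and only that one remains.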

edited Aug 16, 2022 at 10:05

answered Aug 16, 2022 at 8:06


11 Comments

newbiee Over a year ago

Works perfectly. Just one case: how can I handle 'root/s1.csv'? If there is a file directly in root, I want to get only that one and end the loop.

newbiee Over a year ago

yes these are paths

newbiee Over a year ago

There was also a case where a folder contains multiple files, even with different names; then the one with the latest last_modified should be picked. Can you please check this?

newbiee Over a year ago

Yes, it's working. Only the last_modified handling for files in the same path remains.

newbiee Over a year ago

Thanks for helping. Just to clarify one thing: where in the code can I add a count of the files ignored on the basis of last_modified? For example, if there are 2 files in a folder, root/folder1/sample.csv and root/folder1/sample2.csv, and I keep root/folder1/sample2.csv on the basis of last_modified, I want to keep count: 2 in my object.
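
One possible way to get the per-folder count asked about here is to group the rows by folder in a separate pass before the shallowest-path filtering. The sketch below is only an illustration of that idea: `latest_per_folder` and the `count` key are hypothetical names, and it assumes `count` should be the total number of files seen in the folder, including the one that is kept.

```python
import datetime
import os


def latest_per_folder(rows):
    """For each folder keep the newest file and record how many files it held."""
    best = {}    # folder -> row with the newest last_modified seen so far
    counts = {}  # folder -> number of files seen in that folder
    for row in rows:
        folder = os.path.dirname(row["name"])
        counts[folder] = counts.get(folder, 0) + 1
        if folder not in best or row["last_modified"] > best[folder]["last_modified"]:
            best[folder] = row
    # Copy each winner so the input rows are not mutated, then attach the count.
    return [dict(row, count=counts[folder]) for folder, row in best.items()]


files = [
    {"name": "root/folder1/sample.csv", "last_modified": datetime.datetime(2022, 8, 4)},
    {"name": "root/folder1/sample2.csv", "last_modified": datetime.datetime(2022, 8, 5)},
]
for row in latest_per_folder(files):
    print(row["name"], row["count"])
# root/folder1/sample2.csv 2
```

The result of this pass could then be fed into the shallowest-path loop in place of `data`, so each surviving entry already carries its count.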
