Returning max of string after comparison with other sub-strings - Python

Question

I have a list that looks like this:

json_file_list = ['349148424_20180312071059_20190402142033.json','349148424_20180312071059_20190405142033.json','360758678_20180529121334_20190402142033.json']

and an empty list:

list2 = []

What I want to do is compare the characters up to the second underscore '_' and, if they are the same, append only the max of the full strings to the new list. In the case above, the first 2 entries are duplicates (up to the second underscore), so I want to base the max on the numbers after the second underscore. The final list2 would then have only 2 entries, not 3.

I tried this:

for row in json_file_list:
    if row[:24] == row[:24]:
        list2.append(max(row))
    else:
        list2.append(row)

but that is just returning:

['s', 's', 's']

Final output should be:

['349148424_20180312071059_20190405142033.json','360758678_20180529121334_20190402142033.json']

Any ideas? I also realize this code is brittle because of the way I am slicing it (what happens if the string gets longer or shorter?), so I need to come up with a better way to do that. Maybe base it off the second underscore instead. The strings will always end with '.json'.
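One way to avoid the hard-coded row[:24] slice is to split on the underscore itself. The following is a minimal sketch, assuming every file name contains at least two underscores; the helper name split_name is hypothetical and not part of the original post.

```python
def split_name(filename):
    """Split a file name at its second underscore.

    Returns (prefix, rest), where prefix is everything before the
    second underscore and rest is everything after it.
    Assumes the name contains at least two underscores.
    """
    first, second, rest = filename.split("_", 2)
    return first + "_" + second, rest


name = "349148424_20180312071059_20190402142033.json"
prefix, rest = split_name(name)
print(prefix)  # 349148424_20180312071059
print(rest)    # 20190402142033.json
```

Because the split is driven by the delimiter rather than a fixed index, it keeps working if the numeric fields become longer or shorter.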

This is what your code is doing right now: row[:24] == row[:24] is always true because both sides are the same thing, so you are doing list2.append(max(row)), which appends the character furthest down in the alphabet (in this case the 's' in .json) to the end of the list.
– Kaiwen Chen, Commented Apr 26, 2019 at 18:39
Thanks Kaiwen, I understand that part and I need to figure out the slicing piece better, but even so, shouldn't the final list be ['s','s', '<3rd full string>']? I'm not doing a max on the 3rd item in the original list since it's not a duplicate of the first 2.
– JD2775, Commented Apr 26, 2019 at 18:43
I am having trouble understanding exactly what you want as your output. If you could add that to your post, I could help answer.
– Kaiwen Chen, Commented Apr 26, 2019 at 18:46

Lante Dellarovere · Accepted Answer · 2019-04-26 20:25:38Z

1

I'd use a dictionary to do this:

from collections import defaultdict

d = defaultdict(list)
for x in json_file_list:
    d[tuple(x.split("_")[:2])].append(x)


new_list = [max(x) for x in d.values()]
new_list

Output:

['349148424_20180312071059_20190405142033.json',
 '360758678_20180529121334_20190402142033.json']
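max works here because Python compares strings lexicographically, and the trailing timestamps are fixed-width digit strings, so the lexicographically largest name in each group is also the one with the largest number. As a variation on the same idea, the sketch below keeps only the current maximum per key instead of storing every name; the variable name latest is an assumption for illustration, and the output order relies on dicts preserving insertion order (Python 3.7+).

```python
json_file_list = [
    "349148424_20180312071059_20190402142033.json",
    "349148424_20180312071059_20190405142033.json",
    "360758678_20180529121334_20190402142033.json",
]

# Map each (first field, second field) key to the largest name seen so far.
latest = {}
for name in json_file_list:
    key = tuple(name.split("_")[:2])
    if key not in latest or name > latest[key]:
        latest[key] = name

new_list = list(latest.values())
print(new_list)
# ['349148424_20180312071059_20190405142033.json',
#  '360758678_20180529121334_20190402142033.json']
```

Note that the string comparison would stop matching numeric order if the timestamps could differ in length; in that case the numeric part would need to be converted with int before comparing.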

answered Apr 26, 2019 at 20:25

Lante Dellarovere


Alec · Accepted Answer · 2019-04-26 19:05:21Z

1

The if statement in this snippet:

for row in json_file_list:
    if row[:24] == row[:24]:
        list2.append(max(row))
    else:
        list2.append(row)

always resolves to True. Think about it: how could row[:24] be different from itself? Because the condition is always True, the loop appends max(row), which for a string is its highest-sorting character (s in this case), to list2. That's why you're getting an output of ['s', 's', 's'].
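A short check makes both points concrete. This is only an illustrative sketch using one of the file names from the question: max applied to a single string returns one character, while max applied to a list of strings returns a whole string.

```python
row = "349148424_20180312071059_20190402142033.json"

# Comparing a slice with itself can never be False.
print(row[:24] == row[:24])  # True

# max() over a single string iterates its characters.
print(max(row))  # s

# max() over a list of strings compares the whole strings.
pair = [
    "349148424_20180312071059_20190402142033.json",
    "349148424_20180312071059_20190405142033.json",
]
print(max(pair))  # 349148424_20180312071059_20190405142033.json
```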

Maybe I'm understanding your request incorrectly, but couldn't you just append all the elements of the row to a list and then remove duplicates?

for row in json_file_list:
    for elem in row:
        list2.append(elem)
list2 = sorted(list(set(list2)))

answered Apr 26, 2019 at 19:05

Alec

1 Comment

JD2775 Over a year ago

Thanks @alec_a, the top part makes sense to me now. The bottom part, though, would make sense only if duplicates were based on the entire string. But as I mentioned, the duplicates are found only up to the 2nd underscore. Based on that, I want to keep only the max of those 2 strings (based on the numbers AFTER the second underscore). If you look closely, those numbers are different.

Jason Miller · Accepted Answer · 2019-04-26 20:45:01Z

1

I suppose you can slice out the part you want to compare and use the built-in 'set' to find the unique prefixes:

set([x[:24] for x in json_file_list])
set(['360758678_20180529121334', '349148424_20180312071059'])

It would be a simple matter of joining the remaining text later on

list2=[]
for unique in set([x[:24] for x in json_file_list]):
  list2.append(unique + json_file_list[0][24:])

list2
['360758678_20180529121334_20190402142033.json',
 '349148424_20180312071059_20190402142033.json']

edited Apr 26, 2019 at 20:45

answered Apr 26, 2019 at 20:31

Jason Miller

2 Comments

JD2775 Over a year ago

thanks Jason, unfortunately the duplicate that returns is the MIN one, not the MAX one (if you look carefully at the numbers after the second underscore, compared to the original list). I see where you were going with it though....

Jason Miller Over a year ago

Oh I see! A simple sort would do the trick in that case. And I see @Lante Dellarovere has done that. Cheers!... No wait. Now I really do see it. Yes, those numbers are very difficult to spot.
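The sorting idea mentioned in this comment can be sketched with itertools.groupby. This is one possible reading of "a simple sort", not code from any of the answers; the helper name prefix is hypothetical. Sorting the full names puts entries that share the first two fields next to each other, so each group can be reduced to its maximum.

```python
from itertools import groupby

json_file_list = [
    "349148424_20180312071059_20190402142033.json",
    "349148424_20180312071059_20190405142033.json",
    "360758678_20180529121334_20190402142033.json",
]


def prefix(name):
    """Return the first two underscore-separated fields as the group key."""
    return name.split("_")[:2]


# groupby only merges adjacent items, so the list must be sorted first.
list2 = [max(group) for _, group in groupby(sorted(json_file_list), key=prefix)]
print(list2)
# ['349148424_20180312071059_20190405142033.json',
#  '360758678_20180529121334_20190402142033.json']
```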
