0

I have a list that looks like this:

json_file_list = ['349148424_20180312071059_20190402142033.json','349148424_20180312071059_20190405142033.json','360758678_20180529121334_20190402142033.json']

and an empty list:

list2 = []

What I want to do is compare the characters up to the second underscore '_' and, if they are the same, append only the max of the full strings to the new list. In the case above, the first 2 entries are duplicates (up to the second underscore), so I want to base the max on the numbers after the second underscore. The final list2 would then have only 2 entries, not 3.

I tried this:

for row in json_file_list:
    if row[:24] == row[:24]:
        list2.append(max(row))
    else:
        list2.append(row)

but that is just returning:

['s', 's', 's']

Final output should be:

['349148424_20180312071059_20190405142033.json','360758678_20180529121334_20190402142033.json']

Any ideas? I also realize this code is brittle because of the fixed-width slice (what happens if the string gets longer or shorter?), so I need to come up with a better way to do that. Maybe base it off the second underscore instead. The strings will always end with '.json'.

4
  • 1
    This is what your code is doing right now: row[:24] == row[:24] is always true because both sides are the same expression, so you always run list2.append(max(row)), which appends the highest-sorting character in the string (in this case the 's' in .json) to the end of the list. Commented Apr 26, 2019 at 18:39
  • Thanks Kaiwen, I understand that part and I need to figure out the slicing piece better, but even so, shouldn't the final list be ['s','s', '<3rd full string>']? I'm not doing a max on the 3rd item in the original list since it's not a duplicate of the first 2. Commented Apr 26, 2019 at 18:43
  • I am having trouble understanding exactly what you want as your output. If you could add that to your post, I could help answer. Commented Apr 26, 2019 at 18:46
  • done. sorry for the confusion Commented Apr 26, 2019 at 18:48

3 Answers

1

I'd use a dictionary to do this:

from collections import defaultdict

d = defaultdict(list)
for x in json_file_list:
    d[tuple(x.split("_")[:2])].append(x)


new_list = [max(x) for x in d.values()]
new_list

Output:

['349148424_20180312071059_20190405142033.json',
 '360758678_20180529121334_20190402142033.json']
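
The max(x) call above compares whole filenames as strings. That gives the right answer here because names in the same group share a prefix and the trailing timestamps all have the same number of digits. As a hedged variant, the sketch below keeps one entry per group in a single pass and compares the trailing timestamp as an integer. The helper name split_name and the dict name latest are illustrative, and the code assumes every name has the form <id>_<timestamp>_<timestamp>.json.

```python
json_file_list = [
    '349148424_20180312071059_20190402142033.json',
    '349148424_20180312071059_20190405142033.json',
    '360758678_20180529121334_20190402142033.json',
]

def split_name(name):
    """Return (group key, trailing timestamp as int) for one filename."""
    # Assumption: names look like <id>_<timestamp>_<timestamp>.json
    stem = name[:-len('.json')] if name.endswith('.json') else name
    first, second, stamp = stem.split('_', 2)
    return (first, second), int(stamp)

latest = {}  # group key -> (timestamp, full filename)
for name in json_file_list:
    key, stamp = split_name(name)
    # Keep the entry with the largest trailing timestamp per group.
    if key not in latest or stamp > latest[key][0]:
        latest[key] = (stamp, name)

list2 = [name for _, name in latest.values()]
print(list2)
```

On Python 3.7+ a dict keeps insertion order, so the groups come out in the order they first appear in json_file_list.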

1

The if statement in this snippet:

for row in json_file_list:
    if row[:24] == row[:24]:
        list2.append(max(row))
    else:
        list2.append(row)

always evaluates to True. Think about it: how could row[:24] be different from itself? Because the condition is always True, every iteration appends max(row), the character with the highest code point in the string ('s' in this case), to list2. The else branch never runs, not even for the third entry. That's why you're getting an output of ['s', 's', 's'].
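
A quick way to check both points is to run the pieces in isolation. This is only a small demonstration using two of the filenames from the question; the variable name other is illustrative.

```python
row = '349148424_20180312071059_20190402142033.json'
other = '349148424_20180312071059_20190405142033.json'

# Comparing a slice with itself can never be False,
# so the else branch of the original loop is unreachable.
print(row[:24] == row[:24])   # True

# max() on a single string iterates over its characters and
# returns the one with the highest code point.
print(max(row))               # s

# max() on a list of strings compares the whole strings instead.
print(max([row, other]))      # 349148424_20180312071059_20190405142033.json
```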

Maybe I'm understanding your request incorrectly, but couldn't you just append all the elements of the row to a list and then remove duplicates?

for row in json_file_list:
    for elem in row:
        list2.append(elem)
list2 = sorted(list(set(list2)))

1 Comment

Thanks @alec_a, the top part makes sense to me now. The bottom part would make sense if duplicates were based on the entire string. But as I mentioned, duplicates are determined only by the text up to the 2nd underscore. Based on that, I want to keep only the max of those 2 strings (based on the numbers AFTER the second underscore). If you look closely, those numbers are different.
1

I suppose you can slice out the part you want to compare and use the built-in set to remove the duplicates:

set([x[:24] for x in json_file_list])
set(['360758678_20180529121334', '349148424_20180312071059'])

It would be a simple matter of joining the remaining text later on:

list2=[]
for unique in set([x[:24] for x in json_file_list]):
  list2.append(unique + json_file_list[0][24:])

list2
['360758678_20180529121334_20190402142033.json',
 '349148424_20180312071059_20190402142033.json']

2 Comments

thanks Jason, unfortunately the duplicate that returns is the MIN one, not the MAX one (if you look carefully at the numbers after the second underscore, compared to the original list). I see where you were going with it though....
Oh I see! A simple sort would do the trick in that case. And I see @Lante Dellarovere has done that. Cheers!... No wait. Now I really do see it. Yes, those numbers are very difficult to spot.
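
For reference, the snippet in this answer always reuses the suffix of json_file_list[0], which is why it returns the earlier file. A minimal sketch of the same set-based idea that picks the max per prefix is shown below. The helper name prefix is illustrative, and it splits on underscores rather than relying on a fixed slice of 24 characters.

```python
json_file_list = [
    '349148424_20180312071059_20190402142033.json',
    '349148424_20180312071059_20190405142033.json',
    '360758678_20180529121334_20190402142033.json',
]

def prefix(name):
    # Text before the second underscore; no fixed slice width needed.
    return '_'.join(name.split('_')[:2])

list2 = []
# sorted() makes the output order deterministic, since sets are unordered.
for unique in sorted({prefix(name) for name in json_file_list}):
    # All filenames that share this prefix.
    candidates = [name for name in json_file_list if prefix(name) == unique]
    list2.append(max(candidates))

print(list2)
```

This rescans the list once per unique prefix, which is fine for short lists; for long ones a dictionary keyed on the prefix avoids the repeated scans.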
