4

I would like to sort a numerical list by the frequencies of the elements. (I found several ways to do it.)

During my exploration, I tried the below example.

Question : How does list.sort(key=list.count) works? Is it possible to use list.count() during list.sort()?

I read that the key-function is evaluated for each element of the list before the sort and those values are used for the comparisons during the sort.

Also, I read somewhere that during sort() the list is kind of locked. (sorry, I can't find the reference now - I've read quite a lot of blogs and tutorials on this topic in the last few hours, Python documentation and How-To Sort included)

This is the example

### Python 3.7 ###

data = [22, 11, 33, 99, 88, 77, 22, 44, 55, 44, 66, 22]

# sort by value
data.sort()
print(data)
>>> [11, 22, 22, 22, 33, 44, 44, 55, 66, 77, 88, 99]

# sort by frequency, i.e. list.count()
data.sort(key=data.count)
print(data)
>>> [11, 22, 22, 22, 33, 44, 44, 55, 66, 77, 88, 99]
# expected >>> [11, 33, 55, 66, 77, 88, 99, 44, 44, 22, 22, 22]
# but no change, the value-sorted list is printed

# or
data.sort(key=lambda e: data.count(e))
print(data)
>>> [11, 22, 22, 22, 33, 44, 44, 55, 66, 77, 88, 99]
# expected >>> [11, 33, 55, 66, 77, 88, 99, 44, 44, 22, 22, 22]
# but no change, the value-sorted list is printed

note: no error message.

As an addition, I would like to mention that the following works as expected

max(data, key=data.count)

And, of course, this also gives the expected result

print(sorted(data, key=data.count))
>>> [11, 33, 55, 66, 77, 88, 99, 44, 44, 22, 22, 22]

By the documentation sorted() and sort() should return the same result, don't they?

Thanks for your insights!

EDIT:

By the documentation - as I understood :

  1. sort() takes the key-function and feeds the key-function with individual members of the list

    -> the calculated results are the number of occurrences of each element (equivalent element results with equal calculated result, as their frequency is the same in the list)

    : I'm not experienced to debug this deep in Python

    : itself data.count() returns the appropriate list of the frequencies, that I checked

  2. saves / caches the calculated results

    : that's the foundation of its efficiency

  3. uses the cached calculated results(!) to determine the order of the original list

    -> the least frequent elements are at the front of the list, and the most frequent at he back

    !!! this is not happening...

  4. saves the list in its new order in-place

    !!! ...OR this is not happening.

Additionally, as far as I understood (though not sure), somewhere during this process sort() 'locks away' the original list from other usage/access (and somewhere releases the lock - something about multi-threaded applications was in the explanation, as I recall).

IMPORTANT :

I'm not looking for a solution or code to sort the list - I'd appreciate an explanation of what's happening:

  • Why the result is the actual returned list and not my expectation?

  • In comparison, why sorted() works as expected?

6
  • @juanpa.arrivillaga I don't understand what's going on in the background - certainly it's not what I suspected, that's the reason I included the actual output and the expected output. Extended the question with the explanation why I expected that. Commented Jul 19, 2020 at 13:15
  • Ah, I didn't read carefully enough. that is odd and looks like a bug. As an aside, you should note that this is very inefficient, it degrades the sorting algorithm to O(N^2) instead of O(N * log N) Commented Jul 19, 2020 at 19:16
  • To be picky on wording :-) Actually, the efficiency of the sorting algorithm's implementation stays the same. That's the calculation of the elements' 'weight' (before the sort actually starts) - some other examples are given here among the answers, and I assume that also the developers did a good job when they implemented list.count(). Something to find out with looking in to the source code and/or benchmarking. Commented Jul 20, 2020 at 11:27
  • list.count is an O(n) operation, so just calling [x.count() for x in some_list] is an O(N**2) operation. So it definitely impacts the efficiency of the algorithm. If you don't want to do that, do something like from collections import Counter; counts = Counter(some_list); some_list.sort(key=counts.get) Commented Jul 20, 2020 at 11:29
  • I agree with you, and thanks for the explanation. I just wanted to be more accurate (than enough accurate, maybe) as separating the frequency calculation from sorting as they happen one after the other.. Commented Jul 20, 2020 at 12:29

3 Answers 3

3

OK, according to the documentation:

CPython implementation detail: While a list is being sorted, the effect of attempting to mutate, or even inspect, the list is undefined. The C implementation of Python makes the list appear empty for the duration, and raises ValueError if it can detect that the list has been mutated during a sort.

If the bolded part is the case, then data.count will return 0 for any element, and the sort won't change the order of the list.

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks for finding the relevant lines in the documentation! I had read it several times, did not figure out what it exactly means in this case.
@all DOCUMENTATION - There might be a notice at CPython implementation detail? e.g. "This is also relevant for the list's own methods as the passed key-function, e.g. list.sort(key=list.count) will not sort the list by the frequency of its elements (because already at the beginning of the sort the list.count() sees an empty list)." - Also at Odd and Ends something similar. - Emphasise this (peculiar) difference in the behaviour of list.sort() and sorted().
3

This is an interesting question, I do not have the full answer, as it is somewhere in the source code here: https://github.com/python/cpython/blob/master/Objects/listobject.c

However, you can have a part of the answer by using the following function as a key:

def count(e):
   print(data)
   return data.count(e)

If you do that, you will see that it only prints "[]". This means that somehow during the inplace sorting process, probably in order to avoid messing with you list, your list now points to an empty list (even though the reference itself, data, has not changed). Thus data.count(e) is always equal to 0, and your list is left unchanged.

Thus the only way to use your list during the inplace sorting process is to copy the list, you can do for instance:

data.sort(key=data.copy().count)

I will add that this does not increase very much the cost of the whole process to copy the list, as the line above is already O(n² log(n)) O(n²) (thanks to Kelly Bundy for pointing this out) . Indeed, this is a very bad idea to call count on every element of the list. An efficient O(n log(n)) way to do that would be:

nb_occ={}
for x in data:
    nb_occ[x]=nb_occ.get(x,0)+1
data.sort(key=nb_occ.__getitem__)

EDIT: See the answer from juanpa.arrivillaga, this behaviour is actually documented in the sort method documentation.

5 Comments

Yes, this is actually documented: docs.python.org/3/library/stdtypes.html#list.sort
Indeed ! Thanks for the information. I upvoted your answer but unfortunately I do not have enough reputation for it to be taken into account ;)
I would ad as a comment that the documentation emphasises that the advantage of list.sort() over sorted() is when the list is big and memory might be a concern, and there's no need for the initial list - because sorted() creates a new list <object> (beside the initial list <object>) with the same number of elements and with the same (shallow) occupied memory and list.sorted() mutates the initial list <object>.
It's not just O(n² log(n)) but O(n²). First O(n²) to get the key values, then O(n log n) to sort. Together O(n²).
@KellyBundy indeed you are perfectly right, it would be O(n² log(n)) if the key function was applied for every comparison, which is not the case, I will correct it in my answer.
0
data = [22, 11, 33, 99, 88, 77, 22, 44, 55, 44, 66, 22]
data.sort()
a,s,z,p=[],[],[],{}
for i in data:
    if i not in s:
        s.append(i)
        t=data.count(i)
        a.append(t)
for i in range(len(a)):
    p[s[i]]=a[i]
for u,m in sorted(p.items(),key=lambda x: x[1]):
    z.append(u)
print(z)

4 Comments

Would you, please, add and explanation!
Actually what I have done is that I have created 2 lists that is a and s in which I am appending the data list members and their frequency. Then afterwards I am adding these lists(a and s) as key-value pair to dict(p). now the list which has come isn't sorted so now I have used builtin sorted function to sort the dict according to values.
Thanks for the explanation! Is this an efficient way to do the by-frequency sort (compared to other examples below and above here) or is it an example?
I can't say that it is the efficient way but still....it is a good way to solve problems like this as, beginner can easily understand this.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.