1

I have a dictionary whose keys follow the string patterns informationGain_$index$ and threshold_$index$. My goal is to retrieve the maximum informationGain_$index$ together with its associated threshold_$index$.

An example dictionary looks like so:

{'informationGain_0': 0.9949486404805016, 'threshold_0': 5.0, 'informationGain_1': 0.9757921620455572, 'threshold_1': 12.5, 'informationGain_2': 0.7272727272727273, 'threshold_2': 11.5, 'informationGain_3': 0.5509775004326937, 'threshold_3': 8.6, 'informationGain_4': 0.9838614413637048, 'threshold_4': 7.0, 'informationGain_5': 0.9512050593046015, 'threshold_5': 6.0, 'informationGain_6': 0.8013772106338303, 'threshold_6': 5.9, 'informationGain_7': 0.9182958340544896, 'threshold_7': 1.5, 'informationGain_8': 0.0, 'threshold_8': 9.0, 'informationGain_9': 0.6887218755408672, 'threshold_9': 7.8, 'informationGain_10': 0.9182958340544896, 'threshold_10': 2.1, 'informationGain_11': 0.0, 'threshold_11': 13.5}

I wrote the following code to generate the dictionary.

def entropy_discretization(s):

    I = {}
    i = 0
    while(uniqueValue(s)):
        # Step 1: pick a threshold
        threshold = s['A'].iloc[0]

        # Step 2: Partition the data set into two partitions
        s1 = s[s['A'] < threshold]
        print("s1 after splitting")
        print(s1)
        print("******************")
        s2 = s[s['A'] >= threshold]
        print("s2 after splitting")
        print(s2)
        print("******************")
            
        # Step 3: calculate the information gain.
        informationGain = information_gain(s1,s2,s)
        I.update({f'informationGain_{i}':informationGain,f'threshold_{i}': threshold})
        print(f'added informationGain_{i}: {informationGain}, threshold_{i}: {threshold}')
        s = s[s['A'] != threshold]
        i += 1

    print(I)

Given the example dictionary, the maximum information gain is associated with threshold_0 and informationGain_0. I would like to find a general way of identifying these key-value pairs. Is there a way to search the dictionary so that I can return informationGain_* and threshold_* where informationGain_* is the maximum?

1
  • Is there any particular reason you are structuring your data like this instead of using, say, a set of namedtuples? Or just 2 parallel dicts where the key is just the index, or just a list of namedtuples if all indices will exist, or even a list of dicts that only have the keys informationGain and threshold if you don't like named tuples. All of those representations make this task a lot easier. Commented Oct 17, 2021 at 21:44
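A minimal sketch of the list-of-namedtuples layout suggested in the comment above. The `Split` type and its field names are hypothetical, chosen only for illustration, and the values are the first three pairs from the example dictionary.

```python
from collections import namedtuple

# Hypothetical record type; the name and fields are illustrative only.
Split = namedtuple("Split", ["information_gain", "threshold"])

# One record per candidate threshold; the list position plays the role of $index$.
splits = [
    Split(0.9949486404805016, 5.0),
    Split(0.9757921620455572, 12.5),
    Split(0.7272727272727273, 11.5),
]

# max() with a key function returns the record with the largest information gain.
best = max(splits, key=lambda s: s.information_gain)
best_index = splits.index(best)
print(best_index, best.information_gain, best.threshold)
```

With this layout the gain and its threshold travel together, so no key parsing is needed to recover the matching threshold.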

4 Answers

2

Here is a solution using a custom key with max. It works even if the dictionary is not sorted. This is assuming the input dictionary is named d.

M = max((k for k in d if k.startswith('i')),
        key=lambda x: d[x])
T = f'threshold_{M.rsplit("_")[-1]}'
out = {M: d[M], T: d[T]}

Output:

{'informationGain_0': 0.9949486404805016, 'threshold_0': 5.0}

NB. I used a simple test on the dictionary keys, checking which ones start with i, to identify the informationGain_X keys. If you have a more complex real-life dictionary, you might want to use a full match or some other test that identifies the keys unambiguously.
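For a stricter key test, one possible variant uses `re.fullmatch` instead of `startswith('i')`. This is only a sketch: the sample dictionary is shortened, the `"info"` entry is a hypothetical unrelated key added to show what the looser test would wrongly accept, and every informationGain_X key is assumed to have a matching threshold_X key.

```python
import re

d = {
    "informationGain_0": 0.9949486404805016, "threshold_0": 5.0,
    "informationGain_1": 0.9757921620455572, "threshold_1": 12.5,
    "info": 99.0,  # hypothetical unrelated key; startswith('i') would accept it
}

pattern = re.compile(r"informationGain_(\d+)")

# Keep only keys that match the whole pattern, mapping index -> gain.
matches = (pattern.fullmatch(k) for k in d)
gains = {m.group(1): d[m.group(0)] for m in matches if m}

# Index (as a string) of the largest information gain.
best = max(gains, key=gains.get)

# Assumes threshold_<best> exists in d.
out = {
    f"informationGain_{best}": gains[best],
    f"threshold_{best}": d[f"threshold_{best}"],
}
print(out)
```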


2 Comments

+1 this is a great answer. I have one small gripe which is that k.startswith('i') is a relatively weak conditional. I think regex would be a more appropriate choice: r"informationGain_(\d*)".
@ddejohn I hesitated to use a longer string and decided on the simplest solution given OP's data. Also, a simpler conditional means faster code. But I'll add a comment on that.
1

I've also found a way of doing this. It just took a few tries.

    n = len(I) // 2  # number of (informationGain, threshold) pairs
    print("Calculating maximum threshold")
    print("*****************************")
    maxInformationGain = float('-inf')  # so that a gain of 0 can still be selected
    maxThreshold       = None
    for i in range(n):  # covers every index, including the last one
        if I[f'informationGain_{i}'] > maxInformationGain:
            maxInformationGain = I[f'informationGain_{i}']
            maxThreshold       = I[f'threshold_{i}']

    print(f'maxThreshold: {maxThreshold}, maxInformationGain: {maxInformationGain}')


0

One way to do this is as follows:

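Assuming your dictionary is named d:

values = list(d.values())
# Pair each information gain with the threshold stored right after it,
# then pick the pair with the largest gain.
informationGain_max, threshold_max = max(zip(values[::2], values[1::2]), key=lambda pair: pair[0])

This only works because the standard dict preserves insertion order (guaranteed since Python 3.7, and an implementation detail of CPython 3.6), and it assumes the keys were inserted as alternating informationGain/threshold pairs.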

2 Comments

I'd significantly prefer a solution that uses filter, sort, or a list comprehension with a condition. Making a list and slicing seems like such a bad idea with something that is logically an unordered set.
@TadhgMcDonald-Jensen I agree. I think that even using ordered items like informationGain_i and threshold_i in a dict is already a bad idea, because elements in a dict are not supposed to be ordered.
-1

Let's make a list in which each member is a tuple or list containing two elements: first the information gain, then the threshold. We can sort this list with either the list's .sort() method or the sorted() function. The last tuple of the sorted list will contain the values you seek. If you are also interested in the index of these values, add the index as a third element of each tuple.
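The approach described above could look roughly like this. It is a sketch only: the dictionary `d` holds the first three pairs from the question, and the code assumes every index from 0 to n-1 has both an informationGain_ and a threshold_ key.

```python
d = {
    "informationGain_0": 0.9949486404805016, "threshold_0": 5.0,
    "informationGain_1": 0.9757921620455572, "threshold_1": 12.5,
    "informationGain_2": 0.7272727272727273, "threshold_2": 11.5,
}

n = len(d) // 2  # assumes two keys per index

# Build (gain, threshold, index) tuples so the gain drives the sort order.
pairs = [(d[f"informationGain_{i}"], d[f"threshold_{i}"], i) for i in range(n)]

# Tuples compare element by element, so this sorts by information gain first.
pairs.sort()

# The last tuple has the largest gain.
gain, threshold, index = pairs[-1]
print(index, gain, threshold)
```

Note that when two gains are equal, the tuple comparison falls back to the threshold and then the index, so the pair with the larger threshold ends up last.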
