I have a dictionary whose keys follow the string patterns informationGain_$index$ and threshold_$index$. My goal is to retrieve the maximum informationGain_$index$ together with its matching threshold_$index$.
An example dictionary looks like this:
{'informationGain_0': 0.9949486404805016, 'threshold_0': 5.0, 'informationGain_1': 0.9757921620455572, 'threshold_1': 12.5, 'informationGain_2': 0.7272727272727273, 'threshold_2': 11.5, 'informationGain_3': 0.5509775004326937, 'threshold_3': 8.6, 'informationGain_4': 0.9838614413637048, 'threshold_4': 7.0, 'informationGain_5': 0.9512050593046015, 'threshold_5': 6.0, 'informationGain_6': 0.8013772106338303, 'threshold_6': 5.9, 'informationGain_7': 0.9182958340544896, 'threshold_7': 1.5, 'informationGain_8': 0.0, 'threshold_8': 9.0, 'informationGain_9': 0.6887218755408672, 'threshold_9': 7.8, 'informationGain_10': 0.9182958340544896, 'threshold_10': 2.1, 'informationGain_11': 0.0, 'threshold_11': 13.5}
I wrote the following code to generate the dataset:
def entropy_discretization(s):
    I = {}
    i = 0
    while uniqueValue(s):
        # Step 1: pick a threshold
        threshold = s['A'].iloc[0]
        # Step 2: partition the data set into two partitions
        s1 = s[s['A'] < threshold]
        print("s1 after splitting")
        print(s1)
        print("******************")
        s2 = s[s['A'] >= threshold]
        print("s2 after splitting")
        print(s2)
        print("******************")
        # Step 3: calculate the information gain
        informationGain = information_gain(s1, s2, s)
        I.update({f'informationGain_{i}': informationGain, f'threshold_{i}': threshold})
        print(f'added informationGain_{i}: {informationGain}, threshold_{i}: {threshold}')
        # Drop the rows that match the threshold just used, then move to the next index
        s = s[s['A'] != threshold]
        i += 1
    print(I)
Given the example dataset, the maximum information gain is associated with threshold_0 and informationGain_0. I would like a general way of identifying these key-value pairs in the dictionary. Is there a way to search the dictionary so that it returns the informationGain_* and threshold_* pair for which informationGain_* is the maximum?
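For reference, one possible way to do this lookup on the flat dictionary as it stands is sketched below. The helper name best_split is hypothetical, and the sketch assumes every informationGain_<index> key has a matching threshold_<index> key. The sample dictionary holds only the first three pairs from the example above.

```python
def best_split(results):
    """Return (index, information gain, threshold) for the largest gain.

    best_split is a hypothetical helper name. It assumes every
    informationGain_<index> key has a matching threshold_<index> key.
    """
    # Pull the integer suffix out of each informationGain_<index> key.
    indices = [
        int(key.split("_", 1)[1])
        for key in results
        if key.startswith("informationGain_")
    ]
    # max() with a key function compares the gains but returns the index.
    # On ties it returns the first index encountered; on an empty
    # dictionary it raises ValueError.
    best = max(indices, key=lambda i: results[f"informationGain_{i}"])
    return best, results[f"informationGain_{best}"], results[f"threshold_{best}"]


# First three pairs from the example dictionary.
sample = {
    'informationGain_0': 0.9949486404805016, 'threshold_0': 5.0,
    'informationGain_1': 0.9757921620455572, 'threshold_1': 12.5,
    'informationGain_2': 0.7272727272727273, 'threshold_2': 11.5,
}

index, gain, threshold = best_split(sample)
print(index, gain, threshold)
```

Parsing the index out of the key works, but it ties the lookup to the key naming scheme, which is the main drawback of the flat layout.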
set of namedtuples? Or just 2 parallel dicts where the key is just the index, or just a list of namedtuples if all indices will exist, or even a list of dicts that only have the keys informationGain and threshold if you don't like named tuples. All of those representations make this task a lot easier.
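A minimal sketch of the list-of-namedtuples variant, assuming all indices exist so that the list position doubles as the index. The class name Split and its field names are hypothetical, and the values are the first three pairs from the example dictionary.

```python
from typing import NamedTuple


class Split(NamedTuple):
    # Hypothetical record type: one entry per candidate threshold.
    information_gain: float
    threshold: float


# Inside the loop, this would replace the I.update(...) call, e.g.
# splits.append(Split(informationGain, threshold)).
splits = [
    Split(0.9949486404805016, 5.0),
    Split(0.9757921620455572, 12.5),
    Split(0.7272727272727273, 11.5),
]

# The gain and its threshold travel together, so no key parsing is needed.
best = max(splits, key=lambda split: split.information_gain)
best_index = splits.index(best)
print(best_index, best.information_gain, best.threshold)
```

With this layout, finding the best pair is a single max() call, and the index is simply the position in the list.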