Given a DataFrame like the one below:
     id  days  cluster
0   aaa     0        0
1   bbb     0        0
2   ccc     0        1
3   ddd     0        1
4   eee     0        0
5   fff     0        1
6   ggg     1        0
7   hhh     1        1
8   iii     1        0
9   lll     1        1
10  mmm     1        1
11  aaa     1        3
12  bbb     1        3
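For reproducibility, the frame above can be rebuilt with the snippet below. The variable name `df` matches the code later in the question; the column order (`id`, `days`, `cluster`) is assumed from the printout.

```python
import pandas as pd

# Rebuild the example frame exactly as printed above.
df = pd.DataFrame({
    "id": ["aaa", "bbb", "ccc", "ddd", "eee", "fff",
           "ggg", "hhh", "iii", "lll", "mmm", "aaa", "bbb"],
    "days": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    "cluster": [0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 3, 3],
})
```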
My aim is to create a dictionary whose keys are tuples of two elements of the id column and whose values are lists of cluster values. A pair of ids gets an entry when both ids have the same cluster value on the same day (the rows are filtered by the days column). If the day changes and the same pair of ids again shares a cluster value, I want to append that value to the pair's existing list. The desired output is shown below:
{('aaa', 'bbb'): [0, 3], ('aaa', 'eee'): [0], ('bbb', 'eee'): [0], ('ccc', 'ddd'): [1],
('ccc', 'fff'): [1], ('ddd', 'fff'): [1], ('ggg', 'iii'): [0],
('hhh', 'lll'): [1], ('hhh', 'mmm'): [1], ('lll', 'mmm'): [1]}
I obtained this result with the following snippet of code, but with millions of rows it becomes too slow. How can I optimize it?
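y = {}
# Process one day at a time.
for i in range(0, max(df.iloc[:, 1]) + 1):
    x = df.loc[df['days'] == i]
    # Compare every row j with every row z of the same day.
    for j in range(0, len(x)):
        for z in range(1, len(x)):
            # Skip the pair if it is already stored in reversed order.
            if (x.iloc[z, 0], x.iloc[j, 0]) in y:
                pass
            else:
                if (x.iloc[j, 0], x.iloc[z, 0]) not in y:
                    # New pair: start its list with the shared cluster value.
                    if x.iloc[j, 0] != x.iloc[z, 0] and x.iloc[j, 2] == x.iloc[z, 2]:
                        y[(x.iloc[j, 0], x.iloc[z, 0])] = [x.iloc[j, 2]]
                else:
                    # Known pair: append the cluster value shared on this day.
                    if x.iloc[j, 0] != x.iloc[z, 0] and x.iloc[j, 2] == x.iloc[z, 2]:
                        y[(x.iloc[j, 0], x.iloc[z, 0])].append(x.iloc[j, 2])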
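One possible direction, offered as a sketch rather than a benchmarked solution: since two ids only pair up when they share both `days` and `cluster`, the rows can be grouped on those two columns and the pairs generated inside each group with `itertools.combinations`, which avoids the repeated `iloc` lookups. The helper name `pairs_by_cluster` is hypothetical. It assumes that sorting the ids inside each pair is acceptable (the loop above keeps them in order of first appearance) and that duplicate ids within one group should be counted once.

```python
from collections import defaultdict
from itertools import combinations

import pandas as pd


def pairs_by_cluster(df):
    """Map each sorted id pair to the cluster values the two ids share, day by day."""
    result = defaultdict(list)
    # groupby sorts its keys by default, so groups arrive in ascending
    # (days, cluster) order and each list is filled in day order.
    for (_, cluster), ids in df.groupby(["days", "cluster"])["id"]:
        # Assumption: ids inside a pair are sorted, duplicates are dropped.
        for pair in combinations(sorted(set(ids)), 2):
            result[pair].append(cluster)
    return dict(result)


# Small usage example on three rows of day 0, cluster 0.
demo = pd.DataFrame({
    "id": ["aaa", "bbb", "eee"],
    "days": [0, 0, 0],
    "cluster": [0, 0, 0],
})
demo_pairs = pairs_by_cluster(demo)
```

The number of pairs still grows quadratically with the size of each (days, cluster) group, so very large groups will remain expensive no matter how the pairs are generated.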