I have a dataframe with 3 columns: equivalences, class, ch. I am using Python.

equivalences                             class                                              ch

ETICA CONTABIL                           A ÉTICA CONTÁBIL                                   40.0
ETICA CONTABIL                           A ÉTICA CONTÁBIL COM ENFOQUE                       40.0
BANCO DE DADOS                           GERENCIANDO SEU BD                                 40.0
AMBIENTE WEB                             APLICAÇÕES EM NUVENS                               40.0
AMBIENTE WEB                             ALTA DISPONIBILIDADE                               40.0
TECNOLOGIAS WEB                          PÁGINAS PARA INTERNET                              40.0
TECNOLOGIAS WEB                          PROGRAMAÇÃO WEB AVANÇADA                           40.0
TECNOLOGIAS WEB                          DESENVOLVENDO COM JS                               40.0
None                                     PROGRAMAÇÃO WEB                                    40.0

I need to get every pair of classes that share the same equivalences value, summing the ch of each pair. It should be something like this:

equivalences      class a                   class b                                  ch

ETICA CONTABIL    A ÉTICA CONTÁBIL          A ÉTICA CONTÁBIL COM ENFOQUE            80.0
BANCO DE DADOS    GERENCIANDO SEU BD        (null)                                  40.0
AMBIENTE WEB      APLICAÇÕES EM NUVENS      ALTA DISPONIBILIDADE                    80.0
TECNOLOGIAS WEB   PÁGINAS PARA INTERNET     PROGRAMAÇÃO WEB AVANÇADA                80.0
TECNOLOGIAS WEB   PÁGINAS PARA INTERNET     DESENVOLVENDO COM JS                    80.0
TECNOLOGIAS WEB   PROGRAMAÇÃO WEB AVANÇADA  DESENVOLVENDO COM JS                    80.0
(null)            PROGRAMAÇÃO WEB           (null)                                  40.0

I think I would have to use itertools.combinations, but I have no clue how to group by equivalences to get distinct pairs. How can I do that?

  • The last row and the row with "BANCO DE DADOS" are not part of a pair. What's the exact logic for these cases? Commented Jul 21, 2020 at 20:22
  • The last row and the row with "BANCO DE DADOS" have no equivalence between class a and class b. By the way, these cases can be excluded. Commented Jul 21, 2020 at 20:27
  • Excluded - do you mean dropped from the results? Commented Jul 21, 2020 at 20:28
  • Yes, they don't really matter since they have no equivalence. I thought of leaving them in the dataset to check cases of wrong registers, like "technologies - 1st period", "technologies" and "technologies - 2nd", which are probably the same equivalence, but I will deal with those situations after solving this first part. Commented Jul 22, 2020 at 2:18

2 Answers

Here's a solution (in a few steps for clarity):

import pandas as pd

# create a cross product of classes per "equivalences"
t = pd.merge(df.assign(dummy=1), df.assign(dummy=1),
             on=["dummy", "equivalences"])

# drop items in which the left and the right class are identical
t = t[t.class_x != t.class_y]

# put each pair in alphabetical order (swap the columns where class_x > class_y),
# then drop duplicates such as x,y vs y,x
swap = t.class_x > t.class_y
t.loc[swap, ["class_x", "class_y"]] = t.loc[swap, ["class_y", "class_x"]].to_numpy()
t = t.drop_duplicates(subset=["equivalences", "class_x", "class_y"])


# sum the ch of both classes in each pair
t["ch"] = t.ch_x + t.ch_y
res = t.drop(["ch_x", "dummy", "ch_y"], axis=1)
print(res) 

==>

       equivalences                   class_x                       class_y    ch
1    ETICA CONTABIL          A ÉTICA CONTÁBIL  A ÉTICA CONTÁBIL COM ENFOQUE  80.0
6      AMBIENTE WEB      ALTA DISPONIBILIDADE          APLICAÇÕES EM NUVENS  80.0
10  TECNOLOGIAS WEB  PROGRAMAÇÃO WEB AVANÇADA         PÁGINAS PARA INTERNET  80.0
11  TECNOLOGIAS WEB      DESENVOLVENDO COM JS         PÁGINAS PARA INTERNET  80.0
14  TECNOLOGIAS WEB      DESENVOLVENDO COM JS      PROGRAMAÇÃO WEB AVANÇADA  80.0
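
The cross-product idea above can probably be written a little more compactly. The following is only a sketch, not the answer's own code: it assumes class names are unique within each equivalences group, it rebuilds a small sample frame (the names `pairs`, `class_a` and `class_b` are illustrative), and it merges on the group key alone, keeping only rows where the left class sorts before the right one. That single filter removes both the self-pairs and the mirrored x,y / y,x duplicates.

```python
import pandas as pd

# Small sample mirroring part of the question's data (ch is 40.0 everywhere).
df = pd.DataFrame({
    "equivalences": ["ETICA CONTABIL", "ETICA CONTABIL", "BANCO DE DADOS",
                     "TECNOLOGIAS WEB", "TECNOLOGIAS WEB", "TECNOLOGIAS WEB"],
    "class": ["A ÉTICA CONTÁBIL", "A ÉTICA CONTÁBIL COM ENFOQUE",
              "GERENCIANDO SEU BD", "PÁGINAS PARA INTERNET",
              "PROGRAMAÇÃO WEB AVANÇADA", "DESENVOLVENDO COM JS"],
    "ch": [40.0] * 6,
})

# Self-merge on the group key: each class meets every class of its own group.
pairs = df.merge(df, on="equivalences", suffixes=("_a", "_b"))

# Keep each unordered pair once; this also drops the (x, x) self-pairs.
pairs = pairs[pairs["class_a"] < pairs["class_b"]].copy()

# Sum the ch of both classes and tidy up the helper columns.
pairs["ch"] = pairs["ch_a"] + pairs["ch_b"]
pairs = pairs.drop(columns=["ch_a", "ch_b"]).reset_index(drop=True)
print(pairs)
```

Groups with a single class, such as BANCO DE DADOS, produce no rows here, which matches the asker's comment that those cases can be excluded.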

2 Comments

Wow. That's an amazing solution. It doesn't give me the duplicate pairs. Thank you so much !
Thanks:) Do you mind accepting the answer (clicking the grey checkmark and turning it to green) for future generations?
Let's assume df is your dataframe. First, use itertools to build the pair combinations in a separate dataframe called pairs:

import itertools

import pandas as pd

# one row per equivalences value, holding the array of its distinct classes
pairs = df.groupby('equivalences')['class'].unique().to_frame()
func = lambda x: list(itertools.combinations(x, 2)) if len(x) > 1 else x
pairs['combinations'] = pairs['class'].map(func)

Then use a nested for loop over each equivalences value and its class pairs to build the output rows:

records = []
for i in pairs.index:
    for j in pairs.loc[i, 'combinations']:
        if isinstance(j, tuple):
            records.append(
                {
                    'equivalences': i,
                    'class a': j[0],
                    'class b': j[1],
                    'ch': df.loc[(df['equivalences'] == i) & (df['class'].isin(j)), 'ch'].sum()
                }
            )
        else:
            records.append(
                {
                    'equivalences': i,
                    'class a': j,
                    'class b': 'null',
                    'ch': df.loc[(df['equivalences'] == i) & (df['class'] == j), 'ch'].sum()
                }
            )
            
    
pd.DataFrame(records)

Output:

   equivalences     class a                   class b                       ch
0  AMBIENTE WEB     APLICAÇÕES EM NUVENS      ALTA DISPONIBILIDADE          80
1  BANCO DE DADOS   GERENCIANDO SEU BD        null                          40
2  ETICA CONTABIL   A ÉTICA CONTÁBIL          A ÉTICA CONTÁBIL COM ENFOQUE  80
3  TECNOLOGIAS WEB  PÁGINAS PARA INTERNET     PROGRAMAÇÃO WEB AVANÇADA      80
4  TECNOLOGIAS WEB  PÁGINAS PARA INTERNET     DESENVOLVENDO COM JS          80
5  TECNOLOGIAS WEB  PROGRAMAÇÃO WEB AVANÇADA  DESENVOLVENDO COM JS          80
6  null             PROGRAMAÇÃO WEB           null                          40
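
If the single-class rows are not needed (as the asker's comments suggest), the loop above might be shortened by iterating over the groups directly and combining whole rows, so that ch does not have to be looked up again in df. This is a sketch under that assumption rather than a replacement for the code above; the sample frame and the `result` name are illustrative.

```python
import itertools

import pandas as pd

# Small sample mirroring part of the question's data.
df = pd.DataFrame({
    "equivalences": ["AMBIENTE WEB", "AMBIENTE WEB", "BANCO DE DADOS",
                     "TECNOLOGIAS WEB", "TECNOLOGIAS WEB", "TECNOLOGIAS WEB",
                     None],
    "class": ["APLICAÇÕES EM NUVENS", "ALTA DISPONIBILIDADE",
              "GERENCIANDO SEU BD", "PÁGINAS PARA INTERNET",
              "PROGRAMAÇÃO WEB AVANÇADA", "DESENVOLVENDO COM JS",
              "PROGRAMAÇÃO WEB"],
    "ch": [40.0] * 7,
})

records = []
# groupby yields (key, sub-frame) tuples; rows with a missing key are skipped.
for equivalence, group in df.groupby("equivalences"):
    rows = list(zip(group["class"], group["ch"]))
    # combinations(rows, 2) yields each unordered pair of rows exactly once
    # and yields nothing for groups that contain a single row.
    for (class_a, ch_a), (class_b, ch_b) in itertools.combinations(rows, 2):
        records.append({
            "equivalences": equivalence,
            "class a": class_a,
            "class b": class_b,
            "ch": ch_a + ch_b,
        })

result = pd.DataFrame(records, columns=["equivalences", "class a", "class b", "ch"])
print(result)
```

Because the pairs are built from rows rather than from unique class names, a class that appears twice within one group would be paired twice; deduplicate first if that can happen in the real data.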

On another note, don't forget to convert your None values to a string (or any value other than None) before applying groupby, because by default pandas groupby drops rows whose key is None. You can always convert the placeholder strings back to real None values when you are done.
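
To illustrate the point about missing keys, here is a small sketch with an illustrative three-row frame. It shows the default behaviour, the placeholder-string workaround described above, and, as an alternative that assumes pandas 1.1 or newer, the `dropna=False` argument of `groupby`.

```python
import pandas as pd

df = pd.DataFrame({
    "equivalences": ["AMBIENTE WEB", "AMBIENTE WEB", None],
    "class": ["APLICAÇÕES EM NUVENS", "ALTA DISPONIBILIDADE", "PROGRAMAÇÃO WEB"],
    "ch": [40.0, 40.0, 40.0],
})

# Default behaviour: the row whose key is None is silently left out.
default_groups = df.groupby("equivalences")["ch"].sum()

# Workaround: replace missing keys with a placeholder string before grouping.
filled = df.assign(equivalences=df["equivalences"].fillna("null"))
filled_groups = filled.groupby("equivalences")["ch"].sum()

# Alternative (assumes pandas >= 1.1): keep missing keys as their own group.
kept_groups = df.groupby("equivalences", dropna=False)["ch"].sum()

print(default_groups, filled_groups, kept_groups, sep="\n\n")
```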

2 Comments

Thanks, it worked! It gives me duplicate pairs, but I can deal with that fine.
No problem. I am not sure this answer generates duplicate pairs, though; can you please check again? I used itertools.combinations precisely to avoid duplicates, and the output shown in the answer matches the expected output in your question. Anyway, please do not forget to upvote if you think the answer works.
