I have a dataframe df

ID                                active_seconds  domain        subdomain     search_engine  search_term
0120bc30e78ba5582617a9f3d6dfd8ca  35              vk.com        vk.com        None           None
0120bc30e78ba5582617a9f3d6dfd8ca  54              vk.com        vk.com        None           None
0120bc30e78ba5582617a9f3d6dfd8ca  34              vk.com        vk.com        None           None
16c28c057720ab9fbbb5ee53357eadb7  4               facebook.com  facebook.com  None           None
16c28c057720ab9fbbb5ee53357eadb7  4               facebook.com  facebook.com  None           None
16c28c057720ab9fbbb5ee53357eadb7  8               facebook.com  facebook.com  None           None
0120bc30e78ba5582617a9f3d6dfd8ca  16              megarand.ru   megarand.ru   None           None
0120bc30e78ba5582617a9f3d6dfd8ca  6               vk.com        vk.com        None           None

I need to change df. If, for the same ID, subdomain[i] == subdomain[i-1], I want to merge those rows into one and sum their active_seconds (active_seconds[i-1] + active_seconds[i]). From this df I want to get

ID                                active_seconds  domain        subdomain     search_engine  search_term
0120bc30e78ba5582617a9f3d6dfd8ca  123             vk.com        vk.com        None           None
16c28c057720ab9fbbb5ee53357eadb7  16              facebook.com  facebook.com  None           None
0120bc30e78ba5582617a9f3d6dfd8ca  16              megarand.ru   megarand.ru   None           None
0120bc30e78ba5582617a9f3d6dfd8ca  6               vk.com        vk.com        None           None

What should I use to do this?

  • Why weren't the last two lines joined together? Commented Jul 20, 2016 at 10:09
  • @unutbu because domain[i] != domain[i-1] Commented Jul 20, 2016 at 10:13

1 Answer

This gets really close. I'm not sure whether getting the row order right is important to you.

Also, I assumed that I should group by ID. This means that if the rows of one ID are interrupted by rows of another ID but stay in the same subdomain, their active_seconds are still aggregated together.

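def proc_id(df):
    # True wherever the subdomain differs from the previous row (start of a new run)
    cond = df.subdomain != df.subdomain.shift()
    # Running count of run starts: rows in the same run share one label
    part = cond.cumsum()
    # Keep the first row of each run as its representative
    df_ = df.groupby(part).first()
    # Replace active_seconds with the total over the run
    df_['active_seconds'] = df.groupby(part).active_seconds.sum()
    return df_

# Process each ID separately, then discard the group index
df.groupby('ID').apply(proc_id).reset_index(drop=True)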
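
If the original row order does matter, one possible variation is to skip the per-ID groupby and instead start a new run whenever either the ID or the subdomain changes from the previous row. This is only a sketch, not the method above: it merges strictly consecutive rows, so an ID interrupted by another ID is not aggregated across the gap. The names `changed`, `run_id`, `agg` and `result` are my own, and the sample frame is rebuilt from the question.

```python
import pandas as pd

# Sample data rebuilt from the question
doms = ["vk.com"] * 3 + ["facebook.com"] * 3 + ["megarand.ru", "vk.com"]
df = pd.DataFrame({
    "ID": ["0120bc30e78ba5582617a9f3d6dfd8ca"] * 3
          + ["16c28c057720ab9fbbb5ee53357eadb7"] * 3
          + ["0120bc30e78ba5582617a9f3d6dfd8ca"] * 2,
    "active_seconds": [35, 54, 34, 4, 4, 8, 16, 6],
    "domain": doms,
    "subdomain": doms,
    "search_engine": [None] * 8,
    "search_term": [None] * 8,
})

# A new run starts when the ID or the subdomain differs from the previous row
changed = (df["ID"] != df["ID"].shift()) | (df["subdomain"] != df["subdomain"].shift())
run_id = changed.cumsum()

# Sum active_seconds within each run; take the first value of every other column
agg = {col: "first" for col in df.columns}
agg["active_seconds"] = "sum"

# sort=False keeps the runs in their order of appearance
result = df.groupby(run_id, sort=False).agg(agg).reset_index(drop=True)
print(result[["ID", "active_seconds", "subdomain"]])
```

On the sample data this gives the four rows in the order shown in the question, with active_seconds of 123, 16, 16 and 6.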


3 Comments

Can you tell me how to handle this: I have a list lst_domain = ['vkontakte.ru', 'yandex.ru', 'vk.com', 'moscow.vk.com', 'city-link.ru']. If a domain in df equals one of the entries in this list, I should keep it as is. But if the domain in df is, for example, msk.city-link.ru, I need to rewrite it the way it appears in the list, i.e. city-link.ru.
This seems to be a different question? You should ask a new question asking about this specifically.
@piRSquared Is there a one-line solution, even a complex one?
