
I have the following DataFrame:

import pandas as pd

data = {'id': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
        'location':['Milan', 'Paris', 'New York', 'Rome', 'Los Angeles', 'Berlin', 'Madrid'],
        'year': [2003,2004,2005, 2003, 2004, 2004, 2005]}

data = pd.DataFrame(data)

For each group in groupby('id'), I want the combinations between the city at year t and all the cities of the same id at years t-1, t-2, ..., t-n.

The desired output:

data = {'id': ['A', 'A', 'A', 'A',
               'B', 'B', 'B', 'B', 'B', 'B'],
        'location':['Milan', 'Paris', 'New York', 'New York',
                    'Rome', 'Los Angeles', 'Berlin', 'Madrid','Madrid', 'Madrid'],
        'year': [2003, 2004, 2005, 2005,
                 2003, 2004, 2004, 2005, 2005, 2005],
       'comb': ['NaN', 'Milan', 'Milan','Paris',  
                'NaN', 'Rome', 'Rome', 'Rome','Los Angeles', 'Berlin']}

data = pd.DataFrame(data)

3 Answers


Self-merge, then query:

N = 2
# all pairs of rows that share an id, then keep partners from the previous N years
out = (data.merge(data, on='id', suffixes=['', '_comb'])
           .query('0 < year - year_comb <= @N')   # @N refers to the local variable N
      )

Output:

   id     location  year location_comb  year_comb
3   A        Paris  2004         Milan       2003
6   A     New York  2005         Milan       2003
7   A     New York  2005         Paris       2004
13  B  Los Angeles  2004          Rome       2003
17  B       Berlin  2004          Rome       2003
21  B       Madrid  2005          Rome       2003
22  B       Madrid  2005   Los Angeles       2004
23  B       Madrid  2005        Berlin       2004

Note: the above does not contain the first location for each id, which can be obtained with data.sort_values('year').drop_duplicates('id'). So your final output would be:

out = pd.concat([data.merge(data, on='id', suffixes=['', '_comb'])
                     .query('0 < year - year_comb <= @N'),
                 data.sort_values('year').drop_duplicates('id')])
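
If you also want the column names to match the desired output, a small cleanup step can be bolted onto out from the concat above. This is a minimal sketch (the final name is just illustrative): rows coming from drop_duplicates have no location_comb, so comb ends up NaN for them.

final = (out.rename(columns={'location_comb': 'comb'})   # match the asker's 'comb' column
            .drop(columns=['year_comb'])                 # helper column no longer needed
            .sort_values(['id', 'year'])
            .reset_index(drop=True))
print(final)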

Use a dictionary to get the locations by year:

import pandas as pd

data = {'id': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
        'location': ['Milan', 'Paris', 'New York', 'Rome', 'Los Angeles', 'Berlin', 'Madrid'],
        'year': [2003, 2004, 2005, 2003, 2004, 2004, 2005]}

df = pd.DataFrame(data)
print(df)

# collect the unique locations seen in each year (across both ids)
locations_by_year = {}
for year in df['year'].unique():
    locations_by_year[year] = df[df['year'] == year]['location'].unique()

print(locations_by_year)

Output:

{2003: array(['Milan', 'Rome'], dtype=object), 2004: array(['Paris', 'Los Angeles', 'Berlin'], dtype=object), 2005: array(['New York', 'Madrid'], dtype=object)}

Grouping the DataFrame by location:

# inspect each location's rows
df_grouped = df.groupby('location')
for name, group in df_grouped:
    print(name)
    print(group)
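
The snippets above only collect locations per year; they do not yet build the comb column the question asks for. As a minimal sketch of how the dictionary idea could be carried through to the desired output, one could build the year-to-locations mapping per id and pair each row with the locations of earlier years (the rows and locs_by_year names below are illustrative, not part of the original answer):

rows = []
for id_, grp in df.groupby('id'):
    # year -> list of locations for this id only
    locs_by_year = grp.groupby('year')['location'].apply(list).to_dict()
    for _, row in grp.iterrows():
        # all locations of this id from strictly earlier years
        earlier = [loc for y, locs in locs_by_year.items() if y < row['year'] for loc in locs]
        if not earlier:
            rows.append({'id': id_, 'location': row['location'], 'year': row['year'], 'comb': float('nan')})
        else:
            for loc in earlier:
                rows.append({'id': id_, 'location': row['location'], 'year': row['year'], 'comb': loc})

result = pd.DataFrame(rows)
print(result)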


Generate the full Cartesian product within each id (all pairs of rows of the original DataFrame that share an id), then filter with df.year_comb < df.year. This also drops the rows for the first year of each id; those can be re-appended to produce the rows with NaN values in the desired output.

df = (pd.merge(data, data.rename(columns={"location": "comb", "year": "year_comb"}), on=["id"])
        .loc[lambda df: df.year_comb < df.year]   # keep only earlier-year partners
        .drop(["year_comb"], axis=1))
# re-append the first year of each id (their 'comb' becomes NaN)
data_first_years = data.sort_values(["year"]).groupby("id").first().reset_index()
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
out = (pd.concat([df, data_first_years])
         .sort_values(["id", "year"])
         .reset_index(drop=True))
print(out)


# out:
  id     location  year         comb
0  A        Milan  2003          NaN
1  A        Paris  2004        Milan
2  A     New York  2005        Milan
3  A     New York  2005        Paris
4  B         Rome  2003          NaN
5  B  Los Angeles  2004         Rome
6  B       Berlin  2004         Rome
7  B       Madrid  2005         Rome
8  B       Madrid  2005  Los Angeles
9  B       Madrid  2005       Berlin
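
If the lookback should be capped at n years rather than reaching all the way back, the same filter can be tightened before dropping year_comb. A minimal sketch, assuming N is the desired window (mirroring the N parameter from the first answer; df_n is just an illustrative name):

# limit partners to the previous N years instead of all earlier years
N = 2
df_n = (pd.merge(data, data.rename(columns={"location": "comb", "year": "year_comb"}), on=["id"])
          .loc[lambda d: (d.year - d.year_comb > 0) & (d.year - d.year_comb <= N)]
          .drop(["year_comb"], axis=1))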
