
(Actual input CSV is comma-delimited as normal; I just showed my ideas as tables for ease of viewing.)

Here's an example of what I want to do, using Python 2.7 (pandas if it's better/easier, though I also like learning plain Python logic, and pandas skips over a lot of it; I may have to learn pandas for tasks like this):

From

Price    Name    Text      Number    Choice   URL         Email
$40      Foo     Stuff     560       Y        www.a.com   [email protected]
$60      Foo     Things    280       N        www.a.com   [email protected]
$20      Foo     Other     120       Y        www.a.com   [email protected]
$25      John    Gals      1222      N        www.b.com   [email protected]
$100     Bar     Dudes     999       Y        www.c.com   [email protected]
$250     Bar     Guys      200       Y        www.c.com   [email protected]

To

Name    Price1    Price2   Price3   Text1    Text2    Text3    Number1    Number2    Number3    Choice1    Choice2    Choice3    URL         Email
Foo     $40       $60      $20      Stuff    Things   Other    560        280        120        Y          N          Y          www.a.com   [email protected]
John    $25                         Gals                       1222                             N                                www.b.com   [email protected]
Bar     $100      $250              Dudes    Guys              999        200                   Y          Y                     www.c.com   [email protected]

The order of the columns doesn't matter, though as a rule I would like to combine rows by the Name column. (Hopefully I got the example right, as building even that by hand was a pain!)

For extra credit, I'd love to stop a blank cell from populating a new column: e.g. if [email protected] were missing from row 2 in From above, To would look the same and not spawn an "Email2" column. Also, while the order of columns doesn't matter (I'm using this to populate a database that requires CSV input), the numbering has to match up! That is, for any given name, e.g. Foo above, the values $60, Things, 280, and N all have to land in the columns named "[OrigName]2", and no Column2 should be populated while the corresponding Column1 is blank for any given label.

This should be easy, but for completeness: I also need a column that counts the filled Text columns (e.g., an integer column "Number of Texts") and another that counts the "Price" entries marked "Free" (e.g., "Number of Free Texts").

Thanks so much for any help - I'm already excited for what I'll learn from this, and further reading resources are always welcome!

3 Comments

  • Are you trying to implement an inner join? Commented Nov 6, 2013 at 2:26
  • Transpose a matrix - zip(*matrix) Commented Nov 6, 2013 at 3:09
  • More on transposing a matrix? Commented Nov 7, 2013 at 0:10

2 Answers


In [252]:

import pandas as pd
import numpy as np
import io

f = io.BytesIO("""Price    Name    Text      Number    Choice   URL         Email
40      Foo     Stuff     560       Y        www.a.com   [email protected]
60      Foo     Things    280       N        www.a.com   
20      Foo     Other     120       Y        www.a.com   [email protected]
25      John    Gals      1222      N        www.b.com   [email protected]
100     Bar     Dudes     999       Y        www.c.com   [email protected]
250     Bar     Guys      200       Y        www.c.com   [email protected]""")

# whitespace-separated here only for the inline sample; the real CSV uses the default comma delimiter
df = pd.read_csv(f, delim_whitespace=True)
print df

output:

   Price  Name    Text  Number Choice        URL    Email
0     40   Foo   Stuff     560      Y  www.a.com  [email protected]
1     60   Foo  Things     280      N  www.a.com      NaN
2     20   Foo   Other     120      Y  www.a.com  [email protected]
3     25  John    Gals    1222      N  www.b.com  [email protected]
4    100   Bar   Dudes     999      Y  www.c.com  [email protected]
5    250   Bar    Guys     200      Y  www.c.com  [email protected]

In [253]:

split_columns = ["Price", "Text", "Number", "Choice"]

def split_func(df):
    # re-index each group's rows as 1, 2, ..., n so that unstack() yields Price1, Price2, ...
    return df.set_index(np.arange(1, df.shape[0]+1))
df2 = df[split_columns].groupby(df.Name).apply(split_func).unstack()
df2.columns = [name+str(i) for name, i in df2.columns]
print df2

output:

      Price1  Price2  Price3  Text1   Text2  Text3  Number1  Number2  Number3  \
Name                                                                            
Bar      100     250     NaN  Dudes    Guys    NaN      999      200      NaN   
Foo       40      60      20  Stuff  Things  Other      560      280      120   
John      25     NaN     NaN   Gals     NaN    NaN     1222      NaN      NaN   

     Choice1 Choice2 Choice3  
Name                          
Bar        Y       Y     NaN  
Foo        Y       N       Y  
John       N     NaN     NaN  

In [245]:

unique_columns = ["URL", "Email"]

def unique_func(s):
    # take the first non-null value in the group (assumes at least one exists)
    return s.dropna().unique()[0]

df3 = df[unique_columns].groupby(df.Name).agg(unique_func)
print df3

output:

            URL    Email
Name                    
Bar   www.c.com  [email protected]
Foo   www.a.com  [email protected]
John  www.b.com  [email protected]

In [246]:

print pd.merge(df2, df3, left_index=True, right_index=True)

output:

      Price1  Price2  Price3  Text1   Text2  Text3  Number1  Number2  Number3  \
Name                                                                            
Bar      100     250     NaN  Dudes    Guys    NaN      999      200      NaN   
Foo       40      60      20  Stuff  Things  Other      560      280      120   
John      25     NaN     NaN   Gals     NaN    NaN     1222      NaN      NaN   

     Choice1 Choice2 Choice3        URL    Email  
Name                                              
Bar        Y       Y     NaN  www.c.com  [email protected]  
Foo        Y       N       Y  www.a.com  [email protected]  
John       N     NaN     NaN  www.b.com  [email protected]  
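
(Note on the IndexError mentioned in the comments below: unique_func assumes every Name group has at least one non-null value in each of the unique columns. A minimal guarded sketch that falls back to NaN when a group's column is entirely blank, keeping the rest of the approach unchanged:)

def unique_func(s):
    vals = s.dropna().unique()
    # return NaN instead of raising when the whole group is blank for this column
    return vals[0] if len(vals) else np.nan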

2 Comments

This is extremely handy. I love the step by step so that I can learn! Thank you!
I'm getting an IndexError: index out of bounds from return s.dropna().unique()[0]

Using pandas, you can view what you want as a corrupted pivot table. You can get most of the way doing something like

import pandas as pd
df = pd.read_csv("stuff.dat",sep=r"\s+")
df["ranks"] = df.reset_index().groupby("Name")["index"].rank("first")
df2 = df.pivot_table(rows=["Name", "URL", "Email"],
                     cols="ranks",
                     aggfunc=lambda x: x, fill_value='')
df2.columns = [c[0] + str(int(c[1])) for c in df2.columns.get_values()]
df2 = df2.reset_index()

which produces

>>> print df2.to_string()
   Name        URL    Email Price1 Price2 Price3  Text1   Text2  Text3 Number1 Number2 Number3 Choice1 Choice2 Choice3
0   Bar  www.c.com  [email protected]   $100   $250         Dudes    Guys            999     200               Y       Y        
1   Foo  www.a.com  [email protected]    $40    $60    $20  Stuff  Things  Other     560     280     120       Y       N       Y
2  John  www.b.com  [email protected]    $25                 Gals                   1222                       N                

There are only a few tricks here. One is getting ranks, which we use to decide which column a value should go to. We reset_index() to get a column named "index" which looks like [0, 1, .., 5], groupby on the Name, and then rank each group using the method "first", which simply means 1 corresponds to the first value seen in a group, 2 the second, and so on.

In other words, we build a ranks column looking like

>>> df[["Name", "ranks"]]
   Name  ranks
0   Foo      1
1   Foo      2
2   Foo      3
3  John      1
4   Bar      1
5   Bar      2
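
(As an aside, on pandas versions that have groupby().cumcount(), an equivalent sketch for building the same column is to number rows within each Name group directly:)

df["ranks"] = df.groupby("Name").cumcount() + 1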

Then we make a pivot table, using the identity function as the aggregation function because we're only reshaping. This produces a DataFrame with a MultiIndex for the column index:

                       Price              Text                Number           Choice      
ranks                      1     2    3      1       2      3      1    2    3      1  2  3
Name URL       Email                                                                       
Bar  www.c.com [email protected]  $100  $250       Dudes    Guys           999  200           Y  Y   
Foo  www.a.com [email protected]   $40   $60  $20  Stuff  Things  Other    560  280  120      Y  N  Y
John www.b.com [email protected]   $25              Gals                  1222                N      

(Note: this is actually how I might leave it, if this were the structure I wanted, rather than flattening the columns.)

Finally we collapse the columns:

>>> df2.columns
MultiIndex(levels=[[u'Price', u'Text', u'Number', u'Choice'], [1.0, 2.0, 3.0]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=[None, u'ranks'])
>>> df2.columns.get_values()
array([('Price', 1.0), ('Price', 2.0), ('Price', 3.0), ('Text', 1.0),
       ('Text', 2.0), ('Text', 3.0), ('Number', 1.0), ('Number', 2.0),
       ('Number', 3.0), ('Choice', 1.0), ('Choice', 2.0), ('Choice', 3.0)], dtype=object)

To handle the case of a missing email I'd ffill() based on the name, and to add extra summary columns I'd either use a columnar groupby or simply use a listcomp on the columns. But those are pretty straightforward, whereas the above is a bit tricky.
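
For concreteness, a rough sketch of both of those steps. It assumes a missing Email can be recovered from another row with the same Name (done on df before pivoting), and that df2 is the flattened frame from above, where blanks are empty strings because of fill_value=''; the summary column names are just the ones the question suggested:

# fill a missing Email from another row of the same Name group
df["Email"] = df.groupby("Name")["Email"].apply(lambda s: s.ffill().bfill())

# summary columns on the flattened result, via list comprehensions over the column names
text_cols = [c for c in df2.columns if c.startswith("Text")]
price_cols = [c for c in df2.columns if c.startswith("Price")]

df2["Number of Texts"] = (df2[text_cols] != "").sum(axis=1)
df2["Number of Free Texts"] = (df2[price_cols] == "Free").sum(axis=1)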

1 Comment

I'm getting "File "...\groupby.py", line 917, in _aggregate_series_pure_python | raise ValueError('Function does not reduce') ValueError: Function does not reduce
