Generate pivot data in Python

Question

Suppose I have 100 files and loop through all of them. Each file contains records of several attributes (the total number of attributes is not known until all the files have been read).

Assume a simple case in which, after reading all the files, we obtain 20 different attributes and the following information:

File_001: a1, a3, a5, a2
File_002: a1, a3
File_003: a4
File_004: a4, a2, a6
File_005: a7, a8, a9
...
File_100: a19, a20

[Update] Or in another representation, where each line is a single match between one File and one attribute:

File_001: a1
File_001: a3
File_001: a5
File_001: a2
File_002: a1
File_002: a3
File_003: a4
File_004: a4
File_004: a2
File_004: a6
...
File_100: a19
File_100: a20

How can I generate the "reverse" statistics table, i.e.:

a1: File_001, File_002, File_006, File_083
a2: File_001, File_004
...
a20: File_099, File_100

How can I do this in Python (2.7.x), with or without pandas? I think pandas might help.

MaxU - stand with Ukraine · Accepted Answer · 2016-07-03 18:29:14Z

4

UPDATE2: How can I generate the "reverse" statistics table

In [9]: df
Out[9]:
        file attr
0   File_001   a1
1   File_001   a3
2   File_001   a5
3   File_001   a2
4   File_002   a1
5   File_002   a3
6   File_003   a4
7   File_004   a4
8   File_004   a2
9   File_004   a6
10  File_100  a19
11  File_100  a20

In [10]: df.groupby('attr')['file'].apply(list)
Out[10]:
attr
a1     [File_001, File_002]
a19              [File_100]
a2     [File_001, File_004]
a20              [File_100]
a3     [File_001, File_002]
a4     [File_003, File_004]
a5               [File_001]
a6               [File_004]
Name: file, dtype: object

UPDATE:

How can I set output[202] as DataFrame?

new = (df.set_index('file')
         .apply(lambda x: pd.Series(x['attr']), axis=1)
         .stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='attr')
         .groupby('attr')['file']
         .apply(list)
         .reset_index()   # turn the resulting Series into a DataFrame
)

so I can export it to html or csv?

new.to_csv('/path/to/file.csv', index=False)

or

html_text = new.to_html(index=False)
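One thing to watch when exporting: a column of Python lists is written out as its repr (for example "['File_001', 'File_002']"), which is awkward to read back. Below is a minimal sketch, not part of the original answer, that joins the file names into a plain string before exporting. It assumes the one-attribute-per-row layout from UPDATE2; the names `table` and `files` are made up for illustration.

```python
import pandas as pd

# Long-form input, one (file, attr) pair per row, as in UPDATE2.
df = pd.DataFrame({
    "file": ["File_001", "File_001", "File_001", "File_001",
             "File_002", "File_002", "File_003", "File_004",
             "File_004", "File_004", "File_100", "File_100"],
    "attr": ["a1", "a3", "a5", "a2", "a1", "a3",
             "a4", "a4", "a2", "a6", "a19", "a20"],
})

# Join the file names of each attribute into one comma-separated string,
# then turn the resulting Series into a two-column DataFrame.
table = (df.groupby("attr")["file"]
           .apply(", ".join)
           .reset_index(name="files"))

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = table.to_csv(index=False)
html_text = table.to_html(index=False)
print(csv_text)
```

Passing a path to `to_csv` instead writes the file directly, as in the answer above.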

Original answer:

Here is a pandas solution:

Original DF:

In [201]: df
Out[201]:
       file              attr
0  File_001  [a1, a3, a5, a2]
1  File_002          [a1, a3]
2  File_003              [a4]
3  File_004      [a4, a2, a6]
4  File_005      [a7, a8, a9]
5  File_100        [a19, a20]

Solution:

In [202]: %paste
(df.set_index('file')
   .apply(lambda x: pd.Series(x['attr']), axis=1)
   .stack()
   .reset_index(level=1, drop=True)
   .reset_index(name='attr')
   .groupby('attr')['file']
   .apply(list)
)
## -- End pasted text --

Output:

Out[202]:
attr
a1     [File_001, File_002]
a19              [File_100]
a2     [File_001, File_004]
a20              [File_100]
a3     [File_001, File_002]
a4     [File_003, File_004]
a5               [File_001]
a6               [File_004]
a7               [File_005]
a8               [File_005]
a9               [File_005]
Name: file, dtype: object
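For readers on a newer pandas: the `apply(pd.Series)` / `stack` / `reset_index` steps above can be replaced with `DataFrame.explode`. This is a swapped-in technique, not what the answer used, and it needs pandas 0.25 or later, which does not run on Python 2.7. A minimal sketch with the same sample data (`long_df` and `result` are illustrative names):

```python
import pandas as pd

# Same layout as the original DF: one row per file, attributes as a list.
df = pd.DataFrame({
    "file": ["File_001", "File_002", "File_003",
             "File_004", "File_005", "File_100"],
    "attr": [["a1", "a3", "a5", "a2"], ["a1", "a3"], ["a4"],
             ["a4", "a2", "a6"], ["a7", "a8", "a9"], ["a19", "a20"]],
})

# explode() emits one row per list element and repeats the file name,
# which gives the long (file, attr) form in a single step.
long_df = df.explode("attr")

# Group by attribute and collect the files that contain it.
result = long_df.groupby("attr")["file"].apply(list)
print(result)
```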

edited Jul 3, 2016 at 18:29

answered Jun 26, 2016 at 22:10

MaxU - stand with Ukraine


5 Comments

Jim Raynor Over a year ago

Thanks! It works perfectly! How can I set output[202] as a DataFrame, so I can export it to html or csv? The result does not seem to have a method for exporting...

Jim Raynor Over a year ago

And what if I have the original DF with only one attribute on each line, e.g. File_001 a1 (newline) File_001 a2 (newline) File_002 a1, etc.? How should I adjust your compound code line to achieve the desired output (as a DF as well)?

MaxU - stand with Ukraine Over a year ago

@JimRaynor, "And if I have original DF with only one attribute on each line, e.g. File_001 a1 (newline)" - I don't get it. Could you post the output of print(df.head(10)) in your question?

Jim Raynor Over a year ago

I updated the question for the part I asked. Thanks ;)

MaxU - stand with Ukraine Over a year ago

@JimRaynor, please see UPDATE2

Özgür Eroğlu · Answer · 2016-06-26 22:01:09Z

0

While reading the files, keep a map (a dictionary) from each attribute to the file names it appears in. For each attribute you read, check whether it is already a key of the map. If it is not, add it with the current file name as its first value; if it is, just append the file name to that key's values.
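A minimal sketch of that idea without pandas, which should work on Python 2.7 as well as 3.x. The `records` list stands in for whatever you get while looping over the files, and `invert` is just an illustrative name. Using `collections.defaultdict` means the "is the key already there?" check happens implicitly.

```python
from collections import defaultdict

# Hypothetical input: (file name, attribute) pairs, as they would be
# produced while looping over the files.
records = [
    ("File_001", "a1"), ("File_001", "a3"), ("File_001", "a5"),
    ("File_001", "a2"), ("File_002", "a1"), ("File_002", "a3"),
    ("File_003", "a4"), ("File_004", "a4"), ("File_004", "a2"),
    ("File_004", "a6"), ("File_100", "a19"), ("File_100", "a20"),
]


def invert(pairs):
    """Map each attribute to the list of files it was seen in."""
    files_by_attr = defaultdict(list)
    for file_name, attr in pairs:
        # defaultdict creates the empty list on first access, so a new
        # attribute and an existing one are handled the same way.
        files_by_attr[attr].append(file_name)
    return dict(files_by_attr)


files_by_attr = invert(records)
for attr in sorted(files_by_attr):
    print("%s: %s" % (attr, ", ".join(files_by_attr[attr])))
```

If the same attribute can occur more than once in a file, the file name will be repeated in the list; collecting into a `set` instead of a `list` would avoid that, at the cost of losing the reading order.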

answered Jun 26, 2016 at 22:01

Özgür Eroğlu
