Suppose I have 100 files and loop through all of them. Each file contains records of several attributes (the total number of attributes is not known until all the files have been read).

As a simple case, assume that after reading all the files we have found 20 different attributes and the following information:

File_001: a1, a3, a5, a2
File_002: a1, a3
File_003: a4
File_004: a4, a2, a6
File_005: a7, a8, a9
...
File_100: a19, a20

[Update] Or in another representation, where each line is a single match between one File and one attribute:

File_001: a1
File_001: a3
File_001: a5
File_001: a2
File_002: a1
File_002: a3
File_003: a4
File_004: a4
File_004: a2
File_004: a6
...
File_100: a19
File_100: a20

How can I generate the "reverse" statistics table, i.e.:

a1: File_001, File_002, File_006, File_083
a2: File_001, File_004
...
a20: File_099, File_100

How can I do this in Python (2.7.x), with or without Pandas? I think Pandas might help.

2 Answers

UPDATE2: How can I generate the "reverse" statistics table

In [9]: df
Out[9]:
        file attr
0   File_001   a1
1   File_001   a3
2   File_001   a5
3   File_001   a2
4   File_002   a1
5   File_002   a3
6   File_003   a4
7   File_004   a4
8   File_004   a2
9   File_004   a6
10  File_100  a19
11  File_100  a20

In [10]: df.groupby('attr')['file'].apply(list)
Out[10]:
attr
a1     [File_001, File_002]
a19              [File_100]
a2     [File_001, File_004]
a20              [File_100]
a3     [File_001, File_002]
a4     [File_003, File_004]
a5               [File_001]
a6               [File_004]
Name: file, dtype: object

UPDATE:

How can I set output[202] as DataFrame?

new = (df.set_index('file')
         .apply(lambda x: pd.Series(x['attr']), axis=1)
         .stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='attr')
         .groupby('attr')['file']
         .apply(list)
         .reset_index()   # turn the resulting Series into a DataFrame with columns 'attr' and 'file'
)

so I can export it to html or csv?

new.to_csv('/path/to/file.csv', index=False)

or

html_text = new.to_html(index=False)
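
If the exported file should contain plain comma-separated file names (as in the "reverse" table in the question) rather than Python list representations, the names can be joined into strings before exporting. This is only a minimal sketch: it assumes the long-format df from UPDATE2 (one file/attr pair per row), and the sample rows and the column name files are made up for illustration.

```python
import pandas as pd

# Sample long-format data: one file/attribute pair per row, as in UPDATE2.
df = pd.DataFrame({
    'file': ['File_001', 'File_001', 'File_002', 'File_004'],
    'attr': ['a1', 'a2', 'a1', 'a2'],
})

# Join the file names of each attribute into a single string, then turn
# the resulting Series into a two-column DataFrame.
joined = (df.groupby('attr')['file']
            .apply(', '.join)
            .reset_index(name='files'))

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = joined.to_csv(index=False)
html_text = joined.to_html(index=False)
```

Passing a path to to_csv writes the file directly, as in the snippet above.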

Original answer:

Here is a pandas solution:

Original DF:

In [201]: df
Out[201]:
       file              attr
0  File_001  [a1, a3, a5, a2]
1  File_002          [a1, a3]
2  File_003              [a4]
3  File_004      [a4, a2, a6]
4  File_005      [a7, a8, a9]
5  File_100        [a19, a20]

Solution:

In [202]: %paste
(df.set_index('file')
   .apply(lambda x: pd.Series(x['attr']), axis=1)
   .stack()
   .reset_index(level=1, drop=True)
   .reset_index(name='attr')
   .groupby('attr')['file']
   .apply(list)
)
## -- End pasted text --

Output:

Out[202]:
attr
a1     [File_001, File_002]
a19              [File_100]
a2     [File_001, File_004]
a20              [File_100]
a3     [File_001, File_002]
a4     [File_003, File_004]
a5               [File_001]
a6               [File_004]
a7               [File_005]
a8               [File_005]
a9               [File_005]
Name: file, dtype: object

5 Comments

Thanks! It works perfectly! How can I set output[202] as a DataFrame, so I can export it to HTML or CSV? The result doesn't seem to have a method for exporting...
And what if my original DF has only one attribute on each line, e.g. File_001 a1 (newline) File_001 a2 (newline) File_002 a1, etc.? How should I adjust your compound code line to get the desired output (as a DF as well)?
@JimRaynor, "And if I have original DF with only one attribute on each line, e.g. File_001 a1 (newline)" - I don't get it. Could you add the output of print(df.head(10)) to your question?
I updated the question for the part I asked. Thanks ;)
@JimRaynor, please see UPDATE2

While reading the files, keep a map from each attribute to the names of the files that contain it. For each attribute you read, check whether it is already a key of the map. If it is not, add it with the current file name as its first value; if it is, just add the file name to that key's values.
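
A minimal sketch of that idea without Pandas, using collections.defaultdict so that a missing key is created automatically. The file_attrs dict is a hypothetical stand-in for whatever the file-reading loop produces, and the invert helper is a name chosen here for illustration, not something from the question. The code should run on both Python 2.7 and Python 3.

```python
from collections import defaultdict

# Hypothetical stand-in for the result of reading the files:
# each file name mapped to the attributes found in that file.
file_attrs = {
    'File_001': ['a1', 'a3', 'a5', 'a2'],
    'File_002': ['a1', 'a3'],
    'File_003': ['a4'],
    'File_004': ['a4', 'a2', 'a6'],
}


def invert(file_attrs):
    """Return a dict mapping each attribute to the files that contain it."""
    attr_files = defaultdict(list)
    # Visit files in sorted order so each attribute's file list is sorted too.
    for fname in sorted(file_attrs):
        for attr in file_attrs[fname]:
            # Skip repeats in case a file lists the same attribute twice.
            if fname not in attr_files[attr]:
                attr_files[attr].append(fname)
    return dict(attr_files)


attr_files = invert(file_attrs)

# Print one "attribute: files" line per attribute.
for attr in sorted(attr_files):
    print('%s: %s' % (attr, ', '.join(attr_files[attr])))
```

In a real loop over the files, the inner append can be done directly while each file is being parsed, so the intermediate file_attrs dict is not needed.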
