Convert string representations of sparse vectors into Pandas dataframe

Question

I have a vector vec where each element is a string representation of a sparse vector.
The output I want is a Pandas DataFrame with the following characteristics:

index: vec index
columns: sparse vector indices
values: sparse vector values

The sparse vectors are encoded with the format <feature_index>:<feature_value>, and records are separated by a single space.

Here are a few rows of example data:

vec = ["70:1.0000 71:1.0000 83:1.0000",
       "3:2.0000 8:2.0000 9:3.0000",
       "3:3.0000 185:1.0000 186:1.0000",
       "3:1.0000 8:1.0000 289:1.0000"]

And here's my expected output:

          185     186     289       3      70      71       8      83       9
index                                                                        
0         NaN     NaN     NaN     NaN  1.0000  1.0000     NaN  1.0000     NaN
1         NaN     NaN     NaN  2.0000     NaN     NaN  2.0000     NaN  3.0000
2      1.0000  1.0000     NaN  3.0000     NaN     NaN     NaN     NaN     NaN
3         NaN     NaN  1.0000  1.0000     NaN     NaN  1.0000     NaN     NaN

I have a working solution using from_records and pivot, but it seems clumsy and inefficient:

import pandas as pd

dense = pd.DataFrame()

for i, row in enumerate(vec):
    tups = []
    for entry in row.split(): 
        tups.append(tuple([x for x in entry.split(':')]))

    dense = pd.concat([dense,
                       (pd.DataFrame
                          .from_records(tups, 
                                        index=[i]*len(tups), 
                                        columns=['key','val'])
                          .reset_index()
                          .pivot(index='index', 
                                 columns='key', 
                                 values='val')
                       )
                     ])

Can anyone suggest a cleaner approach, ideally one that makes better use of Pandas functionality?
The actual dataset I'm working with is rather large, so I'd like to take advantage of the performance optimizations in native Pandas, if possible.

Notes:
- The output index doesn't need to be labeled index.
- This doesn't have to be a pure Pandas solution. For example, I looked a bit at some of the sklearn methods for handling sparsity, but none of them quite seemed appropriate for solving this task.
- I'm not sure this matters, but after this operation I merge the resulting DataFrame (call it dense) with another DataFrame (call this one df), using dense and df indices as merge keys. So in this example, vec indices are [0,1,2,3], and the output dense needs to retain those indices.
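
To make the last note concrete, here is a minimal sketch of the index-based merge described above. The contents of dense and df are made-up placeholders (the label column is hypothetical), and only the index alignment matters:

```python
import pandas as pd

nan = float("nan")

# Placeholder for the parsed result; row labels match the positions in vec.
dense = pd.DataFrame({"3": [nan, 2.0, 3.0, 1.0],
                      "70": [1.0, nan, nan, nan]},
                     index=[0, 1, 2, 3])

# Placeholder for the other DataFrame; "label" is a hypothetical column.
df = pd.DataFrame({"label": ["a", "b", "c", "d"]}, index=[0, 1, 2, 3])

# join aligns on the index of both frames (a left join by default).
merged = df.join(dense)
print(merged)
```

The same result should be obtainable with pd.merge(df, dense, left_index=True, right_index=True); either way, dense has to keep the original vec positions as its index.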

Can you save vec in a different format? What is your source data set? Is it a sparse matrix?
– MaxU - stand with Ukraine, Commented Apr 28, 2017 at 8:26
Unfortunately, I'm stuck handling vec in the format it's in. It comes to me from upstream processes that I don't control.
– andrew_reece, Commented Apr 28, 2017 at 18:17

jezrael · Accepted Answer · 2017-04-28 05:23:08Z


I think you can use a list comprehension: split each string into key:value pairs, convert each row to a dict, and pass the resulting list of dicts to the DataFrame constructor:

print ([dict([y.split(':') for y in (x.split())]) for x in vec])
[{'83': '1.0000', '70': '1.0000', '71': '1.0000'}, 
 {'8': '2.0000', '3': '2.0000', '9': '3.0000'}, 
 {'185': '1.0000', '186': '1.0000', '3': '3.0000'}, 
 {'289': '1.0000', '8': '1.0000', '3': '1.0000'}]

df = pd.DataFrame([dict([y.split(':') for y in (x.split())]) for x in vec])
print (df)
      185     186     289       3      70      71       8      83       9
0     NaN     NaN     NaN     NaN  1.0000  1.0000     NaN  1.0000     NaN
1     NaN     NaN     NaN  2.0000     NaN     NaN  2.0000     NaN  3.0000
2  1.0000  1.0000     NaN  3.0000     NaN     NaN     NaN     NaN     NaN
3     NaN     NaN  1.0000  1.0000     NaN     NaN  1.0000     NaN     NaN

The resulting DataFrame contains NaNs and strings, so casting to a numeric type is necessary:

print (type(df.loc[0,'70']))
<class 'str'>

df = df.astype(float)
print (df)
   185  186  289    3   70   71    8   83    9
0  NaN  NaN  NaN  NaN  1.0  1.0  NaN  1.0  NaN
1  NaN  NaN  NaN  2.0  NaN  NaN  2.0  NaN  3.0
2  1.0  1.0  NaN  3.0  NaN  NaN  NaN  NaN  NaN
3  NaN  NaN  1.0  1.0  NaN  NaN  1.0  NaN  NaN

print (type(df.loc[0,'70']))
<class 'numpy.float64'>
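
As a variation on the steps above, here is a sketch that converts keys and values while parsing, so no separate astype call is needed. The helper name parse_sparse_row is made up for illustration, and using integer column labels is an assumption (the question does not say whether labels should be strings or integers). Column order from a list of dicts may differ between pandas versions, so the columns are sorted explicitly:

```python
import pandas as pd

vec = ["70:1.0000 71:1.0000 83:1.0000",
       "3:2.0000 8:2.0000 9:3.0000",
       "3:3.0000 185:1.0000 186:1.0000",
       "3:1.0000 8:1.0000 289:1.0000"]


def parse_sparse_row(row):
    """Hypothetical helper: turn '3:2.0 8:1.5' into {3: 2.0, 8: 1.5}."""
    pairs = (entry.split(":") for entry in row.split())
    return {int(key): float(val) for key, val in pairs}


# One dict per row; keys missing from a row become NaN in that row.
dense = pd.DataFrame([parse_sparse_row(row) for row in vec])

# Sort the integer column labels so the layout does not depend on
# the order in which features were first seen.
dense = dense.sort_index(axis=1)
print(dense)
```

With integer labels the columns sort numerically (3, 8, 9, 70, ...) rather than lexicographically as in the string-keyed output above. If the later merge expects string labels, keep the keys as strings instead.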

edited Apr 28, 2017 at 5:23

answered Apr 28, 2017 at 5:16


2 Comments

andrew_reece Over a year ago

Thanks @jezrael. Your solution is about two orders of magnitude faster than mine (5.52ms vs 648ms). Really nice!

jezrael Over a year ago

Glad I could help! Have a nice weekend!
