I have a vector vec where each element is a string representation of a sparse vector.
The output I want is a Pandas DataFrame with the following characteristics:

index: vec index
columns: sparse vector indices
values: sparse vector values

The sparse vectors are encoded with the format <feature_index>:<feature_value>, and records are separated by a single space.

Here are a few rows of example data:

vec = ["70:1.0000 71:1.0000 83:1.0000",
       "3:2.0000 8:2.0000 9:3.0000",
       "3:3.0000 185:1.0000 186:1.0000",
       "3:1.0000 8:1.0000 289:1.0000"]

And here's my expected output:

          185     186     289       3      70      71       8      83       9
index                                                                        
0         NaN     NaN     NaN     NaN  1.0000  1.0000     NaN  1.0000     NaN
1         NaN     NaN     NaN  2.0000     NaN     NaN  2.0000     NaN  3.0000
2      1.0000  1.0000     NaN  3.0000     NaN     NaN     NaN     NaN     NaN
3         NaN     NaN  1.0000  1.0000     NaN     NaN  1.0000     NaN     NaN

I have a working solution using from_records and pivot, but it seems clumsy and inefficient:

import pandas as pd

dense = pd.DataFrame()

for i, row in enumerate(vec):
    tups = []
    for entry in row.split(): 
        tups.append(tuple([x for x in entry.split(':')]))

    dense = pd.concat([dense,
                       (pd.DataFrame
                          .from_records(tups, 
                                        index=[i]*len(tups), 
                                        columns=['key','val'])
                          .reset_index()
                          .pivot(index='index', 
                                 columns='key', 
                                 values='val')
                       )
                     ])

Can anyone suggest a cleaner approach, ideally one that makes better use of Pandas functionality?
The actual dataset I'm working with is rather large, so I'd like to take advantage of the performance optimizations in native Pandas, if possible.

Notes:
- The output index doesn't need to be labeled index.
- This doesn't have to be a pure Pandas solution. For example, I looked a bit at some of the sklearn methods for handling sparsity, but none of them quite seemed appropriate for solving this task.
- I'm not sure this matters, but after this operation I merge the resulting DataFrame (call it dense) with another DataFrame (call this one df), using dense and df indices as merge keys. So in this example, vec indices are [0,1,2,3], and the output dense needs to retain those indices.

  • Do you have a chance to save your vec in a different format? What is your source data set? Is it a sparse matrix? Commented Apr 28, 2017 at 8:26
  • Unfortunately, I'm stuck handling vec in the format it's in - it's coming to me from upstream processes that I don't have control over. Commented Apr 28, 2017 at 18:17

1 Answer

I think you can use a list comprehension: split each string into key/value pairs, build one dict per row, and pass the resulting list of dicts to the DataFrame constructor:

print ([dict([y.split(':') for y in (x.split())]) for x in vec])
[{'83': '1.0000', '70': '1.0000', '71': '1.0000'}, 
 {'8': '2.0000', '3': '2.0000', '9': '3.0000'}, 
 {'185': '1.0000', '186': '1.0000', '3': '3.0000'}, 
 {'289': '1.0000', '8': '1.0000', '3': '1.0000'}]

df = pd.DataFrame([dict([y.split(':') for y in (x.split())]) for x in vec])
print (df)
      185     186     289       3      70      71       8      83       9
0     NaN     NaN     NaN     NaN  1.0000  1.0000     NaN  1.0000     NaN
1     NaN     NaN     NaN  2.0000     NaN     NaN  2.0000     NaN  3.0000
2  1.0000  1.0000     NaN  3.0000     NaN     NaN     NaN     NaN     NaN
3     NaN     NaN  1.0000  1.0000     NaN     NaN  1.0000     NaN     NaN

The resulting DataFrame contains NaNs and strings, so the values need to be cast to a numeric type:

print (type(df.loc[0,'70']))
<class 'str'>

df = df.astype(float)
print (df)
   185  186  289    3   70   71    8   83    9
0  NaN  NaN  NaN  NaN  1.0  1.0  NaN  1.0  NaN
1  NaN  NaN  NaN  2.0  NaN  NaN  2.0  NaN  3.0
2  1.0  1.0  NaN  3.0  NaN  NaN  NaN  NaN  NaN
3  NaN  NaN  1.0  1.0  NaN  NaN  1.0  NaN  NaN

print (type(df.loc[0,'70']))
<class 'numpy.float64'>
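
The two steps above can also be folded into a single pass by converting each value to float while parsing, so no astype call is needed afterwards. This is only a minimal sketch, assuming every entry is a well-formed <feature_index>:<feature_value> pair. The helper name sparse_strings_to_frame and the frame called other are hypothetical, and the numeric column sort is an optional addition, since the column order produced from a list of dicts can differ between pandas versions.

```python
import pandas as pd


def sparse_strings_to_frame(vec, index=None):
    """Hypothetical helper: parse 'idx:val idx:val ...' strings into a DataFrame."""
    # One dict per input string; values are converted to float while parsing.
    rows = [
        {key: float(val) for key, val in (entry.split(":") for entry in row.split())}
        for row in vec
    ]
    frame = pd.DataFrame(rows, index=index)
    # Column labels stay strings; order them by their numeric feature index.
    return frame.reindex(columns=sorted(frame.columns, key=int))


vec = ["70:1.0000 71:1.0000 83:1.0000",
       "3:2.0000 8:2.0000 9:3.0000",
       "3:3.0000 185:1.0000 186:1.0000",
       "3:1.0000 8:1.0000 289:1.0000"]

dense = sparse_strings_to_frame(vec)

# Hypothetical second frame with the same 0..3 index, standing in for the asker's df.
other = pd.DataFrame({"label": ["a", "b", "c", "d"]})
merged = other.join(dense)  # join aligns the two frames on their indices
```

Because the default index matches the positions in vec, the result can be joined to another DataFrame on the index directly. If the upstream rows carry their own labels, they can be passed through the index argument instead.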
2 Comments

Thanks @jezrael. Your solution is about two orders of magnitude faster than mine (5.52ms vs 648ms). Really nice!
Glad I could help! Have a nice weekend!