I have a vector vec where each element is a string representation of a sparse vector.
The output I want is a Pandas DataFrame with the following characteristics:
- index: vec index
- columns: sparse vector indices
- values: sparse vector values
The sparse vectors are encoded with the format <feature_index>:<feature_value>, and records are separated by a single space.
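To make the format concrete, here is a minimal sketch of parsing a single record with plain Python. The name parse_record is made up for illustration, and converting the values to float is an assumption on my part; the feature indices stay as strings.

```python
def parse_record(record):
    """Parse '<feature_index>:<feature_value> ...' into a {index: value} dict."""
    # Split on whitespace to get the entries, then on ':' to get each pair.
    pairs = (entry.split(":") for entry in record.split())
    # Keys stay strings; values become floats (an assumption, see above).
    return {key: float(val) for key, val in pairs}


parsed = parse_record("70:1.0000 71:1.0000 83:1.0000")
print(parsed)  # {'70': 1.0, '71': 1.0, '83': 1.0}
```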
Here are a few rows of example data:
    vec = ["70:1.0000 71:1.0000 83:1.0000",
           "3:2.0000 8:2.0000 9:3.0000",
           "3:3.0000 185:1.0000 186:1.0000",
           "3:1.0000 8:1.0000 289:1.0000"]
And here's my expected output:
              185     186     289       3      70      71       8      83       9
    index
    0         NaN     NaN     NaN     NaN  1.0000  1.0000     NaN  1.0000     NaN
    1         NaN     NaN     NaN  2.0000     NaN     NaN  2.0000     NaN  3.0000
    2      1.0000  1.0000     NaN  3.0000     NaN     NaN     NaN     NaN     NaN
    3         NaN     NaN  1.0000  1.0000     NaN     NaN  1.0000     NaN     NaN
I have a working solution using from_records and pivot, but it seems clumsy and inefficient:
    import pandas as pd

    dense = pd.DataFrame()
    for i, row in enumerate(vec):
        # Collect (key, val) string pairs for this row.
        tups = []
        for entry in row.split():
            tups.append(tuple([x for x in entry.split(':')]))
        # Build a one-row frame for this vector and append it to the result.
        dense = pd.concat([dense,
                           (pd.DataFrame
                            .from_records(tups,
                                          index=[i]*len(tups),
                                          columns=['key','val'])
                            .reset_index()
                            .pivot(index='index',
                                   columns='key',
                                   values='val')
                            )
                           ])
Can anyone suggest a cleaner approach, ideally one that makes better use of Pandas functionality?
The actual dataset I'm working with is rather large, so I'd like to take advantage of the performance optimizations in native Pandas, if possible.
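One possible direction, sketched here under assumptions rather than offered as a definitive answer: parse each row into a dict and let the DataFrame constructor align the keys into columns, so there is no per-row concat or pivot. Converting the values to float and sorting the columns as strings are my assumptions, chosen to match the column order in the expected output; I make no claim about how this performs on the full dataset.

```python
import pandas as pd

vec = ["70:1.0000 71:1.0000 83:1.0000",
       "3:2.0000 8:2.0000 9:3.0000",
       "3:3.0000 185:1.0000 186:1.0000",
       "3:1.0000 8:1.0000 289:1.0000"]

# One dict per row: {feature_index (str): feature_value (float)}.
records = [
    {key: float(val) for key, val in (entry.split(":") for entry in row.split())}
    for row in vec
]

# A list of dicts becomes one row per dict; missing keys are filled with NaN.
# The default RangeIndex (0..n-1) matches the positions in vec.
dense = pd.DataFrame(records)

# Sort column labels as strings to match the expected output's column order.
dense = dense.sort_index(axis=1)

print(dense)
```

If vec were a Series with its own index instead of a plain list, passing index=vec.index to the constructor should keep those labels.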
Notes:
- The output index doesn't need to be labeled index.
- This doesn't have to be a pure Pandas solution. For example, I looked a bit at some of the sklearn methods for handling sparsity, but none of them quite seemed appropriate for solving this task.
- I'm not sure this matters, but after this operation I merge the resulting DataFrame (call it dense) with another DataFrame (call this one df), using dense and df indices as merge keys. So in this example, vec indices are [0,1,2,3], and the output dense needs to retain those indices.
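To illustrate the merge step from the last note, here is a minimal sketch of an index-aligned join. Both frames are tiny hypothetical stand-ins: the label column and its values are made up, and only the shared [0,1,2,3] index reflects my actual setup.

```python
import pandas as pd

nan = float("nan")

# Hypothetical stand-in for the dense output, indexed by position in vec.
dense = pd.DataFrame({"3": [nan, 2.0, 3.0, 1.0],
                      "8": [nan, 2.0, nan, 1.0]},
                     index=[0, 1, 2, 3])

# Hypothetical stand-in for df; the 'label' column is invented for this example.
df = pd.DataFrame({"label": ["a", "b", "c", "d"]}, index=[0, 1, 2, 3])

# Join on the index of both frames, keeping every row of df.
merged = df.join(dense, how="left")

print(merged)
```

The same result can be written as pd.merge(df, dense, left_index=True, right_index=True, how="left"); either way, the join only works if dense keeps the vec indices.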
Can you get vec in a different format? What is your source data set? Is it a sparse matrix?
I have to take vec in the format it's in; it's coming to me from upstream processes that I don't have control over.