Loading Numpy array to single Pandas DataFrame column

Question

I am using PySpark and am trying to use a CSV to store my data. I converted my NumPy array into a DataFrame, and it was formatted like so:

label   |     0    1     2     4    ...    768
---------------------------------------
  1     |   0.12  0.23  0.31  0.72  ...   0.91

and so on, with each value of a row vector in the array split into its own column. That format is not compatible with Spark, which needs all the features in one column. Is there a way I can load my array into a DataFrame in that format? For example:

label   |     Features
------------------------------------------
  1     |   [0.12,0.23,0.31,0.72,...,0.91]

I tried following advice from this thread, which detailed merging the columns using the Spark API, but when loading my labels in, I get an error because the labels become part of a vector instead of staying a string or int value.

RSHAP · Accepted Answer · 2020-09-30 22:43:36Z

1

I don't know anything about Spark, but if you want a DataFrame with a column of lists, just do df['Features'] = SOME_2D_LIST_OF_LISTS:

import numpy as np
import pandas as pd

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df = pd.DataFrame()
df['Features'] = data  # now you have a column of lists
# If for whatever reason you want each row value to itself be a NumPy array, add:
df['Features'] = df['Features'].map(np.array)

If the data is already a NumPy array, just do df['Features'] = data.tolist().
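For the layout in the question (one label plus one feature vector per row), the same idea might look like the sketch below. The names `labels` and `X` and their values are made up for illustration; the only assumption is that the features sit in a 2D NumPy array with one row per sample.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: one label per sample and a 2D array of features.
labels = np.array([1, 0, 1])
X = np.array([
    [0.12, 0.23, 0.31],
    [0.72, 0.91, 0.05],
    [0.44, 0.18, 0.67],
])

# X.tolist() yields one Python list per row, so each DataFrame cell
# in "features" holds the full feature vector for that sample.
df = pd.DataFrame({"label": labels, "features": X.tolist()})
print(df)
```

The label column stays a plain integer column here, separate from the feature lists.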

Cameron Riddell · Accepted Answer · 2020-09-30 22:45:36Z

This should do the trick. Note that I used integers rather than floats for readability:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(20, 30, size=30).reshape(3, 10))
df.insert(0, "label", [1,2,3])

print(df)

   label   0   1   2   3   4   5   6   7   8   9
0      1  26  27  25  29  20  23  26  25  22  23
1      2  20  20  26  25  23  23  26  24  27  23
2      3  24  22  24  22  26  23  27  22  26  23

Select all of your feature columns (I used iloc here) and convert them to a list of lists.

features = df.iloc[:, 1:].to_numpy().tolist()

print(features)
[[26, 27, 25, 29, 20, 23, 26, 25, 22, 23],
 [20, 20, 26, 25, 23, 23, 26, 24, 27, 23],
 [24, 22, 24, 22, 26, 23, 27, 22, 26, 23]]
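If the label column is not guaranteed to be first, selecting the features by name rather than by position may be safer. This is a minimal sketch of that variant on a small made-up frame, assuming every column other than "label" is a feature:

```python
import numpy as np
import pandas as pd

# Small stand-in frame: four feature columns plus a label column.
df = pd.DataFrame(np.arange(12).reshape(3, 4))
df.insert(0, "label", [1, 2, 3])

# Drop the label by name; whatever remains is treated as features.
features = df.drop(columns="label").to_numpy().tolist()
print(features)
```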

Then make a new dataframe with your labels and the new features:

new_df = pd.DataFrame({
    "label": df["label"],
    "features": features
})

print(new_df)

   label                                  features
0      1  [26, 27, 25, 29, 20, 23, 26, 25, 22, 23]
1      2  [20, 20, 26, 25, 23, 23, 26, 24, 27, 23]
2      3  [24, 22, 24, 22, 26, 23, 27, 22, 26, 23]
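Since the question mentions storing the data as CSV, note that a column of lists is written out as text and comes back as strings when read with pandas. One possible way to round-trip it, sketched here with made-up values and an in-memory buffer standing in for a file, is to encode each list as JSON before writing and decode it on read:

```python
import io
import json

import pandas as pd

# Hypothetical frame with one list of features per row.
df = pd.DataFrame({
    "label": [1, 2],
    "features": [[0.12, 0.23, 0.31], [0.72, 0.91, 0.05]],
})

# Encode each list as a JSON string so the CSV cell has a well-defined format.
encoded = df.assign(features=df["features"].map(json.dumps))
buffer = io.StringIO()  # stands in for a real file path
encoded.to_csv(buffer, index=False)

# On read, the converter turns each JSON string back into a Python list.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"features": json.loads})
print(restored)
```

This only covers the pandas side; how Spark should parse such a column is a separate matter not addressed here.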
