1

I am using PySpark and am trying to use a CSV to store my data. I converted my NumPy array into a DataFrame, and it was formatted like so:

label   |     0    1     2     4    ...    768
---------------------------------------
  1     |   0.12  0.23  0.31  0.72  ...   0.91

and so on, with each value of a row vector in the array split into its own column. That format is not compatible with Spark, which needs all the features in a single column. Is there a way I can load my array into a DataFrame in that format? For example:

label   |     Features
------------------------------------------
  1     |   [0.12,0.23,0.31,0.72,...,0.91]

I tried following advice from this thread, which detailed merging the columns using the Spark API, but when I load my labels in, I get an error because the labels become part of a vector instead of staying a string or int value.

2 Answers

1

I don't know anything about Spark, but if you want a DataFrame with a column of lists, just do df['Features'] = SOME_2D_LIST_OF_LISTS

import numpy as np
import pandas as pd

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df = pd.DataFrame()
df['Features'] = data  # now you have a column of lists
# If for whatever reason you want each row value to itself be a NumPy array, add
df['Features'] = df['Features'].map(np.array)

If the data is already a NumPy array, just do df['Features'] = data.tolist().
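As a minimal sketch of the NumPy case, assuming a small 2D float array and a separate list of labels (both made up here for illustration, not taken from the question), the labels can stay in their own column while each row of the array becomes one list:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's array: 3 rows, 4 features each.
data = np.array([[0.12, 0.23, 0.31, 0.72],
                 [0.15, 0.21, 0.35, 0.70],
                 [0.11, 0.29, 0.38, 0.75]])
labels = [1, 0, 1]  # hypothetical labels, one per row

# tolist() turns the 2D array into a list of row lists,
# so each cell of "Features" holds one full row.
df = pd.DataFrame({"label": labels, "Features": data.tolist()})
print(df)
```

The label column keeps its plain integer values because it is never mixed into the feature lists.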

0

This should do the trick. Note that I decided to use integers instead of floats for better readability:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(20, 30, size=30).reshape(3, 10))
df.insert(0, "label", [1,2,3])

print(df)

   label   0   1   2   3   4   5   6   7   8   9
0      1  26  27  25  29  20  23  26  25  22  23
1      2  20  20  26  25  23  23  26  24  27  23
2      3  24  22  24  22  26  23  27  22  26  23

Select all of your feature columns (I used iloc here) and convert them to a list of lists.

features = df.iloc[:, 1:].to_numpy().tolist()

print(features)
[[26, 27, 25, 29, 20, 23, 26, 25, 22, 23],
 [20, 20, 26, 25, 23, 23, 26, 24, 27, 23],
 [24, 22, 24, 22, 26, 23, 27, 22, 26, 23]]

Then make a new dataframe with your labels and the new features:

new_df = pd.DataFrame({
    "label": df["label"],
    "features": features
})

print(new_df)

   label                                  features
0      1  [26, 27, 25, 29, 20, 23, 26, 25, 22, 23]
1      2  [20, 20, 26, 25, 23, 23, 26, 24, 27, 23]
2      3  [24, 22, 24, 22, 26, 23, 27, 22, 26, 23]
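If you need this more than once, the steps above could be wrapped in a small helper. This is only a sketch: `pack_features` is a name I made up, and it selects the feature columns by dropping the label column by name rather than using iloc, which assumes every other column is a feature.

```python
import numpy as np
import pandas as pd

def pack_features(wide_df, label_col="label"):
    """Collapse every column except label_col into one list-valued column."""
    # Assumption: all columns other than label_col are features.
    feature_values = wide_df.drop(columns=label_col).to_numpy().tolist()
    return pd.DataFrame({
        label_col: wide_df[label_col].to_numpy(),
        "features": feature_values,
    })

# Small deterministic example: 3 rows with 4 feature columns each.
wide = pd.DataFrame(np.arange(12).reshape(3, 4))
wide.insert(0, "label", [1, 2, 3])

packed = pack_features(wide)
print(packed)
```

Dropping by name means the label column does not have to be the first column, unlike the iloc[:, 1:] selection.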
