
I have a list of event topics retrieved from a collection of tweets. A set of features has been extracted and their values normalized between 0 and 1. An example of an event:

"paris_attack_news-20150107_100842-20150107_112852": {
    "ages": 0.5557594006583049,
    "density": 0.0012022814250710345,
    "followers": 0.1144661871115895,
    "friends": 0.13507755010659472,
    "hashtagCount": 0.033270950301517985,
    "lifespan": 0.29613227044582224,
    "mediaCount": 0.1095890410958904,
    "mentionCount": 0.020275919732441472,
    "objectivity": 0.2850584551023736,
    "polarity": 0.2963684492294102,
    "retweetCount": 0.21431767337807606,
    "status_count": 0.09222093073720204,
    "truth": 1.0,
    "tweetCount": 0.01300578034682081,
    "urlCount": 0.29494007989347537,
    "verified": 0.3392857142857143
}

Now I need to represent each event as an array of its features:

paris_attack_news-20150107_100842-20150107_112852 = [0.5557594006583049, 0.0012022814250710345, 0.1144661871115895, 0.13507755010659472, ...]

After that, I need to manipulate/aggregate the array values in various ways to obtain a specific sorting of the events based on the results.

The data are already in a Python Pandas DataFrame (event name as index, features as columns).

What is the best way (data structure for further storage, or libraries such as NumPy, sklearn, or similar) to build the arrays starting from this?

P.S.: I will then need to apply some machine learning algorithms to detect whether an event is TRUE or FALSE, using the feature named "truth" (1 or 0) as the classification label.
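
For concreteness, here is a rough sketch of the setup I have in mind, built from the example above (the variable names are just illustrative):

    import pandas as pd

    events = {
        "paris_attack_news-20150107_100842-20150107_112852": {
            "ages": 0.5557594006583049,
            "density": 0.0012022814250710345,
            "followers": 0.1144661871115895,
            "truth": 1.0,
            # ... remaining features and events ...
        },
    }

    # Event name as index, features as columns (this is the DataFrame I already have).
    df = pd.DataFrame.from_dict(events, orient="index")

    # Split the label from the features for the later classification step.
    X = df.drop(columns="truth")   # feature matrix, one row per event
    y = df["truth"]                # 1.0 = TRUE event, 0.0 = FALSE event

    # One array of feature values per event, keyed by event name.
    event_arrays = {name: row.to_numpy() for name, row in X.iterrows()}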

  • You already have a Pandas DataFrame; it is the best way I know of to hold a table of features plus a label. Pandas provides many tools to deal with missing values, aggregate, and generate new features, and a DataFrame is easy to use with all the usual machine learning modules such as sklearn. Example: medium.com/simple-ai/… Commented Mar 28, 2018 at 23:29
  • Seems like you are using JSON. Commented Mar 29, 2018 at 1:03

2 Answers


NumPy is best. You can just take

    np.array(list(event.values()))

e.g. with event being the paris_attack dictionary from the question. Note that in Python 3, dict.values() returns a view, so it has to be wrapped in list() before being passed to np.array.

Most ML libraries operate on NumPy arrays.
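
As a minimal sketch, assuming the event dictionaries from the question (the explicit feature ordering is my own addition, so that every event maps to the same column layout):

    import numpy as np

    event = {
        "ages": 0.5557594006583049,
        "density": 0.0012022814250710345,
        "followers": 0.1144661871115895,
        "truth": 1.0,
        # ... remaining features ...
    }

    # Fix the feature order so all events produce the same columns,
    # and keep the "truth" label out of the feature vector.
    feature_names = sorted(k for k in event if k != "truth")
    vector = np.array([event[k] for k in feature_names])
    label = event["truth"]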

  • A better use is for sparse matrices. Commented Mar 29, 2018 at 1:03
  • Dictionaries are better for sparse matrices. Commented Apr 3, 2018 at 17:09
  • Sparse rows can be stacked using csr_matrix; see the sketch after these comments. Commented Apr 3, 2018 at 17:40
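
For reference, a small sketch of what stacking per-event rows with csr_matrix could look like (scipy is assumed here; with dense 0–1 features like the ones above, a plain NumPy array is usually the simpler choice):

    import numpy as np
    from scipy.sparse import csr_matrix, vstack

    # Two illustrative event vectors with the same feature order.
    row_a = csr_matrix(np.array([[0.5557, 0.0012, 0.1144]]))
    row_b = csr_matrix(np.array([[0.2961, 0.1095, 0.0202]]))

    # Stack the per-event rows into one sparse feature matrix.
    X_sparse = vstack([row_a, row_b], format="csr")   # shape (2, 3)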

Lists in Python are extremely powerful and working with a list of lists is not a complicated procedure.

Something like the following may be worth investigating:

    events = []
    for i in range(len(df.index)):
        an_event = []
        an_event.append(df.iloc[i]['ages'])
        an_event.append(df.iloc[i]['density'])
        events.append(an_event)
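
A more concise equivalent using pandas directly (same result, assuming df as described in the question):

    # Select the feature columns of interest and get one list of values per event.
    events = df[['ages', 'density']].values.tolist()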

