
I have a list of event topics retrieved from a collection of tweets. A set of features has been extracted and their values normalized between 0 and 1. An example of an event:

"paris_attack_news-20150107_100842-20150107_112852": {
    "ages": 0.5557594006583049,
    "density": 0.0012022814250710345,
    "followers": 0.1144661871115895,
    "friends": 0.13507755010659472,
    "hashtagCount": 0.033270950301517985,
    "lifespan": 0.29613227044582224,
    "mediaCount": 0.1095890410958904,
    "mentionCount": 0.020275919732441472,
    "objectivity": 0.2850584551023736,
    "polarity": 0.2963684492294102,
    "retweetCount": 0.21431767337807606,
    "status_count": 0.09222093073720204,
    "truth": 1.0,
    "tweetCount": 0.01300578034682081,
    "urlCount": 0.29494007989347537,
    "verified": 0.3392857142857143
}

Now I need to represent each event as an array of its features:

paris_attack_news-20150107_100842-20150107_112852 = [0.5557594006583049, 0.0012022814250710345, 0.1144661871115895, 0.13507755010659472, ...]

After that, I need to manipulate/aggregate the array values in various ways to obtain a specific sorting of the events based on the results.

The data are already in a Python Pandas DataFrame (event name as index, features as columns).

What is the best way (data structure for further storage, or libraries such as NumPy, sklearn, or similar) to build the arrays starting from this?

P.S.: I will then need to apply some machine learning algorithms to detect whether an event is TRUE or FALSE, using the feature named "truth" (1 or 0) as the classification label.
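
For concreteness, here is a rough sketch of the setup I have in mind, built from the example above (the variable names are just illustrative):

    import pandas as pd

    events = {
        "paris_attack_news-20150107_100842-20150107_112852": {
            "ages": 0.5557594006583049,
            "density": 0.0012022814250710345,
            "followers": 0.1144661871115895,
            "truth": 1.0,
            # ... remaining features and events ...
        },
    }

    # Event name as index, features as columns (this is the DataFrame I already have).
    df = pd.DataFrame.from_dict(events, orient="index")

    # Split the label from the features for the later classification step.
    X = df.drop(columns="truth")   # feature matrix, one row per event
    y = df["truth"]                # 1.0 = TRUE event, 0.0 = FALSE event

    # One array of feature values per event, keyed by event name.
    event_arrays = {name: row.to_numpy() for name, row in X.iterrows()}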

  • You already have a Pandas DataFrame; it is the best way I know of to hold a table of features plus a label. Pandas provides many tools to deal with missing values, aggregate, and generate new features, and a DataFrame is easy to use with all the usual machine learning modules such as sklearn. Example: medium.com/simple-ai/… Commented Mar 28, 2018 at 23:29
  • Seems like you are using JSON. Commented Mar 29, 2018 at 1:03

2 Answers


NumPy is best. You can just take

    np.array(list(event.values()))

e.g. with event being the paris_attack dictionary from the question. Note that in Python 3, dict.values() returns a view, so it has to be wrapped in list() before being passed to np.array.

Most ML libraries operate on NumPy arrays.
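
As a minimal sketch, assuming the event dictionaries from the question (the explicit feature ordering is my own addition, so that every event maps to the same column layout):

    import numpy as np

    event = {
        "ages": 0.5557594006583049,
        "density": 0.0012022814250710345,
        "followers": 0.1144661871115895,
        "truth": 1.0,
        # ... remaining features ...
    }

    # Fix the feature order so all events produce the same columns,
    # and keep the "truth" label out of the feature vector.
    feature_names = sorted(k for k in event if k != "truth")
    vector = np.array([event[k] for k in feature_names])
    label = event["truth"]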

  • A better use is for sparse matrices. Commented Mar 29, 2018 at 1:03
  • Dictionaries are better for sparse matrices. Commented Apr 3, 2018 at 17:09
  • Sparse rows can be stacked using csr_matrix; see the sketch after these comments. Commented Apr 3, 2018 at 17:40
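
For reference, a small sketch of what stacking per-event rows with csr_matrix could look like (scipy is assumed here; with dense 0–1 features like the ones above, a plain NumPy array is usually the simpler choice):

    import numpy as np
    from scipy.sparse import csr_matrix, vstack

    # Two illustrative event vectors with the same feature order.
    row_a = csr_matrix(np.array([[0.5557, 0.0012, 0.1144]]))
    row_b = csr_matrix(np.array([[0.2961, 0.1095, 0.0202]]))

    # Stack the per-event rows into one sparse feature matrix.
    X_sparse = vstack([row_a, row_b], format="csr")   # shape (2, 3)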

Lists in Python are extremely powerful and working with a list of lists is not a complicated procedure.

Something like the following may be worth investigating:

    events = []
    for i in range(len(df.index)):
        an_event = []
        an_event.append(df.iloc[i]['ages'])
        an_event.append(df.iloc[i]['density'])
        events.append(an_event)
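
A more concise equivalent using pandas directly (same result, assuming df as described in the question):

    # Select the feature columns of interest and get one list of values per event.
    events = df[['ages', 'density']].values.tolist()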

