I have the following dataset:
Date Text
2020/05/12 Include details about your goal
2020/05/12 Describe expected and actual results
2020/05/13 Include any error messages
2020/05/13 The community is here to help you
2020/05/14 Avoid asking opinion-based questions.
I removed punctuation, stop words, etc. from it to prepare it for exploding:
import string
import pandas as pd
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
# punctuation to remove
punctuation = string.punctuation.replace("'", '') # don't remove apostrophes from strings
punc = r'[{}]'.format(punctuation)
# regex=True is needed on pandas 2.0+, where str.replace matches literally by default
df.Text = df.Text.str.replace(r'\d+', '', regex=True) # remove numbers
df.Text = df.Text.str.replace(punc, ' ', regex=True) # remove punctuation except apostrophes
df.Text = df.Text.str.replace(r'\s+', ' ', regex=True) # collapse runs of whitespace into one space
df.Text = df.Text.str.strip() # remove whitespace from beginning and end of string
df.Text = df.Text.str.lower() # convert all to lowercase
df.dropna(inplace=True)
df.Text = df.Text.apply(lambda x: [word for word in x.split() if word not in stop_words]) # split into words and drop stop words
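A minimal, self-contained sketch of the cleaning steps above, run on the sample rows. The stop word set here is a small hypothetical stand-in for the NLTK list (so the example runs without downloading the corpus), and re.escape is an added precaution that is not in the original code. Under these assumptions every one of the five rows survives the cleaning, which may help narrow down where rows are lost in the real data.

```python
import re
import string

import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Date": ["2020/05/12", "2020/05/12", "2020/05/13", "2020/05/13", "2020/05/14"],
    "Text": [
        "Include details about your goal",
        "Describe expected and actual results",
        "Include any error messages",
        "The community is here to help you",
        "Avoid asking opinion-based questions.",
    ],
})

# Hypothetical stand-in for stopwords.words('english').
stop_words = {"about", "your", "and", "any", "the", "is", "here", "to", "you"}

# Character class matching every punctuation mark except the apostrophe.
punctuation = string.punctuation.replace("'", "")
punc = "[{}]".format(re.escape(punctuation))

df["Text"] = (
    df["Text"]
    .str.replace(r"\d+", "", regex=True)   # remove numbers
    .str.replace(punc, " ", regex=True)    # punctuation -> space
    .str.replace(r"\s+", " ", regex=True)  # collapse whitespace
    .str.strip()
    .str.lower()
)
df = df.dropna(subset=["Text"])

# Split each row into words and drop the stop words.
df["Text"] = df["Text"].apply(
    lambda text: [word for word in text.split() if word not in stop_words]
)

print(df)
```

Printing len(df) before and after each step on the real data is one way to find the step at which the other rows disappear.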
However, this works only for the first row, not for all rows. The next step would be:
df_1 = df.explode('Text')
Can you please tell me what is wrong?
The first row is split as follows (New_Text shows the text after cleaning):
Text                             New_Text
Include details about your goal  ['include','details','goal']
I get no other rows (nothing starting with 'Describe...' or 'Avoid...'). My dataset has 1942 rows, but only 1 is returned after cleaning the text.
Update:
Example of output:
Date Text
2020/05/12 Include
2020/05/12 details
2020/05/12 goal
.... ...
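For reference, a small sketch of how explode produces this one-word-per-row shape once the Text column holds lists. The two rows are hypothetical cleaned values based on the sample data, and DataFrame.explode requires pandas 0.25 or newer.

```python
import pandas as pd

# Hypothetical cleaned rows: each Text cell is already a list of words.
df = pd.DataFrame({
    "Date": ["2020/05/12", "2020/05/12"],
    "Text": [
        ["include", "details", "goal"],
        ["describe", "expected", "actual", "results"],
    ],
})

# One output row per list element; the Date value is repeated for each word.
df_1 = df.explode("Text").reset_index(drop=True)

print(df_1)
```

Note that explode only splits cells that are list-like; a cell that is still a plain string is kept as a single row.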
Fixed issue (now it should work):
I think the code below should give me this result:
(pd.melt(test.Text.apply(pd.Series).reset_index(),
         id_vars=['Date'],
         value_name='Text')
 .set_index(['Date'])
 .drop('variable', axis=1)
 .dropna()
 .sort_index()
)
Date has to be the index of test before running the code above. To convert it: test = test.set_index(['Date'])
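The melt approach above can be sketched end to end as follows, with the set_index call placed first. The two-row frame is a hypothetical stand-in for the cleaned data, and the order of words within a date is not guaranteed because sort_index is not stable by default.

```python
import pandas as pd

# Hypothetical cleaned frame: Text cells are lists of different lengths.
test = pd.DataFrame({
    "Date": ["2020/05/12", "2020/05/13"],
    "Text": [["include", "details", "goal"], ["community", "help"]],
})

# Date must be the index so that reset_index() brings it back as a column.
test = test.set_index(["Date"])

# Spread each list over columns 0, 1, 2, ... (shorter lists are padded with NaN).
wide = test.Text.apply(pd.Series).reset_index()

result = (
    pd.melt(wide, id_vars=["Date"], value_name="Text")
    .set_index(["Date"])
    .drop("variable", axis=1)  # column position is no longer needed
    .dropna()                  # remove the NaN padding
    .sort_index()
)

print(result)
```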