I have the following dataset:

Date            Text
2020/05/12    Include details about your goal
2020/05/12    Describe expected and actual results
2020/05/13    Include any error messages
2020/05/13    The community is here to help you 
2020/05/14    Avoid asking opinion-based questions.

I removed punctuation, stopwords, etc. in order to prepare it for exploding:

    import re
    import string
    from nltk.corpus import stopwords

    stop_words = stopwords.words('english')

    # punctuation to remove
    punctuation = string.punctuation.replace("'", '')  # don't remove apostrophe from strings
    punc = r'[{}]'.format(re.escape(punctuation))  # escape so every character is literal inside the class

    df.Text = df.Text.str.replace(r'\d+', '', regex=True)  # remove numbers
    df.Text = df.Text.str.replace(punc, ' ', regex=True)  # remove punctuation except apostrophe
    df.Text = df.Text.str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace into one space
    df.Text = df.Text.str.strip()  # remove whitespace from beginning and end of string
    df.Text = df.Text.str.lower()  # convert all to lowercase
    df.dropna(inplace=True)  # drop rows with missing values
    df.Text = df.Text.apply(lambda x: [word for word in x.split() if word not in stop_words])  # remove stop words
    

However, it works only for the first row, not for all the rows. The next step would be:

    df_1 = df.explode('Text')

Can you please tell me what is wrong?

The first row is split as follows:

Text                                   New_Text (to show the difference after cleaning the text)
Include details about your goal    ['include','details','goal']

I have no other rows (so no rows starting with 'Describe...' or 'Avoid...'). In my dataset, I have 1942 rows, but only 1 is returned after cleaning the text.
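
A minimal sketch for narrowing this down, assuming the rows are lost in one of the cleaning steps rather than in explode itself: apply the steps one at a time and record the row count after each. The helper name `apply_steps` and the sample series (including the missing value) are hypothetical and only for illustration.

```python
import pandas as pd


def apply_steps(series, steps):
    """Apply each (name, function) step in order and record the row count after it."""
    counts = [("start", len(series))]
    for name, func in steps:
        series = func(series)
        counts.append((name, len(series)))
    return series, counts


# Hypothetical sample: three texts from the question plus one missing value.
text = pd.Series([
    "Include details about your goal",
    None,
    "Include any error messages",
    "The community is here to help you ",
])

steps = [
    ("remove digits", lambda s: s.str.replace(r"\d+", "", regex=True)),
    ("collapse whitespace", lambda s: s.str.replace(r"\s+", " ", regex=True)),
    ("strip", lambda s: s.str.strip()),
    ("lowercase", lambda s: s.str.lower()),
    ("dropna", lambda s: s.dropna()),
]

result, counts = apply_steps(text, steps)
for name, n in counts:
    print(f"{name:20s} {n}")
```

If the count falls sharply at one step, that step (or the data it receives) is the place to look.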

Update:

Example of output:

    Date          Text
    2020/05/12    Include
    2020/05/12    details
    2020/05/12    goal
    ...           ...

Fixed issue (now it should work):

I think the below code should allow me to get this result:

    (pd.melt(test.Text.apply(pd.Series).reset_index(),
             id_vars=['Date'],
             value_name='Text')
       .set_index(['Date'])
       .drop('variable', axis=1)
       .dropna()
       .sort_index()
    )

This assumes Date is already the index. To convert Date to an index: test = test.set_index(['Date'])

1 Answer

I have revised the code again because the question was updated. It produces your desired output: the Date column and the words expanded vertically, one word per row.

    import io

    import pandas as pd

    data = '''
    Date Text
    2020/05/12 "Include details about your goal"
    2020/05/12 "Describe expected and actual results"
    2020/05/13 "Include any error messages"
    2020/05/13 "The community is here to help you"
    2020/05/14 "Avoid asking opinion-based questions."
    '''

    test = pd.read_csv(io.StringIO(data), sep=r'\s+')
    test.set_index('Date', inplace=True)

    # one column per word; shorter sentences are padded with missing values
    expand_df = test['Text'].str.split(' ', expand=True)
    expand_df.reset_index(inplace=True)

    # melt every word column (a fixed np.arange(6) would drop the 7th word, "you")
    expand_df = pd.melt(expand_df, id_vars='Date', value_name='text')
    expand_df.dropna(axis=0, inplace=True)  # remove the padding rows
    expand_df = expand_df[['Date', 'text']]
    expand_df

Output:

              Date           text
    0   2020/05/12        Include
    1   2020/05/12       Describe
    2   2020/05/13        Include
    3   2020/05/13            The
    4   2020/05/14          Avoid
    5   2020/05/12        details
    6   2020/05/12       expected
    7   2020/05/13            any
    8   2020/05/13      community
    9   2020/05/14         asking
    10  2020/05/12          about
    11  2020/05/12            and
    12  2020/05/13          error
    13  2020/05/13             is
    14  2020/05/14  opinion-based
    15  2020/05/12           your
    16  2020/05/12         actual
    17  2020/05/13       messages
    18  2020/05/13           here
    19  2020/05/14     questions.
    20  2020/05/12           goal
    21  2020/05/12        results
    23  2020/05/13             to
    28  2020/05/13           help
    33  2020/05/13            you
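
As an alternative, here is a minimal sketch that chains the cleaning steps from the question with `DataFrame.explode`, which keeps each Date next to its words without melting. Two things here are assumptions, not part of the original code: `STOP_WORDS` is a tiny hypothetical stand-in for `stopwords.words('english')` so the sketch runs without the NLTK corpus, and `re.escape` is added so that every punctuation character is treated literally.

```python
import re
import string

import pandas as pd

# Hypothetical stand-in for nltk's stopwords.words('english').
STOP_WORDS = {"about", "your", "and", "any", "the", "is", "here", "to", "you"}

df = pd.DataFrame({
    "Date": ["2020/05/12", "2020/05/12", "2020/05/13", "2020/05/13", "2020/05/14"],
    "Text": [
        "Include details about your goal",
        "Describe expected and actual results",
        "Include any error messages",
        "The community is here to help you ",
        "Avoid asking opinion-based questions.",
    ],
})

# Character class matching all punctuation except the apostrophe.
punc = "[{}]".format(re.escape(string.punctuation.replace("'", "")))

cleaned = (
    df["Text"]
    .str.replace(r"\d+", "", regex=True)    # remove numbers
    .str.replace(punc, " ", regex=True)     # punctuation becomes a space
    .str.replace(r"\s+", " ", regex=True)   # collapse repeated whitespace
    .str.strip()
    .str.lower()
)

# One list of tokens per row, then one row per token.
tokens = cleaned.apply(lambda s: [w for w in s.split() if w not in STOP_WORDS])
exploded = df.assign(Text=tokens).explode("Text").reset_index(drop=True)
print(exploded)
```

Because the hyphen is replaced by a space, "opinion-based" ends up as two tokens here; keep the hyphen out of `punc` if that is not what you want.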

11 Comments

After cleaning df.Text I have tokens, e.g. ["Include", "details", "goal"], and the same for the other rows. Could you please show me all the steps (the ones I wrote in my question), just to be sure I am applying df.Text.str.split at the right step? Thanks.
The results of the df.Text.str.split() are pasted from the data at the top of the question. I can't show you the procedure without the data before the conversion.
The problem is that this works fine for the first row but, unfortunately, not for the others.
How about adding the first string to the question?
Fixed the code. Is the code you wrote the code that was added?