Exploding columns

Question

I have the following dataset:

Date            Text
2020/05/12    Include details about your goal
2020/05/12    Describe expected and actual results
2020/05/13    Include any error messages
2020/05/13    The community is here to help you 
2020/05/14    Avoid asking opinion-based questions.

I removed punctuation, stopwords, etc. in order to prepare it for exploding:

    import string
    from nltk.corpus import stopwords

    stop_words = stopwords.words('english')

    # punctuation to remove
    punctuation = string.punctuation.replace("'", '')  # don't remove apostrophe from strings
    punc = r'[{}]'.format(punctuation)

    df.Text = df.Text.str.replace(r'\d+', '', regex=True)  # remove numbers
    df.Text = df.Text.str.replace(punc, ' ', regex=True)  # remove punctuation except apostrophe
    df.Text = df.Text.str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace into one space
    df.Text = df.Text.str.strip()  # remove whitespace from beginning and end of string
    df.Text = df.Text.str.lower()  # convert all to lowercase
    df.dropna(inplace=True)
    df.Text = df.Text.apply(lambda x: list(word for word in x.split() if word not in stop_words))  # remove stop words

However it works only for the first row, and not for all the rows. Next step would be

    df_1 = df.explode('Text')

Can you please tell me what is wrong?

The first row is split as follows:

Text                                   New_Text (to show the difference after cleaning the text)
Include details about your goal    ['include','details','goal']

I have no other rows (so no rows starting with 'Describe...' or 'Avoid...'). In my dataset, I have 1942 rows, but only 1 is returned after cleaning the text.

Update:

Example of output:

    Date          Text
    2020/05/12    Include
    2020/05/12    details
    2020/05/12    goal
    ...           ...

Fixed issue (now it should work):

I think the below code should allow me to get this result:

    (pd.melt(test.Text.apply(pd.Series).reset_index(),
             id_vars=['Date'],
             value_name='Text')
     .set_index(['Date'])
     .drop('variable', axis=1)
     .dropna()
     .sort_index()
    )

To convert Date to an index: test=test.set_index(['Date'])
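
For reference, here is a minimal self-contained sketch of the snippet above. It assumes `test` already holds token lists in `Text` and has `Date` as its index; the two sample rows and the names `wide` and `long_df` are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample: Text already holds token lists, Date is the index
test = pd.DataFrame({
    "Date": ["2020/05/12", "2020/05/12"],
    "Text": [["include", "details", "goal"],
             ["describe", "expected", "actual", "results"]],
}).set_index("Date")

# One column per token position; shorter lists are padded with NaN
wide = test.Text.apply(pd.Series)

# Stack the token columns vertically, then drop the helper column and the padding
long_df = (pd.melt(wide.reset_index(), id_vars=["Date"], value_name="Text")
           .set_index("Date")
           .drop("variable", axis=1)
           .dropna()
           .sort_index())

print(long_df)
```

Note that `melt` stacks column by column, so the words of one sentence are not guaranteed to stay next to each other in the result.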

r-beginners · Accepted Answer · 2020-09-13 12:31:50Z


I have revised the code again because the question was updated. To match your desired output, the words are expanded vertically so that each row holds a date and a single word.

import pandas as pd
import io

data = '''
Date Text
2020/05/12 "Include details about your goal"
2020/05/12 "Describe expected and actual results"
2020/05/13 "Include any error messages"
2020/05/13 "The community is here to help you"
2020/05/14 "Avoid asking opinion-based questions."
'''

test = pd.read_csv(io.StringIO(data), sep=r'\s+')
test.set_index('Date', inplace=True)
# one column per word position; shorter sentences are padded with None
expand_df = test['Text'].str.split(' ', expand=True)
expand_df.reset_index(inplace=True)
# melt every word column (value_vars omitted) so that no word position is dropped
expand_df = pd.melt(expand_df, id_vars='Date', value_name='text')
expand_df.dropna(axis=0, inplace=True)
expand_df = expand_df[['Date', 'text']]
expand_df

    Date        text
0   2020/05/12  Include
1   2020/05/12  Describe
2   2020/05/13  Include
3   2020/05/13  The
4   2020/05/14  Avoid
5   2020/05/12  details
6   2020/05/12  expected
7   2020/05/13  any
8   2020/05/13  community
9   2020/05/14  asking
10  2020/05/12  about
11  2020/05/12  and
12  2020/05/13  error
13  2020/05/13  is
14  2020/05/14  opinion-based
15  2020/05/12  your
16  2020/05/12  actual
17  2020/05/13  messages
18  2020/05/13  here
19  2020/05/14  questions.
20  2020/05/12  goal
21  2020/05/12  results
23  2020/05/13  to
28  2020/05/13  help
33  2020/05/13  you
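
If you would rather keep the words of each sentence together, `DataFrame.explode`, which the question was aiming for, may be a simpler route. The following is only a sketch under assumptions: the two sample rows are shortened from the question, and the small `stop_words` set is a hypothetical stand-in for `stopwords.words('english')`.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2020/05/12", "2020/05/13"],
    "Text": ["Include details about your goal", "Include any error messages"],
})

# Hypothetical stop-word set, standing in for nltk's stopwords.words('english')
stop_words = {"about", "your", "any"}

# Lowercase, split on whitespace, then filter each token list
tokens = df["Text"].str.lower().str.split()
df["Text"] = tokens.apply(lambda words: [w for w in words if w not in stop_words])

# One row per token; the Date value is repeated for every token of its sentence
exploded = df.explode("Text").reset_index(drop=True)
print(exploded)
```

Here every row of `df` must contain a list before `explode` is called, so it is worth checking `len(df)` after each cleaning step to see where rows go missing.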

edited Sep 13, 2020 at 12:31

answered Sep 11, 2020 at 7:05

r-beginners


11 Comments

still_learning Over a year ago

After cleaning df.Text I have tokens, e.g. ["Include", "details", "goal"]. Same for the other rows. Could you please show me all the steps (the ones I wrote in my question), just to be sure I am applying df.Text.str.split at the right step? Thanks

r-beginners Over a year ago

The results of the df.Text.str.split() are pasted from the data at the top of the question. I can't show you the procedure without the data before the conversion.

still_learning Over a year ago

The problem is that this works fine for the first row but, unfortunately, not for the others.

r-beginners Over a year ago

How about adding the first string to the question?

r-beginners Over a year ago

Fixed the code. Is the code you wrote the code that was added?
