Problem Context

I am trying to create a chat log dataset from WhatsApp chats. To give some context: let a message be M and a response be R. Chats do not always alternate strictly between the two; they tend to look like this:

[ M, M, M, R, R, M, M, R, R, M ... and so on]

I am trying to concatenate consecutive runs of M's and R's. For the above example, I want an output like this:

Desired Output

[ "M M M", "R R", "M M", "R R", "M ...", and so on ]

An Example of Realistic Data:

Input --> ["M: Hi", "M: How are you?", "R: Heyy", "R: Im cool", "R: Wbu?"] (length=5)

Output --> ["M: Hi M: How are you?", "R: Heyy R: Im cool R: Wbu?"] (length = 2)

Is there a faster and more efficient way of doing this? I have already read this Stack Overflow link, but I didn't find a solution there.

So far, this is what I have tried.

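final = []
continuous_string = ''
for i, ele in enumerate(chats):
    if i > 0:
        prev = chats[i-1][0]
        current = ele[0]

        if current != prev:
            # Speaker changed: store the finished run and start a new one
            final.append(continuous_string)
            continuous_string = ''
        else:
            # Same speaker: separate consecutive messages with a space
            continuous_string += ' '
    continuous_string += ele

# Flush the last run, which is never followed by a speaker change
if continuous_string:
    final.append(continuous_string)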

Explanation of my code: I have a chats list in which every message starts with 'M' and every response starts with 'R'. I keep track of the prev and current values in the list, and when there is a change (a transition from M -> R or R -> M), I append everything collected in continuous_string to the final list.

Again, my question is: is there a shortcut or a built-in function in Python that does the same thing in fewer lines?

  • Why are you doing + '. ' if there is no . in the desired output? Commented Feb 24, 2019 at 13:21
  • Ahh! Those are just messages which I want to concatenate with ". ". For the sake of the problem, their presence is irrelevant. Thanks for pointing out. I will make an edit! @Sanya Commented Feb 24, 2019 at 13:25
  • Please add some realistic sample data to your question so that people will stop posting useless answers that only work with the letters "M" and "R". Commented Feb 24, 2019 at 13:38
  • @TrebuchetMS Yes sir. Please look at the edit. Commented Feb 24, 2019 at 13:59

2 Answers


You can use the function groupby():

from itertools import groupby

l = ['A', 'A', 'B', 'B']

[' '.join(g) for _, g in groupby(l)]
# ['A A', 'B B']

To group the data from your example, you need to pass a key to the groupby() function:

l = ["M: Hi", "M: How are you?", "R: Heyy", "R: Im cool", "R: Wbu?"]

[' '.join(g) for _, g in groupby(l, key=lambda x: x[0])]
# ['M: Hi M: How are you?', 'R: Heyy R: Im cool R: Wbu?']

As @TrebuchetMS mentioned in the comments, the key lambda x: x.split(':')[0] might be more reliable, because it compares the whole speaker name rather than only its first character. It depends on your data.
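
To make the difference between the two keys concrete, here is a minimal sketch. The chat lines are made up for illustration, and it assumes every line has the form "name: text":

```python
from itertools import groupby

# Hypothetical data: two different speakers whose names share a first letter.
chats = ["Megan: Hi", "Megan: You there?", "Max: Yep", "Max: What's up?"]

# Keying on the first character merges both speakers into one run.
by_first_char = [' '.join(g) for _, g in groupby(chats, key=lambda x: x[0])]

# Keying on the text before the first colon keeps the speakers apart.
by_name = [' '.join(g) for _, g in groupby(chats, key=lambda x: x.split(':')[0])]

print(by_first_char)
# ["Megan: Hi Megan: You there? Max: Yep Max: What's up?"]
print(by_name)
# ['Megan: Hi Megan: You there?', "Max: Yep Max: What's up?"]
```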

5 Comments

I have edited the question a bit. Can you show how to handle those changes, please?
@Satya I added the solution for your realistic data.
Maybe x.partition(':')[0] or x.split(':')[0] in the lambda might be more reliable for data where the first letter is the same for different users. E.g. ["Megan: .", "Max: ."].
@MykolaZotko sorry I'm troubling you but can you give a brief explanation of how groupby works? Even a link which could explain it properly would be alright.
@Satya You get an iterator which returns consecutive keys and groups. It is like a dict where the values are the groups. For groupby('abbbcc') you get an iterator which looks like {'a': ['a'], 'b': ['b', 'b', 'b'], 'c': ['c', 'c']} (the lists in this example are actually iterators in a groupby object).
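
A short sketch of what the comment above describes, using only itertools.groupby. The second input string is an extra illustration (not from the thread) showing where the dict analogy stops holding:

```python
from itertools import groupby

# groupby yields (key, group) pairs for each run of consecutive equal keys.
# Each group is a lazy iterator, so it is turned into a list here for inspection.
runs = [(key, list(group)) for key, group in groupby('abbbcc')]
print(runs)
# [('a', ['a']), ('b', ['b', 'b', 'b']), ('c', ['c', 'c'])]

# Unlike a dict, a key can appear more than once if its runs are not adjacent.
repeated = [(key, list(group)) for key, group in groupby('aabaa')]
print(repeated)
# [('a', ['a', 'a']), ('b', ['b']), ('a', ['a', 'a'])]
```

This run-based behaviour is what makes groupby fit the chat problem: the same speaker starts a new group every time they speak again after someone else.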

Algorithm

  • Initialize a temporary item. This will help determine if the speaker has changed
  • For each item
    • Extract the speaker
    • If it's the same, append to the text of the last item of the array
    • Else append a new item in the list containing the speaker and text

Implementation

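def parse(x):
    # Split on the first colon only, so colons inside the text are preserved
    speaker, _, text = x.partition(':')
    return speaker, text.strip()


def compress(l):
    ans = []
    prev = None
    for x in l:
        curr, text = parse(x)
        if curr != prev:
            # New speaker: start a new item
            prev = curr
            ans.append(x)
        else:
            # Same speaker: append the text to the last item
            ans[-1] += f' {text}'
    return ans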

Character names

IN:  ["M: Hi", "M: How are you?", "R: Heyy", "R: Im cool", "R: Wbu?"]
OUT: ['M: Hi How are you?', 'R: Heyy Im cool Wbu?']

String names

IN:  ["Mike: Hi", "Mike: How are you?", "Mary: Heyy", "Mary: Im cool", "Mary: Wbu?"]
OUT: ['Mike: Hi How are you?', 'Mary: Heyy Im cool Wbu?']
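
For comparison, the same output format can probably also be produced with itertools.groupby. This is only a sketch under the assumption that every line looks like "name: text"; speaker and compress_grouped are hypothetical names introduced here:

```python
from itertools import groupby

def speaker(message):
    # Hypothetical helper: everything before the first colon is the speaker.
    return message.partition(':')[0]

def compress_grouped(messages):
    result = []
    for name, group in groupby(messages, key=speaker):
        # Drop the "name:" prefix from each message and join the texts of the run.
        texts = [m.partition(':')[2].strip() for m in group]
        result.append(f"{name}: {' '.join(texts)}")
    return result

print(compress_grouped(["Mike: Hi", "Mike: How are you?", "Mary: Heyy", "Mary: Im cool", "Mary: Wbu?"]))
# ['Mike: Hi How are you?', 'Mary: Heyy Im cool Wbu?']
```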
