I have a pandas DataFrame as shown here. The frame has many more columns, but they are not relevant to the task.

id    pos      value       sente
1     a         I           21
2     b         have        21
3     b         a           21
4     a         cat         21
5     d         !           21
1     a         My          22
2     a         cat         22
3     b         is          22
4     a         cute        22
5     d         .           22
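
For anyone who wants to reproduce this, the sample above can be rebuilt roughly as follows. This is only a sketch: it assumes `id` and `sente` are stored as integers, which the table does not confirm, and it leaves out the other columns.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: "id" and "sente" are integer columns; the real frame has more columns.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "pos": ["a", "b", "b", "a", "d", "a", "a", "b", "a", "d"],
    "value": ["I", "have", "a", "cat", "!", "My", "cat", "is", "cute", "."],
    "sente": [21] * 5 + [22] * 5,
})
```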

I would like to build a list from certain columns so that the first sentence (sente=21), and every other sentence, looks something like the following. That is, every sentence gets its own entry.

`[('I', 'a', '1'), ..., ('!','d','5')]`

I already have a function that does this for one sentence, but I cannot figure out how to do it for all sentences (rows that share the same sente value) in the frame.

`class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False

    def get_next(self):
        try:
            # Only handles a single, hard-coded sentence (sente == 21)
            s = self.data[self.data["sente"] == 21]
            self.n_sent += 1
            return (s["id"].values.tolist(),
                    s["pos"].values.tolist(),
                    s["value"].values.tolist())
        except KeyError:
            # A required column is missing from the frame
            self.empty = True
            return None, None, None

foo = SentenceGetter(df)
sent, pos, token = foo.get_next()
# "in" is a reserved word in Python, so the result needs another name
result = list(zip(token, pos, sent))
`

As my frame is very large, constructions like this are not an option:

df.loc[((df["sente"] == df["sente"].shift(-1)) & (df["sente"] == df["sente"].shift(+1))), ["pos","value","id"]]

Any ideas?

3 Answers

If you are open to using the standard library, collections.defaultdict offers an O(n) solution:

from collections import defaultdict

d = defaultdict(list)

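# Each row unpacks as (index, sente, value, pos, id); the index is discarded.
for _, num, *data in df[['sente', 'value', 'pos', 'id']].itertuples():
    d[num].append(tuple(data))  # tuple() so the entries match the result below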

Result:

defaultdict(list,
            {21: [('I', 'a', 1),
                  ('have', 'b', 2),
                  ('a', 'b', 3),
                  ('cat', 'a', 4),
                  ('!', 'd', 5)],
             22: [('My', 'a', 1),
                  ('cat', 'a', 2),
                  ('is', 'b', 3),
                  ('cute', 'a', 4),
                  ('.', 'd', 5)]})
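
If a plain list with one entry per sentence is wanted, and the ids should be strings as in the question's example, the same loop can be adapted. This is a sketch under a few assumptions: the sample frame is rebuilt here so the snippet runs on its own, `sentences` is just an illustrative name, and `itertuples(index=False)` is used so there is no index to discard.

```python
from collections import defaultdict

import pandas as pd

# Sample frame rebuilt from the question (assumed integer "id" and "sente").
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "pos": ["a", "b", "b", "a", "d", "a", "a", "b", "a", "d"],
    "value": ["I", "have", "a", "cat", "!", "My", "cat", "is", "cute", "."],
    "sente": [21] * 5 + [22] * 5,
})

d = defaultdict(list)

# index=False drops the index, so each row is exactly (sente, value, pos, id).
for sente, value, pos, id_ in df[["sente", "value", "pos", "id"]].itertuples(index=False):
    d[sente].append((value, pos, str(id_)))

# Dicts keep insertion order, so sentences appear in order of first occurrence.
sentences = list(d.values())
```

Here `sentences[0]` holds the tuples for sente=21 and `sentences[1]` those for sente=22.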

You can also use groupby and apply functions.

Method 1: It gives a data frame

(df
 .groupby('sente')
 .apply(lambda df: list(tuple(x) for x in df[['value','pos','id']].values))
 .reset_index()
 .rename(columns={0: 'values'}))

   sente                                             values
0     21  [(I, a, 1), (have, b, 2), (a, b, 3), (cat, a, ...
1     22  [(My, a, 1), (cat, a, 2), (is, b, 3), (cute, a...

Method 2: It gives a dictionary

(df
 .groupby('sente')
 .apply(lambda g: list(tuple(x) for x in g[['value','pos','id']].values))
 .to_dict())
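
A variant that skips `apply` is to iterate over the groups and zip the columns directly. This is only a sketch, not a benchmarked alternative: the sample frame is rebuilt so the snippet is self-contained, `result` is an illustrative name, and `sort=False` is my choice to keep sentences in their original order.

```python
import pandas as pd

# Sample frame rebuilt from the question (assumed integer "id" and "sente").
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "pos": ["a", "b", "b", "a", "d", "a", "a", "b", "a", "d"],
    "value": ["I", "have", "a", "cat", "!", "My", "cat", "is", "cute", "."],
    "sente": [21] * 5 + [22] * 5,
})

# Iterating a groupby yields (key, sub-frame) pairs; zip builds the row tuples.
result = {
    key: list(zip(group["value"], group["pos"], group["id"]))
    for key, group in df.groupby("sente", sort=False)
}
```

`list(result.values())` then gives one list of tuples per sentence.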


Essentially the same as @YOLO's answer

def f(df):
    s = df[['value','pos','id']].apply(tuple, axis=1)
    return s.tolist()
g = df.groupby('sente')
q = g.apply(f)

>>> type(q)
<class 'pandas.core.series.Series'>
>>> q[21]
[('I', 'a', 1), ('have', 'b', 2), ('a', 'b', 3), ('cat', 'a', 4), ('!', 'd', 5)]
>>> q[22]
[('My', 'a', 1), ('cat', 'a', 2), ('is', 'b', 3), ('cute', 'a', 4), ('.', 'd', 5)]

>>> q.tolist()
[[('I', 'a', 1), ('have', 'b', 2), ('a', 'b', 3), ('cat', 'a', 4), ('!', 'd', 5)], [('My', 'a', 1), ('cat', 'a', 2), ('is', 'b', 3), ('cute', 'a', 4), ('.', 'd', 5)]]
>>>
>>> q.to_dict()
{21: [('I', 'a', 1), ('have', 'b', 2), ('a', 'b', 3), ('cat', 'a', 4), ('!', 'd', 5)], 22: [('My', 'a', 1), ('cat', 'a', 2), ('is', 'b', 3), ('cute', 'a', 4), ('.', 'd', 5)]}
>>>
