I have a dataframe in which I'd like to duplicate every row that meets a certain condition. A row should be duplicated if 'Date' is Q4.22 or later AND 'type' is 'live'. For every duplicate created, the 'unit' counter should be updated so that it continues the existing count within the same id and Date.

Data

   id   Date    set     type    unit    energy
   bb   Q4.22   l       live    l01     20
   bb   Q4.22   l       live    l02     20
   ba   Q3.22   l       non     l01     20
   aa   Q4.22   l       non     l01     20
   aa   Q4.22   l       live    l01     20
   cc   Q3.22   l       non     l01     20
   aa   Q4.22   l       live    l02     20
                

Desired

   id   Date    set     type    unit    energy
   bb   Q4.22   l       live    l01     20
   bb   Q4.22   l       live    l02     20
   bb   Q4.22   l       live    l03     20
   bb   Q4.22   l       live    l04     20
   ba   Q3.22   l       non     l01     20
   aa   Q4.22   l       non     l01     20
   aa   Q4.22   l       live    l01     20
   aa   Q4.22   l       live    l02     20
   cc   Q3.22   l       non     l01     20
   aa   Q4.22   l       live    l03     20
   aa   Q4.22   l       live    l04     20

Doing

pd.concat([df, df.loc[(df['Date'] >= 'Q4.22') & (df['type'] == 'live')]])

However, I still need to add a counter to the new duplicates that are created. Any suggestion is appreciated.

  • To create the duplicate, the following condition should be met: the Date column should be Q4.22 or later and the type column should contain the string 'live'. Commented May 11, 2022 at 20:28
  • how is 'unit' obtained in the first place and why is it a string f'l{i:02d}'? Commented May 11, 2022 at 20:51
  • @PierreD it is a combination of the set 'l', the type, and then the count for a given id and quarter (updated to lower case l) Commented May 11, 2022 at 20:52
  • There are several problems with this. I would have left unit as a numerical value. Likewise, I would have left the Date as a real date, or a Period. Selecting rows with a string comparison against 'Q4.22' does not generalize, because strings are compared lexicographically. Imagine your comparison period was Q1 2022 instead: 'Q3.1900' >= 'Q1.2022' is True, but Period('1900Q3') < Period('2022Q1'). Commented May 11, 2022 at 21:03
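
To make the point in the last comment concrete, here is a minimal sketch comparing the two orderings. It assumes pandas parses labels such as '1900Q3' as quarterly periods; the variable names are only for illustration.

```python
import pandas as pd

# Plain strings compare character by character, so 'Q3...' sorts after
# 'Q1...' regardless of the year that follows.
string_says_later = 'Q3.1900' >= 'Q1.2022'

# Periods compare chronologically, so 1900 comes before 2022.
period_says_earlier = pd.Period('1900Q3') < pd.Period('2022Q1')

print(string_says_later, period_says_earlier)
```

Both expressions evaluate to True, which is exactly the contradiction: the string test treats Q3 1900 as "later" than Q1 2022.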

2 Answers

1

Try:

  1. Convert your "Date" column to timestamps.
  2. concat your original data with the filtered data.
  3. Use groupby on "id" and "Date" with cumcount to number the rows, and set "unit" accordingly.
# parse the quarter labels ("Q4.22" -> "22-Q4") into timestamps
df["Date"] = pd.to_datetime(df["Date"].str.replace(r"(Q\d)\.(\d+)", r"\2-\1", regex=True))

# append a copy of every 'live' row dated Q4 2022 or later
mask = df["Date"].ge(pd.Timestamp("2022-10-01")) & df["type"].eq("live")
output = pd.concat([df, df[mask]], ignore_index=True)
# renumber the units within each (id, Date) group: l01, l02, ...
output["unit"] = output["set"] + output.groupby(["id", "Date"]).cumcount().add(1).astype(str).str.zfill(2)
output = output.sort_values("id", ignore_index=True)

# convert Date back to the original format if needed
output["Date"] = output["Date"].dt.to_period("Q").astype(str).str.replace(r"\d\d(\d+)(Q\d)", r"\2.\1", regex=True)

>>> output
    id   Date set  type unit  energy
0   aa  Q4.22   l   non  l01      20
1   aa  Q4.22   l  live  l02      20
2   aa  Q4.22   l  live  l03      20
3   aa  Q4.22   l  live  l04      20
4   aa  Q4.22   l  live  l05      20
5   ba  Q3.22   l   non  l01      20
6   bb  Q4.22   l  live  l01      20
7   bb  Q4.22   l  live  l02      20
8   bb  Q4.22   l  live  l03      20
9   bb  Q4.22   l  live  l04      20
10  cc  Q3.22   l   non  l01      20
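
Note that the output above numbers units per (id, Date), so for aa the 'non' row takes l01 and the live rows run l02 to l05, whereas the Desired table in the question restarts the count for each type. Below is a minimal sketch of that variant. It assumes the counter should also be grouped by type and that two-digit years mean 20xx; both are assumptions on my part, not something the question states outright. The frame is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["bb", "bb", "ba", "aa", "aa", "cc", "aa"],
    "Date": ["Q4.22", "Q4.22", "Q3.22", "Q4.22", "Q4.22", "Q3.22", "Q4.22"],
    "set": ["l"] * 7,
    "type": ["live", "live", "non", "non", "live", "non", "live"],
    "unit": ["l01", "l02", "l01", "l01", "l01", "l01", "l02"],
    "energy": [20] * 7,
})

# "Q4.22" -> "2022Q4" (assumption: every two-digit year is 20xx)
labels = df["Date"].str.replace(r"^Q(\d)\.(\d{2})$", r"20\2Q\1", regex=True)
quarters = pd.PeriodIndex(labels.tolist(), freq="Q")

# rows to duplicate: Q4 2022 or later AND type == 'live'
is_recent = pd.Series(quarters >= pd.Period("2022Q4", freq="Q"), index=df.index)
mask = is_recent & df["type"].eq("live")

output = pd.concat([df, df[mask]], ignore_index=True)

# restart the counter for every (id, Date, type) group
counter = output.groupby(["id", "Date", "type"]).cumcount().add(1)
output["unit"] = output["set"] + counter.astype(str).str.zfill(2)

print(output.sort_values(["id", "type", "unit"], ignore_index=True))
```

With the sample data, aa ends up with one 'non' row at l01 and four live rows l01 to l04, which matches the Desired table up to row order.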

1

First, as noted in the comments, we need to convert some of the df columns into more convenient types:

  • int for unit (stripping any chars),
  • pd.Period for the Date.
df2 = df.assign(
    unit=df['unit'].str.extract(r'(\d+)').astype(int),
    period=df['Date'].str.replace(r'^(Q\d)\D*(\d+)$', r'\2\1', regex=True).apply(pd.Period)
)

>>> df2
   id   Date set  type  unit  energy  period
0  bb  Q4.22   l  live     1      20  2022Q4
1  bb  Q4.22   l  live     2      20  2022Q4
2  ba  Q3.22   l   non     1      20  2022Q3
3  aa  Q4.22   l   non     1      20  2022Q4
4  aa  Q4.22   l  live     1      20  2022Q4
5  cc  Q3.22   l   non     1      20  2022Q3
6  aa  Q4.22   l  live     2      20  2022Q4

>>> df2.dtypes
id               object
Date             object
set              object
type             object
unit              int64
energy            int64
period    period[Q-DEC]
dtype: object

With this done, now we can proceed with the logic of the question itself.

ix_repeat = (df2['period'] >= pd.Period('2022-Q4')) & (df2['type'] == 'live')
r = df2.loc[ix_repeat]
# continue numbering after the current max unit of each (id, period) group
r = r.assign(unit=r['unit'] + r.groupby(['id', 'period'])['unit'].transform('max'))

>>> r
   id   Date set  type  unit  energy  period
0  bb  Q4.22   l  live     3      20  2022Q4
1  bb  Q4.22   l  live     4      20  2022Q4
4  aa  Q4.22   l  live     3      20  2022Q4
6  aa  Q4.22   l  live     4      20  2022Q4

# finally
df2 = pd.concat([df2, r])

Optional: bringing back unit into its weird string version:

df2 = df2.assign(unit=df2['set'] + df2['unit'].astype(str).str.zfill(2))

>>> df2
   id   Date set  type unit  energy  period
0  bb  Q4.22   l  live  l01      20  2022Q4
1  bb  Q4.22   l  live  l02      20  2022Q4
2  ba  Q3.22   l   non  l01      20  2022Q3
3  aa  Q4.22   l   non  l01      20  2022Q4
4  aa  Q4.22   l  live  l01      20  2022Q4
5  cc  Q3.22   l   non  l01      20  2022Q3
6  aa  Q4.22   l  live  l02      20  2022Q4
0  bb  Q4.22   l  live  l03      20  2022Q4
1  bb  Q4.22   l  live  l04      20  2022Q4
4  aa  Q4.22   l  live  l03      20  2022Q4
6  aa  Q4.22   l  live  l04      20  2022Q4
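
One thing to be aware of in the result above: pd.concat keeps the original index labels, so 0, 1, 4 and 6 each appear twice. Here is a minimal sketch of one way to tidy that up, assuming a fresh index and rows ordered by id and unit are wanted. The frames base and dupes are small hypothetical stand-ins for df2 and r (with integer units), not the real data.

```python
import pandas as pd

# Small stand-ins for df2 and r, with integer units.
base = pd.DataFrame({"id": ["bb", "bb", "ba"], "unit": [1, 2, 1]})
dupes = pd.DataFrame({"id": ["bb", "bb"], "unit": [3, 4]}, index=[0, 1])

# Plain concat: index labels 0 and 1 now occur twice.
stacked = pd.concat([base, dupes])

# Fresh 0..n-1 index, with the duplicates placed next to their group.
tidy = (
    pd.concat([base, dupes], ignore_index=True)
    .sort_values(["id", "unit"], ignore_index=True)
)

print(stacked.index.is_unique, tidy.index.is_unique)
```

A unique index matters if you later select rows with .loc, since a repeated label returns several rows instead of one.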
