0

I have multiple dataframes stored in a dictionary.

Each dataframe has 3 columns as shown below

exceldata_1['Sheet1']

                      0                     1       2

0   Sv2.55+Fv2.04R02[2022-01-01T00         16      29.464Z]
1   - SC                                   OK       NaN
2   - PC1 Number                            1       NaN
3   - PC1 Main Status                      OK       NaN
4   - PC1 PV                       4294954868       NaN
... ... ... ...
1046    - C Temperature                  17°C        NaN
1047    Sv2.55+Fv2.04R02[2022-01-01T23     16       30.782Z]
1048    - Level SS             High      NaN
1049    Sv2.55+Fv2.04R02[2022-01-01T23  16  34.235Z]
1050    Sv2.55+Fv2.04R02[2022-01-01T23  16  38.657Z]
1051 rows × 3 columns

I want to do the following: search each row of the dataframe for "Sv2." and, where it is found, remove the "Sv2.55+Fv2.04R02[" prefix and combine the remaining data so that the date and the time end up correctly formatted in the first two columns. The desired output is shown below. The last column can be deleted, as it will not contain any data after this operation.

           0                    1                 2

0     2022-01-01               00:16:29          NaN
1   - SC                          OK             NaN
2   - PC1 Number                  1              NaN
3   - PC1 Main Status            OK              NaN
4   - PC1 PV                   4294954868        NaN
... ... ... ...
1046    - C Temperature           17°C           NaN
1047    2022-01-01             23:16:30          NaN
1048    - Level SS               High            NaN
1049    2022-01-01             23:16:34          NaN
1050    2022-01-01             23:16:38          NaN
1051 rows × 3 columns

How can I achieve this?

5
  • Could you please clarify your input format a bit more: Are those parts Unnamed: 0 the index of the dataframe, or is it actually part of column 0 and you want to contain it in column 0? Best would be if you do a df.head(10).to_dict() or so and add it to the question, to avoid confusion. Commented Aug 25, 2022 at 9:19
  • @Timus Hie...I have edited the data..Please check Commented Aug 25, 2022 at 10:36
  • Thanks. I'm still a bit unsure regarding the format, but I have updated my first attempt (which only included the date part, and not the time). Commented Aug 25, 2022 at 11:25
  • Are the relevant string parts distributed over the 3 columns? (See my last edit.) Commented Aug 25, 2022 at 12:09
  • @Timus Yes..they are distributed over three columns Commented Aug 25, 2022 at 12:21

3 Answers

1

Using regular expressions should work:

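import re

for i in range(len(df)):
    text = df['0'][i]
    # Only rows that carry the "Sv..." timestamp prefix are rewritten
    if re.search('Sv', text) is not None:
        # Assumes the parts are separated by two spaces; text[:-1] drops the trailing "]"
        item_list = re.split(r'\[|T|\s\s|Z', text[:-1])
        df.iloc[i, 0] = item_list[1]
        df.iloc[i, 1] = item_list[2] + ':' + item_list[3] + ':' + item_list[4]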
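As a rough illustration of what the split returns, here is a minimal sketch. It assumes two possible shapes for the input: the whole timestamp in one cell with two spaces between the parts, or only the text of the first column. Both sample strings are modelled on the question's data, and the variable names are just for illustration.

```python
import re

pattern = r"\[|T|\s\s|Z"

# Assumed case 1: the whole timestamp sits in one cell, parts separated by two spaces.
full = "Sv2.55+Fv2.04R02[2022-01-01T00  16  29.464Z]"
parts_full = re.split(pattern, full[:-1])  # [:-1] drops the trailing "]"
print(parts_full)

# Assumed case 2: the cell only holds the first column's text.
partial = "Sv2.55+Fv2.04R02[2022-01-01T00"
parts_partial = re.split(pattern, partial[:-1])  # [:-1] now cuts off a digit
print(parts_partial)
```

In the second case the list has only three items, so indexing item_list[3] raises an IndexError.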

6 Comments

Hi @Irsyaduddin, glad you answered so quickly. I could not understand "text" in re.search() and re.split(). Can you please elaborate if possible? I tried to implement it anyway, but it gives me "NameError: 'text' is not defined".
Oh, I'm sorry, the text is supposed to be the text in your df column '0'. I updated my answer.
@Irsyaduddin Yup, understood that now :) But I get the error "IndexError: list index out of range". I think there is some problem with the split and the list. I checked what is stored in item_list and found ['Sv2.55+Fv2.04R02', '2022-01-01', '0'], only 3 items, but we were trying to access the 4th element in the list. I tried to figure it out but was not successful :p
Ah, I see. Can you give one example of the actual string? I'm not sure of its format; I copied yours and got 2 whitespaces, which is what the '\s\s' in the code is for.
@Irsyaduddin The actual string would be this, right? Sv2.55+Fv2.04R02[2022-01-01T00 16 29.464Z]
1

With df being one of your dataframes, you could try the following:

import numpy as np
import pandas as pd

# regex=False matches the literal "Sv2." (as a regex, "." would match any character)
m = df[0].str.contains("Sv2.", regex=False)
ser = df.loc[m, 0] + " " + df.loc[m, 1] + " " + df.loc[m, 2]
datetime = pd.to_datetime(
    ser.str.extract(r"Sv2\..*?\[(.*?)\]")[0].str.replace(r"\s+", " ", regex=True),
    format="%Y-%m-%dT%H %M %S.%fZ"
)
df.loc[m, 0] = datetime.dt.strftime("%Y-%m-%d")
df.loc[m, 1] = datetime.dt.strftime("%H:%M:%S")
df.loc[m, 2] = np.nan
  • First build a mask m that selects the rows that contain "Sv2." in the first column.
  • Based on that, build a series ser with the relevant strings, joined with a blank in between.
  • Use .str.extract to fetch the datetime part via the capture group of a regex: look for the "Sv2." part, then go forward to the opening bracket "[", and then capture everything up to the closing bracket "]".
  • Convert those strings to datetimes with pd.to_datetime (see here for the format codes).
  • Extract the required parts with .dt.strftime into the respective columns.
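The steps above can be followed one intermediate value at a time. This is only a sketch on a hypothetical two-row sample modelled on the question's data; the names raw, stamp, date_part and time_part are introduced here for illustration.

```python
import pandas as pd

# Hypothetical two-row sample modelled on the question's data.
df = pd.DataFrame({
    0: ["Sv2.55+Fv2.04R02[2022-01-01T00", "- SC"],
    1: ["16", "OK"],
    2: ["29.464Z]", None],
})

# Step 1: mask of the rows whose first column contains the literal "Sv2."
m = df[0].str.contains("Sv2.", regex=False)

# Step 2: join the three columns of those rows with single blanks
ser = df.loc[m, 0] + " " + df.loc[m, 1] + " " + df.loc[m, 2]

# Step 3: capture everything between "[" and "]"
raw = ser.str.extract(r"Sv2\..*?\[(.*?)\]")[0]

# Step 4: parse the captured text with an explicit format
stamp = pd.to_datetime(raw, format="%Y-%m-%dT%H %M %S.%fZ")

# Step 5: format the date and the time separately
date_part = stamp.dt.strftime("%Y-%m-%d")
time_part = stamp.dt.strftime("%H:%M:%S")

print(m.tolist())
print(raw.iloc[0])
print(date_part.iloc[0], time_part.iloc[0])
```

Going through real datetimes has the side effect that a malformed timestamp raises an error in pd.to_datetime instead of passing through silently.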

Alternative approach without real datetimes:

import numpy as np

# regex=False matches the literal "Sv2." (as a regex, "." would match any character)
m = df[0].str.contains("Sv2.", regex=False)
ser = df.loc[m, 0] + " " + df.loc[m, 1] + " " + df.loc[m, 2]
datetime = ser.str.extract(
    r"Sv2\..*?\[(\d{4}-\d{2}-\d{2}).*?(\d{2}\s+\d{2}\s+\d{2})\."
)
datetime[1] = datetime[1].str.replace(r"\s+", ":", regex=True)
df.loc[m, [0, 1]] = datetime
df.loc[m, 2] = np.nan

Result for the following sample df (taken from your example)

                                0     1         2
0  Sv2.55+Fv2.04R02[2022-01-01T00    16  29.464Z]
1                            - SC    Ok       NaN
2                     - PC Number     1       NaN
3                         - PC MS    Ok       NaN
4                     - PC PValue     8       NaN
5                      - Level SS  High       NaN
6  Sv2.55+Fv2.04R02[2022-01-01T23    16  34.235Z]
7  Sv2.55+Fv2.04R02[2022-01-01T23    16  38.657Z]

is

             0         1   2
0   2022-01-01  00:16:29 NaN
1         - SC        Ok NaN
2  - PC Number         1 NaN
3      - PC MS        Ok NaN
4  - PC PValue         8 NaN
5   - Level SS      High NaN
6   2022-01-01  23:16:34 NaN
7   2022-01-01  23:16:38 NaN
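Since the question mentions several dataframes stored in a dictionary, the same idea could be wrapped in a function and applied to each entry. The following is a sketch only: fix_timestamp_rows is a hypothetical helper name, the sample dictionary is made up from the question's data, and the cells are cast to str before joining as a precaution against mixed types.

```python
import numpy as np
import pandas as pd

def fix_timestamp_rows(df):
    """Return a copy of df with the 'Sv2.' rows rewritten as date, time, NaN."""
    out = df.copy()
    m = out[0].astype(str).str.contains("Sv2.", regex=False)
    # Join the three columns of the matching rows with single blanks
    ser = (out.loc[m, 0].astype(str) + " "
           + out.loc[m, 1].astype(str) + " "
           + out.loc[m, 2].astype(str))
    # Capture date, hours, minutes and whole seconds
    parts = ser.str.extract(
        r"\[(\d{4}-\d{2}-\d{2})T(\d{2})\s+(\d{2})\s+(\d{2})\."
    )
    out.loc[m, 0] = parts[0]
    out.loc[m, 1] = parts[1] + ":" + parts[2] + ":" + parts[3]
    out.loc[m, 2] = np.nan
    return out

# Hypothetical dictionary of sheets, modelled on exceldata_1 from the question.
exceldata_1 = {
    "Sheet1": pd.DataFrame({
        0: ["Sv2.55+Fv2.04R02[2022-01-01T23", "- Level SS"],
        1: ["16", "High"],
        2: ["34.235Z]", np.nan],
    }),
}

cleaned = {name: fix_timestamp_rows(sheet) for name, sheet in exceldata_1.items()}
print(cleaned["Sheet1"])
```

Returning a copy keeps the original dictionary untouched, which makes it easy to compare the sheets before and after.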

1 Comment

Hello, thank you so much for investing your valuable time in my question. I have achieved my desired result by making a few changes to the answer given by @Irsyaduddin, though. Your answer looks very elaborate. Thanks for your time, I hope it will be helpful for others.
0

Thanks for the idea on how to proceed, @Irsyaduddin. With some modifications to his answer, I was able to achieve it.

  • Make sure all the data types in your dataframe are strings

    import re

    for i in range(len(df1)):
        # Combine the data from all three columns into one string
        text = df1[0][i] + df1[1][i] + df1[2][i]
        if re.search('Sv', text) is not None:
            item_list = re.split(r'\[|T|Z', text)
            df1.iloc[i, 0] = item_list[1]
            # item_list[2] looks like '001629.464': HHMMSS followed by the fraction
            df1.iloc[i, 1] = item_list[2][:2] + ":" + item_list[2][2:4] + ":" + item_list[2][4:6]
            df1.iloc[i, 2] = 'NaN'
    df1
    

Result:

                0                 1            2
    0   2022-01-01             00:16:29        NaN
    1   - Server Connection       OK           nan
    2   - PC1 Number               1           nan
    3   - PC1 MS                  OK           nan
    4   - PC1 PV               4294954868      nan
    ... ... ... ...
    1046    - C Temperature       17°C         nan
    1047    2022-01-01             23:16:30    NaN
    1048    - Level Sensor Status   High       nan
    1049    2022-01-01             23:16:34    NaN
    1050    2022-01-01             23:16:38    NaN
    1051 rows × 3 columns

Result of Split:

item_list
['Sv2.55+Fv2.04R02', '2022-01-01', '001629.464', '] ']
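
The string conversion mentioned in the first bullet could look like the sketch below. It assumes that the second column was read as numbers, which is a guess about the original data; the two-row sample is hypothetical and only modelled on the question.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample with mixed types, modelled on the question's data.
df1 = pd.DataFrame({
    0: ["Sv2.55+Fv2.04R02[2022-01-01T00", "- PC1 Number"],
    1: [16, 1],
    2: ["29.464Z]", np.nan],
})

# Convert every cell to str so that the columns can be concatenated with "+"
df1 = df1.astype(str)

# Same combination and split as in the loop, shown for the first row only
combined = df1[0][0] + df1[1][0] + df1[2][0]
item_list = re.split(r"\[|T|Z", combined)
print(combined)
print(item_list)
```

Without the conversion, adding a str cell to an int cell would raise a TypeError.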
