Pyspark - Replace portion of a string with different characters (uneven character count)

Question

I'm trying to replace a portion of a string with a different and shorter string which has : and +. Even though the values under the Start column is time, it is not a timestamp and instead it is recognised as a string.

I've tried using regexp_replace but currently don't know how to specify the last 8 characters in the string in the 'Start' column that needs to be replaced or specify the string that I want to replace with the new one.

df = df.withColumn('replaced', regexp_replace('Start', ':00+10:00', '00Z' ))

For example: I would start with the dataframe with columns:

| Data | Qualifier | Description | Start                    |
|:----:|:---------:|:-----------:|:------------------------:|
|12    | 54        | apple       |2021-03-03T02:00:00+10:00 |
|24    | 32        | banana      |2021-03-04T22:30:00+10:00 |
|24    | 32        | orange      |2021-03-04T11:58:00+10:00 |

And would like to have the Start column replaced with a new Column named 'Replaced' or retain 'Start' column header.

Expected Output:

| Data | Qualifier | Description | Replaced                 |
|:----:|:---------:|:-----------:|:------------------------:|
|12    | 54        | apple       |2021-03-03T02:00:00Z      |
|24    | 32        | banana      |2021-03-04T22:30:00Z      |
|24    | 32        | orange      |2021-03-04T11:58:00Z      |

Mohana B C · Accepted Answer · 2021-08-23 04:59:33Z

1

You can convert the column type to timestamp using to_timestamp() function by providing input timestamp pattern.

If you really want to use regexp_replace then in your regex pattern you have to escape + using \

>>> df = spark.createDataFrame(['2021-03-03T02:00:00+10:00'], StringType())
>>> df.withColumn('replaced', regexp_replace("value", "\+\d+:\d+", "Z")).show(truncate=False)

+-------------------------+--------------------+
|value                    |replaced            |
+-------------------------+--------------------+
|2021-03-03T02:00:00+10:00|2021-03-03T02:00:00Z|
+-------------------------+--------------------+

edited Aug 23, 2021 at 4:59

answered Aug 23, 2021 at 4:54

Mohana B C

5,4721 gold badge13 silver badges31 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

KPC2 · Accepted Answer · 2021-08-23 06:36:49Z

1

I have also been able to solve this without using regexp_replace and instead using expr and format_string, since the input was a strint and not a timestamp.

df= df.withColumn( "Start", expr("substring(Start, 1, length(Start)-8)") )

df= df.withColumn( "Start", format_string("%s00Z", "Start") )

answered Aug 23, 2021 at 6:36

KPC2

257 bronze badges

Collectives™ on Stack Overflow

Pyspark - Replace portion of a string with different characters (uneven character count)

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related