I am trying to proper-case the contents of a column that contains various erroneous inputs. The code I used works on everything except values that are entirely in capitals.
I can't use "lower(initcap(field))" because it breaks some of the other corrections in the code. Is there a way to fix them all at once? The rows highlighted in yellow are wrong: the code puts a space between every letter.

-- Replace dashes and underscores with a space, then insert a space before every capital letter (to split multi-word values that have capitals but no spaces), then trim, then capitalize the first letter of every word. Lastly, collapse repeated spaces into one.

select old_name,regexp_replace(initcap(trim(regexp_replace(replace(replace(old_name,'_',' '),'-',' '), '([A-Z])', ' \$1')) ) , '\\s+', ' ') as fixed_name from mytable

[Screenshot: query results, with the incorrectly spaced all-caps rows highlighted in yellow]

2 Answers

Can't you use the initcap() function?

from pyspark.sql.functions import col, initcap

source_sdf = spark.createDataFrame([
    ("Instore Checkin",),
    ("Spend Transaction",),
    ("Bill Promotion",),
    ("Stab Upgrade Customer Promotion",),
    ("NOTAPPLICABLE",),
    ("Tire Rotations",),
    ("Return Bill",),
    ("NOTCAPTURED",),
    ("INVALID",)
], ["original"])

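# initcap upper-cases the first letter of each whitespace-delimited word and lower-cases the rest
source_propered_sdf = source_sdf.withColumn("proper", initcap(col("original")))
# display() is available in Databricks notebooks; use show() elsewhere
source_propered_sdf.display()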

The above creates the following...

[Screenshot: the resulting DataFrame with the "original" and "proper" columns]

Do whatever transformations you need before changing the case (trim, remove dashes, remove underscores, and so on), then apply the initcap function.

IMHO it will give you a more maintainable notebook than some crazy regex string that no one will ever want to touch again.

I think this approach covers all of your example cases.

initcap(regexp_replace(regexp_replace(old_name, "[-_]", " "), "([a-z])([A-Z])", "\$1 \$2"))

Here is a full reprex:

%sql

with mytable as (
  select
    explode(
      array(
        "InstoreCheckin", 
        "SpendTransaction", 
        "BillPromotion", 
        "SlabUpgradeCustomerPromotion", 
        "NOT-APPLICABLE", 
        "Tire_Rotations", 
        "ReturnBill", 
        "NOT-CAPTURED", 
        "INVALID")
    ) as old_name
)
select
  -- source column
  old_name as `Old Name`,

  -- original code
  regexp_replace(
    initcap(
      trim(
        regexp_replace(
          replace(
            replace(
              old_name,
              '_',
              ' '
            ),
            '-',
            ' '
          ), 
          '([A-Z])', 
          ' \$1'
        )
      )
    ), 
    '\\s+', 
    ' '
  ) as `Fixed Code Name Original`,

  -- proposed code
  initcap(
    regexp_replace(
      regexp_replace(
        old_name, 
        "[-_]", 
        " "
      ), 
      "([a-z])([A-Z])", 
      "\$1 \$2"
    )
  ) as `Fixed Code Name Proposed`

from mytable
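
To see why the two versions differ without running Spark, here is a rough plain-Python sketch of both regex chains. The helper names (initcap, original_fix, proposed_fix) are made up for illustration, and the initcap function here is only an approximation of Spark's initcap, assuming words are separated by single spaces.

```python
import re


def initcap(text: str) -> str:
    # Approximation of Spark's initcap (an assumption, not Spark itself):
    # upper-case the first letter of each space-separated word, lower-case the rest.
    return " ".join(word[:1].upper() + word[1:].lower() for word in text.split(" "))


def original_fix(name: str) -> str:
    # Mirrors the question's chain: a space is inserted before EVERY capital,
    # so an all-caps word is split into single letters.
    spaced = name.replace("_", " ").replace("-", " ")
    spaced = re.sub(r"([A-Z])", r" \1", spaced)
    return re.sub(r"\s+", " ", initcap(spaced.strip()))


def proposed_fix(name: str) -> str:
    # Mirrors the proposed chain: a space is inserted only where a lower-case
    # letter is directly followed by a capital, so all-caps words stay whole.
    spaced = re.sub(r"[-_]", " ", name)
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", spaced)
    return initcap(spaced)


for value in ["SlabUpgradeCustomerPromotion", "NOT-APPLICABLE", "Tire_Rotations", "INVALID"]:
    print(f"{value!r}: original={original_fix(value)!r}, proposed={proposed_fix(value)!r}")
```

One caveat: a value like "NOTAPPLICABLE" with no separator has no lower-to-upper boundary, so this approach would give "Notapplicable" rather than two words. Splitting such values would need something else, such as a lookup of known values.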

[Screenshot: query output with the columns "Old Name", "Fixed Code Name Original" and "Fixed Code Name Proposed"]
