I have a dataframe in which one column's values are lists of strings. Here is the structure of the file to read:

[
    {
        "key1":"value1 ",
        "key2":"2",
        "key3":["a","b  2 "," exp  white   space 210"],
    },
    {
        "key1":"value1 ",
        "key2":"2",
        "key3":[],
    },

]

I need to remove the extra whitespace in each item wherever there is more than one whitespace character. Expected output:

[
    {
        "key1":"value1",
        "key2":"2",
        "key3":["a","b2","exp white space 210"],
    },
    {
        "key1":"value1",
        "key2":"2",
        "key3":[],
    }
]

Note: some values are empty in some rows, e.g. "key3":[]

5 Comments
  • Use df.replace('\s+', ' ', regex=True) for multiple spaces and use str.strip for the leading and trailing spaces Commented Mar 18, 2022 at 15:20
  • This does not work for the values in the array. Commented Mar 18, 2022 at 15:24
  • It does work. I tested it. Commented Mar 18, 2022 at 15:25
  • Please change the question to put a sample problematic input, i.e., your empty list. People should be able to cut and paste your sample and reproduce the actual problem you're struggling with. Commented Mar 18, 2022 at 15:58
  • It is not valid JSON after the change to the description. Commented Mar 18, 2022 at 17:14

2 Answers

1

If I understand correctly, some of your dataframe cells hold list-type values.

The file_name.json content is below:

[
    {
        "key1": "value1 ",
        "key2": "2",
        "key3": ["a", "b  2 ", " exp  white   space 210"]
    }, 
    {
        "key1": "value1 ",
        "key2": "2",
        "key3": []
    }
]

A possible solution in this case is the following:

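import re

import pandas as pd

df = pd.read_json("file_name.json")


def cleanup_data(value):
    # Strip leading/trailing whitespace and collapse inner runs to one space.
    if isinstance(value, list):
        return [re.sub(r'\s+', ' ', x.strip()) for x in value]
    if isinstance(value, str):
        return re.sub(r'\s+', ' ', value.strip())
    # Leave numbers and other non-string values untouched.
    return value


# Apply the cleanup function to every cell in the dataframe.
# In pandas 2.1+, applymap is deprecated; use df.map(cleanup_data) instead.
df = df.applymap(cleanup_data)

df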

Returns

     key1  key2                           key3
0  value1     2  [a, b 2, exp white space 210]
1  value1     2                             []
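
As an alternative to cleaning the cells after loading, the same normalization could be applied to the parsed JSON before any dataframe is built. The following is only a sketch under that assumption: normalize_whitespace is a hypothetical helper name, not part of pandas or of the answer above, and the sample string is the file content from this answer inlined for illustration. It uses only the standard library.

```python
import json
import re


def normalize_whitespace(value):
    """Hypothetical helper: strip strings and collapse whitespace runs,
    recursing into lists and dicts; other types are returned unchanged."""
    if isinstance(value, str):
        return re.sub(r"\s+", " ", value.strip())
    if isinstance(value, list):
        return [normalize_whitespace(item) for item in value]
    if isinstance(value, dict):
        return {key: normalize_whitespace(item) for key, item in value.items()}
    return value


# Sample data copied from the answer above (inlined instead of read from a file).
raw = """[
    {"key1": "value1 ", "key2": "2", "key3": ["a", "b  2 ", " exp  white   space 210"]},
    {"key1": "value1 ", "key2": "2", "key3": []}
]"""

records = normalize_whitespace(json.loads(raw))
print(records)
```

The cleaned records can then be passed to pd.DataFrame(records). Note that this collapses "b  2 " to "b 2", as in the output above, not to "b2" as shown in the question's expected output; removing inner spaces entirely would need a different rule.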

2 Comments

I have an array of objects, so this will not work.
I updated the code for the new format of the input data.
0

If I understand correctly:

df = pd.read_json('''{
    "key1":"value1 ",
    "key2":"value2",
    "key3":["a","b   "," exp  white   space "],
    "key2":" value2"
}''')

df = df.apply(lambda col: col.str.strip().str.replace(r'\s+', ' ', regex=True))

Output:

>>> df
     key1    key2             key3
0  value1  value2                a
1  value1  value2                b
2  value1  value2  exp white space

>>> df.to_numpy()
array([['value1', 'value2', 'a'],
       ['value1', 'value2', 'b'],
       ['value1', 'value2', 'exp white space']], dtype=object)

5 Comments

I got this error AttributeError: Can only use .str accessor with string values!.
Will you please show how you're reading the JSON file in the question? I think we're reading it differently, hence the error on your end and not on mine :)
df = pd.read_json("filename.json")
When I paste your JSON into filename.json, run df = pd.read_json("filename.json") and then df = df.apply(lambda col: col.str.strip().str.replace(r'\s+', ' ', regex=True)), it produces a dataframe just like the one in my answer. So I can't tell what's wrong...
I guess it is because I have some values that are empty in some rows, e.g. "key3":[]
