Python apply a lambda function into a csv file(Large files)

Question

I want to apply this function, hideEmail, to a specific column of my CSV file (a large file) using Python.

Example of function :

import re

def hideEmail(email):
    #hide email
    text = re.sub(r'[^@.]', 'x', email)
    return text

Csv file (large file > 1gb):

    id;Name;firstName;email;profession
    100;toto;tata;[email protected];developer
    101;titi;tete;[email protected];doctor
    ..
    ..

fsl · Accepted Answer · 2021-03-29 09:50:46Z

4

Load the csv data into a DataFrame:

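import pandas as pd
df = pd.read_csv(r'/path/to/csv', sep=';')

Then you can use pd.Series.str.replace directly on the email column. Pass regex=True explicitly, because the default for that parameter differs between pandas versions:

df['email'] = df['email'].astype(str).str.replace(r'[^@.]', 'x', regex=True)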
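Since the file in the question is over 1 GB, it may help to process it in chunks rather than loading it all at once. Here is a minimal sketch using the chunksize parameter of pd.read_csv. The function name hide_email_column, its parameters, and the default chunk size are my own assumptions for illustration, not something from the question:

```python
import pandas as pd


def hide_email_column(src, dst, column="email", sep=";", chunksize=100_000):
    """Mask one column of a delimited file, reading it chunk by chunk."""
    first = True
    # dtype=str and keep_default_na=False keep every field as the text it was
    reader = pd.read_csv(src, sep=sep, dtype=str,
                         keep_default_na=False, chunksize=chunksize)
    for chunk in reader:
        chunk[column] = chunk[column].str.replace(r"[^@.]", "x", regex=True)
        # write the header once, then append the following chunks
        chunk.to_csv(dst, sep=sep, index=False,
                     mode="w" if first else "a", header=first)
        first = False
```

Something like hide_email_column('/path/to/file.csv', '/path/to/new_file.csv') should then keep only one chunk in memory at a time. The right chunk size depends on your machine, so treat 100,000 rows as a starting guess.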

That said, if all you want to do is change a large CSV file, pandas is probably overkill. You might have a look at sed. Here's one example:

sed -E 's/(\w+)@(\w+)/xxx@xxx/' /path/to/file.csv > /path/to/new_file.csv

edited Mar 29, 2021 at 9:50

answered Feb 28, 2021 at 14:13

fsl


2 Comments

siiraaj Over a year ago

Thanks @FelipeLanza, but I have other functions in Python to apply, and unfortunately there is no regex, so I can't use sed.

fsl Over a year ago

It most certainly does support regex. Might have a look here: gnu.org/software/sed/manual/sed.html#sed-regular-expressions.

Red · Accepted Answer · 2021-03-01 13:46:10Z

3

It's a bit hard to know without the data frame, but you can try:

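import pandas as pd #import pandas
df = pd.read_csv('enter_file_path_here', sep=';') #read the data (the sample file is semicolon-delimited)

df['email'] = df['email'].apply(hideEmail) #hideEmail is the function from the question
#if you want to make it back to a csv (index=False leaves out the row index):
df.to_csv('name.csv', sep=';', index=False)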

edited Mar 1, 2021 at 13:46

Red


answered Feb 28, 2021 at 14:09

Epic_Yarin_God


4 Comments

Robert Axe Over a year ago

Question is how to apply on csv file, not on pandas dataframe. I think you should include how to read and write pandas dataframe as well

Epic_Yarin_God Over a year ago

Right you are, I will edit it accordingly :)

The Singularity Over a year ago

I think the question is directed to a Pandas Dataframe

fsl Over a year ago

You don't need the lambda here.

joprocorp · Accepted Answer · 2021-02-28 14:17:24Z

3

Using pandas

You can use pandas, as described in a previous question, to apply a function passed as a parameter.

To export the resulting DataFrame, use the to_csv function.

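import re
import pandas as pd

def hideEmail(email):
    #hide email
    text = re.sub(r'[^@.]', 'x', email)
    return text


column_name = "email"

# the sample file is semicolon-delimited, so pass sep=';'
df = pd.read_csv(r'Path of your CSV file\File Name.csv', sep=';')
df[column_name] = df[column_name].map(hideEmail)
# index=False keeps the row index out of the exported file
df.to_csv(r'Path where you want to store the exported CSV file\File Name.csv', sep=';', index=False)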

answered Feb 28, 2021 at 14:17

joprocorp


Red · Accepted Answer · 2021-03-01 13:45:58Z

2

You can get it done with plain file reading and writing, splitting each line on the delimiter, as follows:

import re

def hideEmail(email):
    #hide email
    text = re.sub(r'[^@.]', 'x', email)
    return text


with open('path/to/csvfile', 'r') as file:
     lines = [l.rstrip('\n').split(';') for l in file]

modifiedlines = [';'.join(lines[0])]       # keep the header line unchanged

for i in lines[1:]:         # iterating from index 1 as index 0 is header
    i[3] = hideEmail(i[3])       # as email field is at index 3
    modifiedlines.append(';'.join(i))     # appending modified line

with open('path/to/csvfile', 'w') as file:
     file.write('\n'.join(modifiedlines) + '\n')            # writing the lines back to file, one per line
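For a file over 1 GB, holding every line in a list may use a lot of memory. A streaming variant of the same idea could look like the sketch below, which uses the standard csv module and writes to a separate output file. The name mask_column_streaming and its parameters are hypothetical, and it assumes every data row has a field at the given index:

```python
import csv
import re


def hideEmail(email):
    # same masking rule as in the question
    return re.sub(r'[^@.]', 'x', email)


def mask_column_streaming(src_path, dst_path, index=3, delimiter=';'):
    """Copy src_path to dst_path, masking one field per row, one row at a time."""
    with open(src_path, newline='') as src, \
            open(dst_path, 'w', newline='') as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst, delimiter=delimiter, lineterminator='\n')
        header = next(reader, None)
        if header is None:
            return  # empty input, nothing to copy
        writer.writerow(header)  # header goes through unchanged
        for row in reader:
            row[index] = hideEmail(row[index])  # email field is at index 3
            writer.writerow(row)
```

Writing to a new file instead of overwriting the input also means an interrupted run does not leave you with a half-written original.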

edited Mar 1, 2021 at 13:45

Red


answered Feb 28, 2021 at 15:14

Shoaib Wani


Red · Accepted Answer · 2021-03-25 12:03:10Z

1

You can use the built-in map() function to map the function to each line of the file:

import re

def hideEmail(email):
    #hide email
    text = re.sub(r'[^@.]', 'x', email)
    return text 

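with open('file.csv', 'r') as r:
    # strip the newline first, otherwise re.sub replaces it with 'x' as well;
    # note that this masks every field on the line, not only the email
    r = map(hideEmail, [line.rstrip('\n') for line in r.readlines()])

with open('file2.csv', 'w') as f:
    for line in r:
        f.write(line + '\n')

EDIT (credits to juanpa.arrivillaga for pointing it out):

The r.readlines() call can be replaced with just r, since iterating over a file object yields its lines. If you pass the file object straight to map(), as in r = map(hideEmail, r), remember that map() is lazy, so the file must still be open when the result is consumed.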

edited Mar 25, 2021 at 12:03

answered Feb 28, 2021 at 14:28

Red


4 Comments

juanpa.arrivillaga Over a year ago

no need for r.readlines() just r = map(hideEmail, r) works

Red Over a year ago

@juanpa.arrivillaga Thank you for informing me.

siiraaj Over a year ago

@AnnZen how can I specify a column name to apply my lambda function to?

Zach Young Over a year ago

This will replace everything in the line that isn't @ or ., this solution is definitely missing the columns/fields aspect of the delimited input.
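To address the column question raised in these comments, here is a rough sketch that picks columns by name with csv.DictReader and applies any Python function to them, one row at a time. The helper transform_columns and its transforms argument are made-up names for illustration, and the sketch assumes every row has the same fields as the header:

```python
import csv
import re


def hideEmail(email):
    # same masking rule as in the question
    return re.sub(r'[^@.]', 'x', email)


def transform_columns(src_path, dst_path, transforms, delimiter=';'):
    """Apply transforms[column_name] to that column of every row, streaming."""
    with open(src_path, newline='') as src, \
            open(dst_path, 'w', newline='') as dst:
        reader = csv.DictReader(src, delimiter=delimiter)
        if reader.fieldnames is None:
            return  # empty input, nothing to write
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                                delimiter=delimiter, lineterminator='\n')
        writer.writeheader()
        for row in reader:
            # only the named columns are touched; the rest pass through
            for column, func in transforms.items():
                row[column] = func(row[column])
            writer.writerow(row)
```

A call such as transform_columns('file.csv', 'file2.csv', {'email': hideEmail}) would mask only the email field, and further functions can be added to the dictionary under other column names.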
