
I am attempting to mask SSNs with random SSNs in a large text file. The file is about 400 MB (0.4 GB).

There are 17,000 instances of SSNs that I want to find and replace.

Here is an example of the PowerShell script I am using.

(get-content C:\TrainingFile\TrainingFile.txt) | foreach-object {$_ -replace "123-45-6789", "666-66-6666"} | set-content C:\TrainingFile\TrainingFile.txt

My problem is that I have 17,000 lines of this code in a .ps1 file. The .ps1 file looks similar to this:

(get-content C:\TrainingFile\TrainingFile.txt) | foreach-object {$_ -replace "123-45-6789", "666-66-6666"} | set-content C:\TrainingFile\TrainingFile.txt

(get-content C:\TrainingFile\TrainingFile.txt) | foreach-object {$_ -replace "122-45-6789", "666-66-6668"} | set-content C:\TrainingFile\TrainingFile.txt

(get-content C:\TrainingFile\TrainingFile.txt) | foreach-object {$_ -replace "223-45-6789", "666-66-6667"} | set-content C:\TrainingFile\TrainingFile.txt

(get-content C:\TrainingFile\TrainingFile.txt) | foreach-object {$_ -replace "123-44-6789", "666-66-6669"} | set-content C:\TrainingFile\TrainingFile.txt

And so on, for 17,000 PowerShell commands in the .ps1 file, one command per line.

I did a test on just one command and it took about 15 seconds to execute. Doing the math, 17,000 x 15 seconds comes out to about 3 days to run my .ps1 script of 17,000 commands.

Is there a faster way to do this?

  • Do the replacements need to be mapped to specific numbers like your example, or can it be any random 3-2-4 digit sequence? Commented Jun 23, 2014 at 12:29
  • I have already generated the 17,000 unique random SSNs, so there are no duplicates in that regard. I just used 666-66-6666 as an example. Commented Jun 23, 2014 at 13:07
  • I understand. The question is, does each individual SSN need to be mapped to a specific replacement random SSN or can it be any one of them as long as each one gets a unique replacement string? Commented Jun 23, 2014 at 13:11
  • Each SSN needs to be mapped to a specific replacement random SSN. Commented Jun 23, 2014 at 15:13
  • Updated my answer with different solution for that scenario. Commented Jun 23, 2014 at 15:50

4 Answers

2

The reason for the poor performance is that a lot of extra work is being done. Let's look at the process as pseudocode:

select SSN (X) and masked SSN (X') from a list
read all rows from file
search each file row for string X
if found, replace with X'
save all rows to file
loop until all SSNs are processed

So what's the problem? For each SSN replacement, you process all the rows: not only those that need masking but also those that don't. That's a lot of extra work. If you have, say, 100 rows and 10 replacements, you are going to use 1,000 steps when only 100 are needed. In addition, reading and saving the file creates disk I/O. Whilst that's not often an issue for a single operation, multiply the I/O cost by the loop count and you'll find quite a lot of time wasted on disk waits.

For better performance, tune the algorithm like so:

read all rows from file
loop through rows
for current row, change X -> X'
save the result

Why should this be faster? 1) You read and save the file only once, and disk I/O is slow. 2) You process each row only once, so no extra work is done. As for how to actually perform the X -> X' transform, you have to define more carefully what the masking rule is.
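To put rough numbers on the difference, here is a back-of-the-envelope sketch using the figures from the question (17,000 replacements, about 15 seconds per full pass over the file). The variable names are illustrative, and the single-pass figure assumes that one combined pass costs about as much as one of the original commands, which is only an approximation.

```powershell
$replacementCount = 17000   # SSNs to mask (figure from the question)
$secondsPerPass   = 15      # measured time for one read/replace/write pass

# Original approach: one full pass over the file per replacement
$multiPass = [TimeSpan]::FromSeconds($replacementCount * $secondsPerPass)

# Tuned approach: one pass in total
# (assumption: the per-row lookup cost is small compared with the disk I/O)
$singlePass = [TimeSpan]::FromSeconds($secondsPerPass)

"Multi-pass : {0:N1} days" -f $multiPass.TotalDays
"Single pass: {0:N0} seconds" -f $singlePass.TotalSeconds
```

The real single-pass time will be higher than one original command, since every row needs a regex match and a lookup, but it should stay in the range of minutes rather than days.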

Edit

Here's a more practical solution:

Since you already know the f(X) -> X' results, you should have a pre-calculated list saved to disk like so:

ssn,mask
"123-45-6789","666-66-6666"
...
"223-45-6789","666-66-6667"

Import the file into a hash table and work forward by stealing all the juicy bits from Ansgar's answer, like so:

$ssnMask = @{}
$ssn = import-csv "c:\temp\SSNMasks.csv" -delimiter ","

# Add X -> X' to hashtable
$ssn | % {
  if(-not $ssnMask.ContainsKey($_.ssn)) {
    # It's an error to add existing key, so check first 
    $ssnMask.Add($_.ssn, $_.mask)
  }
}

$dataToMask = get-content "c:\temp\training.txt"
$dataToMask | % {
   if ( $_ -match '(\d{3}-\d{2}-\d{4})' ) {
     # Replace SSN look-a-like with value from hashtable
     # NB: Only the first SSN found on a line is looked up, and an SSN that
     # has no entry in the hashtable is replaced with an empty string
     $_ -replace  $matches[1], $ssnMask[$matches[1]]
   } else {
     # No SSN on this line: pass it through unchanged so it is not lost
     $_
   }
} | set-content "c:\temp\training2.txt"
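
If a line can contain more than one SSN, the -match / -replace pair above only looks up the first one. A minimal sketch of one way around that is to hand a script block to the .NET Regex.Replace method, which then runs once per match. The function name Convert-SsnLine and the two-entry mapping are made up for illustration, and leaving unmapped SSNs unchanged is an assumption about the desired behavior, not something the question specifies.

```powershell
# Hypothetical mapping; in practice this would be the hashtable loaded from the CSV
$ssnMask = @{
    '123-45-6789' = '666-66-6666'
    '223-45-6789' = '666-66-6667'
}
$ssnPattern = [regex]'\d{3}-\d{2}-\d{4}'

function Convert-SsnLine {
    param([string]$Line, [hashtable]$Map)
    # The script block is called once per match; $args[0] is the Match object
    $ssnPattern.Replace($Line, {
        $found = $args[0].Value
        # Assumption: an SSN without a mapping is left as it is
        if ($Map.ContainsKey($found)) { $Map[$found] } else { $found }
    })
}

Convert-SsnLine -Line 'A 123-45-6789 B 223-45-6789 C 999-99-9999' -Map $ssnMask
# -> A 666-66-6666 B 666-66-6667 C 999-99-9999
```

Each hashtable lookup is effectively constant time, so the cost per line depends on the number of SSNs in that line and not on the size of the mapping.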

4 Comments

I have already generated the 17,000 unique random SSNs, so there are no duplicates in that regard. I just used 666-66-6666 as an example.
I don't think that's going to perform very well. You've got an array of 17,000 PS custom objects ($ssn), a hash table with 17,000 entries ($ssnMask), and the entire contents of a 400 MB file converted to a string array ($dataToMask) all resident in memory at the same time. That's a lot of memory, and a lot of CPU to manage it.
@mjolinor Good point about performance. I tried with some dummy data. A hashtable with 17,000 keys isn't a problem, but the 400 MB input file is somewhat on the heavy side. I split my test data processing into 1,000 batches. My two-year-old i7/8 GB laptop did the whole thing in about 8 minutes, and PowerShell used some 2 GB of memory. Further optimization is surely possible.
Files that size generally do better with pipeline solutions. I updated my answer to use -ReadCount, which should help by reducing the I/O, but it's still going to have to do one line at a time for the replacements.
0

Avoid reading and writing the file multiple times. I/O is expensive and is what slows your script down. Try something like this:

$filename = 'C:\TrainingFile\TrainingFile.txt'

$ssnMap = @{}
(Get-Content $filename) | % {
  if ( $_ -match '(\d{3}-\d{2}-\d{4})' ) {
    # If SSN is found, check if a mapping of that SSN to a random SSN exists.
    # Otherwise create a new mapping.
    if ( -not $ssnMap.ContainsKey($matches[1]) ) {
      do {
        $rnd = Get-Random -Min 100000 -Max 999999
        $newSSN = "666-$($rnd -replace '(..)(....)','$1-$2')"
      } while ( $ssnMap.ContainsValue($newSSN) )  # loop to avoid collisions
      $ssnMap[$matches[1]] = $newSSN
    }

    # Replace the SSN with the corresponding randomly generated SSN.
    $_ -replace $matches[1], $ssnMap[$matches[1]]
  } else {
    # If no SSN is found, simply print the line.
    $_
  }
} | Set-Content $filename

If you already have a list of random SSNs and also have them mapped to specific "real" SSNs, you could read those mappings from a CSV (example column titles: realSSN, randomSSN) into the $ssnMap hashtable:

$ssnMap = @{}
Import-Csv 'C:\mappings.csv' | % { $ssnMap[$_.realSSN] = $_.randomSSN }
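
As a small illustration of what that mapping looks like once loaded, the sketch below builds the same kind of hashtable from inline CSV text via ConvertFrom-Csv, so it can be tried without a file on disk. The column names follow the example titles above; the two data rows are invented sample values.

```powershell
# Illustrative mapping data; a real run would use Import-Csv on the mapping file instead
$csvText = @'
realSSN,randomSSN
123-45-6789,666-66-6666
122-45-6789,666-66-6668
'@

$ssnMap = @{}
$csvText | ConvertFrom-Csv | ForEach-Object { $ssnMap[$_.realSSN] = $_.randomSSN }

$ssnMap['123-45-6789']               # mapped replacement for a known SSN
$ssnMap.ContainsKey('999-99-9999')   # False: no mapping exists for this SSN
```

With the map pre-loaded like this, the ContainsKey check in the loop above finds every known SSN, so the random-generation branch only runs for SSNs that are missing from the CSV.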

2 Comments

I think you're going to run out of possible random SSN replacements before it finishes the file. There are 17,000 SSNs to be replaced, and that will only be able to generate 9,000 possible replacement strings.
I have already generated the 17,000 unique random SSNs, so there are no duplicates in that regard. I just used 666-66-6666 as an example.
0

If you've already generated a list of random SSNs for replacement, and each SSN in the file just needs to be replaced with one of them (not necessarily mapped to a specific replacement string), then I think this will be much faster:

$inputfile = 'C:\TrainingFile\TrainingFile.txt'
$outputfile = 'C:\TrainingFile\NewTrainingFile.txt'

$replacements = Get-Content 'C:\TrainingFile\SSN_Replacements.txt'

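$script:i = 0
$ssnPattern = [regex]'\d{3}-\d{2}-\d{4}'

# The script block runs once per SSN found, so the counter only advances when a replacement is used.
# $script:i is needed because a plain $i++ inside the filter would only change a local copy.
Filter Replace-SSN { $ssnPattern.Replace($_, { $replacements[$script:i++] }) }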

Get-Content $inputfile |
Replace-SSN |
Set-Content $outputfile

This will walk through your list of replacement SSNs, selecting the next one in the list for each new replacement.

Edit:

Here's a solution for mapping specific SSNs to specific replacement strings. It assumes you have a CSV file of the original SSNs and their intended replacement strings, as columns 'OldSSN' and 'NewSSN':

$inputfile = 'C:\TrainingFile\TrainingFile.txt'
$outputfile = 'C:\TrainingFile\NewTrainingFile.txt'
$replacementfile = 'C:\TrainingFile\SSN_Replacements.csv' 

$SSNmatch = [regex]'\d{3}-\d{2}-\d{4}'

$replacements = @{}

Import-Csv $replacementfile |
 ForEach-Object { $replacements[$_.OldSSN] = $_.NewSSN }

Get-Content $inputfile -ReadCount 1000 |
 ForEach-Object {
  foreach ($Line in $_) {
   if ( $Line -match $SSNmatch ) {                     # Found SSN in line
    if ( $replacements.ContainsKey($matches[0]) ) {    # Found replacement string for this SSN
     # Replace SSN and output line
     # NB: every SSN on the line gets the replacement for the first SSN found
     $Line -replace $SSNmatch, $replacements[$matches[0]]
    }
    else {
     Write-Warning "No replacement string found for $($matches[0])"
     $Line   # Output the line unchanged so that it is not dropped from the result
    }
   }
   else { $Line }                                      # No SSN in this line - output line as-is
  }
 } | Set-Content $outputfile
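
For a file of this size, another option is to stream it line by line with the .NET StreamReader and StreamWriter classes so that only one line is held in memory at a time. The following is a rough sketch under a few assumptions: the function name Convert-SsnFile is invented, the paths are full paths (the .NET classes do not follow the PowerShell current location), the default UTF-8 handling of both classes suits the file, and SSNs without a mapping are left unchanged.

```powershell
function Convert-SsnFile {
    param(
        [string]$InputPath,     # full path of the file to read
        [string]$OutputPath,    # full path of the masked copy to write
        [hashtable]$Map         # original SSN -> replacement SSN
    )
    $pattern = [regex]'\d{3}-\d{2}-\d{4}'
    $reader = New-Object System.IO.StreamReader($InputPath)
    try {
        $writer = New-Object System.IO.StreamWriter($OutputPath)
        try {
            # ReadLine returns $null at end of file
            while ($null -ne ($line = $reader.ReadLine())) {
                # The script block runs once per SSN found on the line
                $masked = $pattern.Replace($line, {
                    $found = $args[0].Value
                    if ($Map.ContainsKey($found)) { $Map[$found] } else { $found }
                })
                $writer.WriteLine($masked)
            }
        }
        finally { $writer.Dispose() }
    }
    finally { $reader.Dispose() }
}
```

It could be called with the variables defined above, for example `Convert-SsnFile -InputPath $inputfile -OutputPath $outputfile -Map $replacements`. Because the replacement runs per match, several SSNs on one line each get their own mapped value.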


-1
# Fairly fast PowerShell code for masking up to 1000 SSNs per line in a large text file (with an unlimited number of lines in the file) where the SSN matches the pattern " ###-##-#### ", " ##-####### ", or " ######### ".
# This code can handle a 14 MB text file that has SSNs in nearly every row within about 4 minutes.


# $inputFilename = 'C:/InputFile.txt'

$inputFileName = "
1                                                                                                                                    
           0550       125665    338066                                                                                               
-                   02 CR05635                                  07/06/16                                                             
0     SAMPLE CUSTOMER NAME                                                                                                   
      PO BOX 12345                                                                                                                  
      ROSEVILLE CA 12345-9109                                                                                                        




 EMPLOYEE DEFERRALS                                                                                        
 FREDDIE MAC RO 16 9385456   164-44-9120     XXX                                                                               
 SALLY MAE RO 95 9385356   07-4719130     XXX                                                                               
 FRED FLINTSTONE RO 95 1185456   061741130     XXX  
 WILMA FLINTSTONE RO 91 9235456   364-74-9130  123456789 123456389 987354321    XXX                                                          
 PEBBLES RUBBLE RO 10 9235456 06-3749130  064-74-9150  034-74-9130  XXX                                                                               
 BARNEY RUBBLE RO 11 9235456 06-3449130 06-3749140 063-74-9130     XXX                                                                               
 BETTY RUBBLE RO 16 9235456   9-74-9140  123456789 123456789 987654321    XXX                                                                               

 PLEASE ENTER BELOW ANY ADDITIONAL PARTICIPANTS FOR WHOM YOU ARE                                                                     
 REMITTING.  FOR GENERAL INFORMATION AND SERVICE CALL                                                                              
"

$outputFilename = 'D:/OutFile.txt'

#(Get-Content $inputFilename ) | % {

($inputFilename ) | % {

       $NewLine=$_
       # Write-Host "0 new line value is ($NewLine)."
       $ChangeFound='Y'

       $WhileCounter=0


       While (($ChangeFound -eq 'Y') -and ($WhileCounter -lt 1000))
       {
       $WhileCounter=$WhileCounter+1
       $ChangeFound='N'

       $matches = $NewLine | Select-String -pattern "[ ][0-9]{3}-[0-9]{2}-[0-9]{4}[ \t\r\n]" -AllMatches
       If ($matches.length -gt 0)
       {
          $ChangeFound='Y'
          $NewLine=''
          for($i = 0; $i -lt 1; $i++){
              for($k = 0; $k -lt 1; $k++){
                  # Write-Host "AmHere 1a `$i ($i), `$k ($k), `$NewLine ($NewLine)."
                  $t = $matches[$i] -replace $matches[$i].matches[$k].value, (" ###-##-" + $matches[$i].matches[$k].value.substring(8) )
                  $NewLine=$NewLine + $t
                  # Write-Host "AmHere 1b `$i ($i), `$k ($k), `$NewLine ($NewLine)."

              }
          }
          # Write-Host "1 new line value is ($NewLine)."
       }
       $matches = $NewLine | Select-String -pattern "[ ][0-9]{2}-[0-9]{7}[ \t\r\n]" -AllMatches
       If ($matches.length -gt 0)
       {
          $ChangeFound='Y'
          $NewLine=''
          for($i = 0; $i -lt 1; $i++){
              for($k = 0; $k -lt 1; $k++){
                  # Write-Host "AmHere 2a `$i ($i), `$k ($k), `$NewLine ($NewLine)."
                  $t = $matches[$i] -replace $matches[$i].matches[$k].value, (" ##-###" + $matches[$i].matches[$k].value.substring(7) )
                  $NewLine=$NewLine + $t
                  # Write-Host "AmHere 2b `$i ($i), `$k ($k), `$NewLine ($NewLine)."
              }
          }
          # Write-Host "2 new line value is ($NewLine)."
       }
       $matches = $NewLine | Select-String -pattern "[ ][0-9]{9}[ \t\r\n]" -AllMatches
       If ($matches.length -gt 0)
       {
          $ChangeFound='Y'
          $NewLine=''
          for($i = 0; $i -lt 1; $i++){
              for($k = 0; $k -lt 1; $k++){
                  # Write-Host "AmHere 3a `$i ($i), `$k ($k), `$NewLine ($NewLine)."
                  $t = $matches[$i] -replace $matches[$i].matches[$k].value, (" #####" + $matches[$i].matches[$k].value.substring(6) )
                  $NewLine=$NewLine + $t
                  # Write-Host "AmHere 3b `$i ($i), `$k ($k), `$NewLine ($NewLine)."
              }
          }
          #print the line
          # Write-Host "3 new line value is ($NewLine)."
       }
       # Write-Host "4 new line value is ($NewLine)."

       } # end of While
       Write-Host "5 new line value is ($NewLine)."

       $NewLine

    # Replace the SSN with the corresponding randomly generated SSN.
    # $_ -replace $matches[1], $ssnMap[$matches[1]]
 } | Set-Content $outputFilename

2 Comments

This answer needs some formatting; check out the help section of Stack Overflow.
Actually, execution time for the 14 MB file was 2 minutes, and not 4 minutes.
