Parse a huge file with shell (or another scripting language)

Question

I am trying to parse a huge file (around 13 GB) and transpose it into a CSV (splitting it into two or three files would also be fine). Each attribute of a record is on its own line, which is why the file has around 500.000.000 rows. Also, the attributes vary from one record to another: some columns may appear and some may not. I came up with a shell script for transposing it, but it takes 12 minutes to process 1.000.000 rows, so it would take 100 hours to parse the whole file.

The shell script is the following:

#############################################
# Define the usage
#############################################

gUsage="
usage: %> `basename $0` <Run Date> <Input path> <Preprocessing path> <filename> 

where
    Input path:    Generic folder where the input file is for transposing
    Preprocessing path:   Generic folder where the processed file will be moved
    filename:   Template for filename

"

ls_current_date=`date +'%Y-%m-%d'`
ls_current_time=`date +'%H%M%S'`
ls_run_name="${ls_current_date}"_"${ls_current_time}"

i=-1
j=0
d=-1

# Check number of parameters 
if [ $# -ne 4  ]; then
    echo "" 
    echo "ERROR: Expecting 4 parameters" 
    echo "$gUsage" 
    exit
fi


ls_current_date=`date +'%Y-%m-%d'`
ls_current_time=`date +'%H%M%S'`
ls_run_name="${ls_current_date}"_"${ls_current_time}"




#############################################
# VN Declare & Check User Parameters + input files existence
#############################################



p_InputPath=$2
p_PreprocessingPath=$3
p_filename=$4

echo "Start time : $ls_run_name " > "${p_PreprocessingPath}/log.txt"
echo " Starting the transposing process..." >> "${p_PreprocessingPath}/log.txt"
echo "  " >> "${p_PreprocessingPath}/log.txt"
echo "  " >> "${p_PreprocessingPath}/log.txt"
### Parameter 1 is the Run Date will test for TODAY (today's date in the format YYYY-MM-DD)

if [ "$1" = "TODAY" ]; then
  p_Rundate=`date +'%Y-%m-%d'` 
else
 p_Rundate=$1
fi


echo "*************************************************************" 
echo "Checking File Existence" 
echo "*************************************************************"   

ODSM_FILE="$p_InputPath/$p_filename"

if [ -f $ODSM_FILE ]; 
then
   echo "Source file ODSM found: $ODSM_FILE !" 
else
   echo "ERROR: source file ODSM_FILE does not exist or does not match the pattern $ODSM_FILE." 
   exit
fi

#Define the header of the file
header="entry-id;kmMsisdn;serialNumber;kmSubscriptionType;kmSubscriptionType2;kmVoiceTan;kmDataTan;kmPaymentMethod;kmMccsDate;kmCustomerBlocked;kmNetworkOperatorBlocked;kmBlockedNetwork;kmMmpNoStatus;kmMmpM3cCreditLimit;kmMmpM3cStatus;kmMmpM3cStatusDate;kmMmpM3cRegistrationDate;creatorsName;createTimestamp;modifiersName;modifyTimestamp;kmBrandName;objectClass;cn;kmBlockedServices;kmServiceProvider" 
delimiter=";"
number_col=$(grep -o "$delimiter" <<< "$header" | wc -l)
number_col2=`expr "$number_col + 1" | bc`

#Create the new file 
v=$(basename $p_filename)
name=${v%.*}
extension=${v#*.}
p_shortFileName=$name
#Insert Header in file

p_newFileName="${p_PreprocessingPath}/${p_shortFileName}_Transposed.csv"
echo $header > $p_newFileName

#Create the matrix with the columns and their values


declare -A a
#Parse line by line the file
while read -r line;
do  
    var=$line
    #echo $line
    Column_Name=${var%:*}
    Column_Value=${var#*:}
    var="# entry-id"
    if [[ "$Column_Name" == "$var" && $Column_Value -ne 1 ]];
    then
        ((i++))
        if [ $i -gt 0 ];
        then
            z=$(($i-1))
            #Write the previous loaded record

            echo ${a[$z,0]} ${a[$z,1]} ${a[$z,2]} ${a[$z,3]} ${a[$z,4]} ${a[$z,5]} ${a[$z,6]} ${a[$z,7]} ${a[$z,8]} ${a[$z,9]} ${a[$z,10]} ${a[$z,11]} ${a[$z,12]} ${a[$z,13]} ${a[$z,14]} ${a[$z,15]} ${a[$z,16]} ${a[$z,17]} ${a[$z,18]} ${a[$z,19]} ${a[$z,20]} ${a[$z,21]} ${a[$z,22]} ${a[$z,23]} ${a[$z,24]} ${a[$z,25]} >> $p_newFileName

        fi
        c=0
        a[$i,0]=";"
        a[$i,1]=";"
        a[$i,2]=";"
        a[$i,3]=";"
        a[$i,4]=";"
        a[$i,5]=";"
        a[$i,6]=";"
        a[$i,7]=";"
        a[$i,8]=";"
        a[$i,9]=";"
        a[$i,10]=";"
        a[$i,11]=";"
        a[$i,12]=";"
        a[$i,13]=";"
        a[$i,14]=";"
        a[$i,15]=";"
        a[$i,16]=";"
        a[$i,17]=";"
        a[$i,18]=";"
        a[$i,19]=";"
        a[$i,20]=";"
        a[$i,21]=";"
        a[$i,22]=";"
        a[$i,23]=";"
        a[$i,24]=";"
        a[$i,25]=";"
        a[$i,26]=" "

        a[$i,0]="$Column_Value ;"
        #v[$i]=$i

    elif [[ $Column_Name == "kmMsisdn" && $i -gt -1 ]];
    then
        a[$i,1]="$Column_Value ;"
    elif [[ $Column_Name == "serialNumber" && $i -gt -1 ]];
    then
        a[$i,2]="$Column_Value ;"
    elif [[ $Column_Name == "kmSubscriptionType" && $i -gt -1 ]];
    then
        a[$i,3]="$Column_Value ;"
    elif [[ $Column_Name == "kmSubscriptionType2" && $i -gt -1 ]];
    then
        a[$i,4]="$Column_Value ;"
    elif [[ $Column_Name == "kmVoiceTan" && $i -gt -1 ]];
    then
        a[$i,5]="$Column_Value ;"
    elif [[ $Column_Name == "kmDataTan" && $i -gt -1 ]];
    then
        a[$i,6]="$Column_Value ;"
    elif [[ $Column_Name == "kmPaymentMethod" && $i -gt -1 ]];
    then
        a[$i,7]="$Column_Value ;"
    elif [[ $Column_Name == "kmMccsDate" && $i -gt -1 ]];
    then
        a[$i,8]="$Column_Value ;"
    elif [[ $Column_Name == "kmCustomerBlocked" && $i -gt -1 ]];
    then
        a[$i,9]="$Column_Value ;"
    elif [[ $Column_Name == "kmNetworkOperatorBlocked" && $i -gt -1 ]];
    then
        a[$i,10]="$Column_Value ;"
    elif [[ $Column_Name == "kmBlockedNetwork" && $i -gt -1 ]];
    then
        a[$i,11]="$Column_Value ;"
    elif [[ $Column_Name == "kmMmpNoStatus" && $i -gt -1 ]];
    then
        a[$i,12]="$Column_Value ;"
    elif [[ $Column_Name == "kmMmpM3cCreditLimit" && $i -gt -1 ]];
    then
        a[$i,13]="$Column_Value ;"
    elif [[ $Column_Name == "kmMmpM3cStatus" && $i -gt -1 ]];
    then
        a[$i,14]="$Column_Value ;"
    elif [[ $Column_Name == "kmMmpM3cStatusDate" && $i -gt -1 ]];
    then
        a[$i,15]="$Column_Value ;"
    elif [[ $Column_Name == "kmMmpM3cRegistrationDate" && $i -gt -1 ]];
    then
        a[$i,16]="$Column_Value ;"
    elif [[ $Column_Name == "creatorsName" && $i -gt -1 ]];
    then
        a[$i,17]="$Column_Value ;"
    elif [[ $Column_Name == "createTimestamp" && $i -gt -1 ]];
    then
        a[$i,18]="$Column_Value ;"
    elif [[ $Column_Name == "modifiersName" && $i -gt -1 ]];
    then
        a[$i,19]="$Column_Value ;"
    elif [[ $Column_Name == "modifyTimestamp" && $i -gt -1 ]];
    then
        a[$i,20]="$Column_Value ;"
    elif [[ $Column_Name == "kmBrandName" && $i -gt -1 ]];
    then
        a[$i,21]="$Column_Value ;"
    elif [[ $Column_Name == "objectClass" && $i -gt -1 ]];
    then
        if [ $c -eq 0 ];
        then 
        a[$i,22]="$Column_Value ;"
        ((c++))
        else
        a[$i,22]="$Column_Value+${a[$i,22]}"
        ((c++))
        fi
    elif [[ $Column_Name == "cn" && $i -gt -1 ]];
    then
        a[$i,23]="$Column_Value ;"
    elif [[ $Column_Name == "kmBlockedServices" && $i -gt -1 ]];
    then
        a[$i,24]="$Column_Value ;"
    elif [[ $Column_Name == "kmServiceProvider" && $i -gt -1 ]];
    then
        a[$i,25]="$Column_Value "
    fi
done < $ODSM_FILE 
#Write the last line of the matrix
echo ${a[$i,0]} ${a[$i,1]} ${a[$i,2]} ${a[$i,3]} ${a[$i,4]} ${a[$i,5]} ${a[$i,6]} ${a[$i,7]} ${a[$i,8]} ${a[$i,9]} ${a[$i,10]} ${a[$i,11]} ${a[$i,12]} ${a[$i,13]} ${a[$i,14]} ${a[$i,15]} ${a[$i,16]} ${a[$i,17]} ${a[$i,18]} ${a[$i,19]} ${a[$i,20]} ${a[$i,21]} ${a[$i,22]} ${a[$i,23]} ${a[$i,24]} ${a[$i,25]} >> $p_newFileName


echo "Created transposed file:  $p_newFileName ."

ls_current_date2=`date +'%Y-%m-%d'`
ls_current_time2=`date +'%H%M%S'`
ls_run_name2="${ls_current_date2}"_"${ls_current_time2}"
echo "Completed " 
echo "End time : $ls_run_name2 " >> "${p_PreprocessingPath}/log.txt"

Below you can find a sample of the file (entry 1 is the header of the file and I do not need it at all).

version: 1

# entry-id: 1
dn: ou=CONNECTIONS,c=NL,o=Mobile
modifyTimestamp: 20130223124344Z
modifiersName: cn=directory manager
aci: (targetattr = "*") 

# entry-id: 3
dn: kmmsisdn=31653440000,ou=CONNECTIONS,c=NL,o=Mobile
modifyTimestamp: 20331210121726Z
modifiersName: cn=directory manager
cn: MCCS
kmBrandName: VOID
kmBlockedNetwork: N
kmNetworkOperatorBlocked: N
kmCustomerBlocked: N
kmMsisdn: 31653440000
objectClass: top
objectClass: device
objectClass: kmConnection
serialNumber: 204084400000000
kmServiceProvider: 1
kmVoiceTan: 25
kmSubscriptionType: FLEXI
kmPaymentMethod: ABO
kmMccsDate: 22/03/2004
nsUniqueId: 2b72cfe9-f8b221d9-80088800-00000000

# entry-id: 4
dn: kmmsisdn=31153128215,ou=CONNECTIONS,c=NL,o=Mobile
modifyTimestamp: 22231210103328Z
modifiersName: cn=directory manager
cn: MCCS
kmMmpM3cStatusDate: 12/01/2012
kmMmpM3cStatus: Potential
kmBrandName: VOID
kmBlockedNetwork: N
kmNetworkOperatorBlocked: N
kmCustomerBlocked: N
kmMsisdn: 31153128215
objectClass: top
objectClass: device
objectClass: kmConnection
objectClass: kmMultiMediaPortalService
serialNumber: 214283011000000
kmServiceProvider: 1
kmVoiceTan: 25
kmSubscriptionType: FLEXI
kmPaymentMethod: ABO
kmMccsDate: 22/03/2004
nsUniqueId: 92723fea-f8e211d9-8011000-01110000

If this is not achievable with shell scripting, can you please suggest something that would do it faster (perl, python)? I don't know any other scripting language, but I can learn :)

The shell script looks pretty good; you're using shell built-in features for parsing instead of calling external programs. Unfortunately, you've run up against the fact that shell just isn't that fast for processing large amounts of data. You'll be better off writing this in another language.
– chepner, Commented Sep 2, 2015 at 13:34
Does it help a little to use a switch (case "${Column_Name}" in ..)? Can you also remove the test on $i by parsing the lines before the first # entry-id before entering the loop? Do you also have a lot of unused attributes (preprocess using grep, or use continue in the loop when 25 columns are filled)? var="# entry-id" can be moved above the loop when you use another variable name for it. And how about while IFS=: read -r Column_Name Column_Value?
– Walter A, Commented Sep 2, 2015 at 15:23
Can you use an array with 1 row (or a set of variables)? After filling the values for i and doing ((i++)), you never access the old rows. The array will take a lot of memory.
– Walter A, Commented Sep 2, 2015 at 15:30
I have tried your last suggestion... still unacceptable. The other changes you are proposing don't have such a big impact on the performance, I believe.
– bmcristi, Commented Sep 3, 2015 at 7:39
Which shell matters. ksh93, for instance, is far (far!) faster than bash.
– Charles Duffy, Commented Sep 3, 2015 at 18:52

Peter Cordes · Accepted Answer · 2015-09-04 12:05:00Z

2

I said in comments that shell read is slow, and so is opening the output once per record.

Your shell-script version looks like it never empties its associative array, but also never reuses old entries. So eventually your shell will be using a huge amount of memory, because it keys each record's entries to a record counter.

You're just re-formatting records from blocks separated by empty lines to single lines with fields separated by spaces. This isn't hard, and doesn't require keeping previous records in memory.

I was thinking along the same lines as Walter A. This awk program is most of the way to solving the problem.

Note the delete a after printing the record into a csv line, to clear the fields.

awk   -vOFS=' ; ' -F'\\s*:\\s*' '/^#/{print; this_is_for_debugging }
    function output_rec(){ print a["kmMsisdn"], a["serialNumber"], a["kmSubscriptionType"], a["objectClass"] }
    /^$/ { output_rec(); delete a;next}
    END  { output_rec() }
    {  sub(/\s+$/, "", $2);  # strip trailing whitespace if needed
       if ($1 == "objectClass" && a[$1] )
           { a[$1]= (a[$1] "+" $2) } else { a[$1]=$2; }
    }' foo.etl

I'll leave it up to you to print the rest of the fields. (They're already getting parsed, by the a[$1] = $2 statement, in the else block of the "objectClass" condition.)

Splitting on whitespace*:whitespace* means we don't have to bother stripping whitespace at the start of the 2nd field. Apparently the -F arg needs doubled backslashes. It's probably a good idea to add a check that NF <= 2, to make sure there aren't any lines with multiple :.

Output for your sample input

 ;  ;  ; 
# entry-id: 1
 ;  ;  ; 
# entry-id: 3
31653440000 ; 204084400000000 ; FLEXI ; top+device+kmConnection
# entry-id: 4
31153128215 ; 214283011000000 ; FLEXI ; top+device+kmConnection+kmMultiMediaPortalService

To avoid data duplication between printing the header line and printing the fields, you could put the field names in an array, and loop over them in both places.
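Since Python was mentioned as an option, here is a minimal sketch of that idea in Python rather than awk. The column names live in one list, and both the header and every data row are generated from it. The names FIELDS, SEP, header_line and record_line are invented for this illustration, and the list is shortened to four columns; the ` ; ` separator mirrors the OFS used above.

```python
# Hypothetical, shortened column list; the real file has 26 columns.
FIELDS = ["kmMsisdn", "serialNumber", "kmSubscriptionType", "objectClass"]
SEP = " ; "


def header_line(fields=FIELDS):
    """Build the header from the same list that drives the data rows."""
    return SEP.join(fields)


def record_line(record, fields=FIELDS):
    """Emit one row; attributes missing from the record become empty columns."""
    return SEP.join(record.get(name, "") for name in fields)


# Example: a record that only has two of the four attributes.
sample = {"kmMsisdn": "31653440000", "kmSubscriptionType": "FLEXI"}
print(header_line())
print(record_line(sample))
```

Adding or reordering a column then means touching only the list, not the header string and the print statement separately.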

I was originally thinking that -v RS='\n\n' would be useful, to make every block an AWK record. Actually, that might still be useful, with FS='\n'. Then you can loop over fields (lines of each record), and split it on :. If it's impossible for a record to contain a :, like your shell script assumes, then the splitting is easy with split (same as we're doing with -F to set FS).
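The "one block per record" idea can also be sketched in Python, which has no direct equivalent of RS, so the grouping on blank lines is done by hand. This is only an illustration under the same assumptions as the awk code (blocks separated by empty lines, repeated objectClass values joined with +); iter_records is a name made up for this example.

```python
def iter_records(lines):
    """Yield one dict per block of non-empty lines; blank lines end a block."""
    record = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            if record:
                yield record
                record = {}  # start fresh, like `delete a` in the awk version
            continue
        name, sep, value = line.partition(":")  # split on the first ':' only
        if not sep:
            continue  # not a "name: value" line
        name, value = name.strip(), value.strip()
        if name == "objectClass" and name in record:
            record[name] += "+" + value  # repeated attribute: join with '+'
        else:
            record[name] = value
    if record:
        yield record  # the last block may not be followed by a blank line
```

Because it is a generator, only the current record is held in memory, so it could be fed an open file object directly instead of a list of lines.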

(In your shell version, use Column_Name=${var%%:*} to remove the longest suffix (including all the :s), instead of the shortest. Or use IFS=: read Column_Name Column_Value)

This might be better written in perl, since it's getting bulky for an awk program. perl would make it easier to do the splitting on only the first : on the line.
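To make the first-colon versus last-colon point concrete, here is a small Python sketch. The input line is hypothetical (nothing in the sample data contains a second colon), and split_first and split_last are names invented for the illustration.

```python
def split_first(line):
    """Split at the first ':'; any later colons stay in the value."""
    name, _, value = line.partition(":")
    return name.strip(), value.strip()


def split_last(line):
    """Split at the last ':'; earlier colons end up in the name."""
    name, _, value = line.rpartition(":")
    return name.strip(), value.strip()


# Hypothetical attribute whose value itself contains a colon.
line = "label: 12:30"
print(split_first(line))
print(split_last(line))
```

With a value that contains a colon, only the first-colon split keeps the attribute name intact, which is what the ${var%%:*} suggestion achieves in the shell version.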

edited Sep 4, 2015 at 12:05

answered Sep 3, 2015 at 9:11

Peter Cordes


7 Comments

bmcristi Over a year ago

Thanks! This works, with some minor issues. When I put all the fields I get a really long empty string after "N" on record 3 on the column kmNetworkOperatorBlocked. Also, entry-id must be put on the same level with the other fields : 3;31653440000;FLEXI....

Peter Cordes Over a year ago

@bmcristi: entry-id lines are printed verbatim for debugging, to make it easier to see which empty line triggered the printing. The /^#/ {print} rule matches lines that start with a #. Re: long whitespace: your input file probably has that much literal whitespace. You could remove leading/trailing whitespace with sub(/^\s*/, "", $2); sub(/\s*$/, "", $2);. Or in perl, chomp($fieldval). Probably getting field splitting to eat the whitespace is better: FS='\s*:\s*'

bmcristi Over a year ago

Thanks a lot again . It works great. I achieved 6s for 1.000.000 rows so it should do the 500 mil in about an hour or less

Peter Cordes Over a year ago

@bmcristi: cheers. Don't forget to accept this answer, if you think it's the best one. (checkbox under the up/down vote arrows).

bmcristi Over a year ago

After testing this version I found that the objectClass and kmMmpM3cStatus columns are not populated.


bmcristi · Accepted Answer · 2015-09-04 10:31:25Z

1

awk -vOFS=' ; ' -F: '
 function output_rec(){ gsub(/[ \t]+$/, "",$2);
 print a["entry-id"],a["kmMsisdn"],a["kmSubscriptionType"],a["kmSubscriptionType2"],a["kmVoiceTan"],a["kmDataTan"],a["kmPaymentMethod"],a["kmMccsDate"],a["kmCustomerBlocked"],a["kmNetworkOperatorBlocked"],a["kmBlockedNetwork"],a["kmMmpNoStatus"],a["kmMmpM3cCreditLimit"],a["kmMmpM3cStatus"],a["kmMmpM3cStatusDate"],a["kmMmpM3cRegistrationDate"],a["creatorsName"],a["createTimestamp"],a["modifiersName"],a["modifyTimestamp"],a["kmBrandName"],a["objectClass"],a["cn"],a["kmBlockedServices"],a["kmServiceProvider"]}
 /entry-id/ {output_rec(); delete a;a["entry-id"]=$2;next}
 END  { output_rec() }
   {gsub(/[ \t]+$/, "",$2);
   if ($1 == "objectClass" && a[$1] ) { a[$1]= (a[$1]"+"$2) } else { a[$1]=$2; } }' $ODSM_FILE >> $p_newFileName

edited Sep 4, 2015 at 10:31

answered Sep 4, 2015 at 10:06

bmcristi


5 Comments

bmcristi Over a year ago

I used gsub(/[ \t]+$/, "",$2); before the if to integrate the removal of spaces. Edited the post with the final version.

Peter Cordes Over a year ago

My awk -F'\\s*:\\s*' handles space at the beginning of fields. You left that out, and just split fields on :. Your gsub only matches at the ends of strings. Also, putting it in output_rec() is weird. You're calling output_rec() on entry-id lines, so if you want $2 modified for those lines, you should put the call next to where you use $2 in that rule.

bmcristi Over a year ago

Guys, for your information, I managed to do the transposing in 28 m 25 s for 531.280.227 rows in the source file (it produced 17.422.327 transposed records) and reduced the file size from around 13 GB to around 4 GB. It works just fine for what I need. Many thanks to you, especially Peter and Walter.

Walter A Over a year ago

Very nice, 200 times faster than your first script! Don't forget to accept the answer you think is the best one (and/or upvote other answers you like).

bmcristi Over a year ago

@Walter Ok. But I need to reach reputation 15 before it displays :)

Sobrique · Accepted Answer · 2015-09-04 13:55:05Z

1

With perl I'd approach it like this:

#!/usr/bin/env perl
use strict;
use warnings;

use Text::CSV;

#configure output columns and ordering.
my @output_cols = qw (
    entry-id kmMsisdn serialNumber
    kmSubscriptionType kmSubscriptionType2
    kmVoiceTan kmDataTan kmPaymentMethod
    kmMccsDate kmCustomerBlocked
    kmNetworkOperatorBlocked kmBlockedNetwork
    kmMmpNoStatus kmMmpM3cCreditLimit
    kmMmpM3cStatus kmMmpM3cStatusDate
    kmMmpM3cRegistrationDate creatorsName
    createTimestamp modifiersName
    modifyTimestamp kmBrandName
    objectClass cn
    kmBlockedServices kmServiceProvider
);

#set up our csv engine - separator of ';' particularly. 
#eol will put a linefeed after each line (might want "\r\n" on DOS)
my $csv = Text::CSV->new(
    {   sep_char => ';',
        eol      => "\n",
        binary   => 1
    }
);

#open output
open( my $output, '>', 'output_file.csv' ) or die $!;
#print header row. 
$csv->print( $output, \@output_cols );
#set columns, so print_hr knows ordering. 
$csv->column_names(@output_cols);

#set record separator to double linefeed
local $/ = "\n\n";

#iterate the 'magic' filehandle. 
#this either reads data piped on `STDIN` _or_ a list of files specified on 
#command line. 
#e.g. myscript.pl file_to_process 
#or 
#cat file_to_process | myscript.pl
#this thus emulates awk/grep/sed etc.
#NB works one record at a time - so a chunk all the way to a double line feed. 

while (<>) {
    #pattern match the key-value pairs on this chunk of data (record).
    #multi-line block.
    #because this regex will return a list of paired values (note - "g" and "m" flags), we can
    #insert it directly into a hash (associative array)
    my %row = m/^(?:# )?([-\w]+): (.*)$/mg;

    #skip if this row is incomplete. Might need to be entry-id? 
    next unless $row{'kmMsisdn'};
    $csv->print_hr( $output, \%row );
}
close ( $output );

This generates:

entry-id;kmMsisdn;serialNumber;kmSubscriptionType;kmSubscriptionType2;kmVoiceTan;kmDataTan;kmPaymentMethod;kmMccsDate;kmCustomerBlocked;kmNetworkOperatorBlocked;kmBlockedNetwork;kmMmpNoStatus;kmMmpM3cCreditLimit;kmMmpM3cStatus;kmMmpM3cStatusDate;kmMmpM3cRegistrationDate;creatorsName;createTimestamp;modifiersName;modifyTimestamp;kmBrandName;objectClass;cn;kmBlockedServices;kmServiceProvider
3;31653440000;204084400000000;FLEXI;;25;;ABO;22/03/2004;N;N;N;;;;;;;;"cn=directory manager";20331210121726Z;VOID;kmConnection;MCCS;;1
4;31153128215;214283011000000;FLEXI;;25;;ABO;22/03/2004;N;N;N;;;Potential;12/01/2012;;;;"cn=directory manager";22231210103328Z;VOID;kmMultiMediaPortalService;MCCS;;1

Note: because we're using while ( <> ) {, this script can be used the way you would use awk or sed. perl's <> operator reads either:

- data piped in on STDIN, or
- files named on the command line, which it opens and reads in turn.

So you can:

./myscript.pl filename1 filename2

or

somecommand_to_generate_data | ./myscript.pl
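For readers leaning towards Python instead, the standard library's fileinput module gives similar "files on the command line, otherwise STDIN" behaviour. The sketch below only counts lines to show the mechanism; count_lines is a name made up for this example, not part of the perl answer.

```python
import fileinput


def count_lines(files=None):
    """Count lines from the named files.

    With files=None, fileinput falls back to the paths in sys.argv[1:],
    and to standard input when no paths were given.
    """
    total = 0
    with fileinput.input(files=files) as stream:
        for _ in stream:
            total += 1
    return total
```

Calling count_lines() with no argument from a script therefore accepts both invocation styles shown above.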

edited Sep 4, 2015 at 13:55

answered Sep 4, 2015 at 10:33

Sobrique


4 Comments

Peter Cordes Over a year ago

Looks pretty good, and less clunky than my awk version. I thought about switching to perl part way through. Why did you write it with the data embedded in the script? The OP said he might be interested in learning perl, but I think that's going to be confusing. Also, IDK if we should assume that kmMsisdn will always be present in every record. The OP is using the # entry-id: 1 lines in his current awk attempt, so I guess we can assume that every record has one of those. Or you could just check if the hash was empty, right?

Sobrique Over a year ago

I embedded it because that way it's self-contained - I can just run it in my IDE and fiddle around to get it working. It also works nicely as a drop-in replacement for while ( <> ) {, which I like for its functionality, but I think it's even less obvious what it's doing ;). I did look at entry-id, but the first chunk doesn't really have anything interesting in it, whereas 3 and 4 look more like records. Hopefully that's sufficiently obvious though.

Sobrique Over a year ago

Edited to make a little more 'ready for use'. (And comments, because you can never have enough comments)

bmcristi Over a year ago

Thanks for this too ! Still, I have gone with the awk solution.

Anthony Geoghegan · Accepted Answer · 2015-09-02 13:34:50Z

0

That’s an impressive shell script but the problem you’re solving is not a good fit for a traditional shell script. I’d imagine that using echo and output redirection for all the file writes would dramatically slow things down. With a proper programming language, you could buffer your file writes – and read in more than one line at a time.

You’ve already mentioned Perl and Python and these are exactly what I’d suggest. Both languages are used by system administrators, though Python seems to be more favoured by data scientists. I’ve learned both and Python is also my personal favourite as I like its syntax, similarity to pseudo-code and how most of its libraries that I’ve used were easy to use – and read.
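As a rough illustration of the buffering point, here is a Python sketch that opens the input and the output exactly once and lets the runtime buffer both. The function name copy_filtered, the 1 MiB write buffer and the UTF-8 encoding are assumptions made for the example; it filters lines rather than transposing them, to keep the focus on the I/O pattern.

```python
def copy_filtered(src_path, dst_path, wanted_prefix):
    """Stream src to dst, keeping only lines that start with wanted_prefix.

    Both files are opened once. Reads and writes go through in-memory
    buffers, so there is no open/close per record and no system call
    per character.
    """
    kept = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8", buffering=1024 * 1024) as dst:
        for line in src:  # iterates line by line over buffered reads
            if line.startswith(wanted_prefix):
                dst.write(line)  # lands in the buffer, flushed in large chunks
                kept += 1
    return kept
```

The same shape applies to the transposition itself: keep the output handle open for the whole run instead of appending with a redirection inside the loop.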

Good luck with learning whichever language you choose. (A discussion on which language would be best would probably lead to this question being closed for being too opinion-based).

answered Sep 2, 2015 at 13:34

Anthony Geoghegan


7 Comments

bmcristi Over a year ago

The thing is that I am transposing to a CSV in order to make the file readable by an ETL graph that stores the data in a DB table. The file has the extension .LDIF. If there is no way of loading it using scripting + ETL, is there any other way, SQL loader or something like that? The initial request is to load it into the DB in one day and perform validations on the fields (data may be corrupt), but I guess that if it is not possible in that time frame, the validations can be dropped (as long as the reading process won't fail).

Peter Cordes Over a year ago

@AnthonyGeoghegan: shell read is far worse than reading one line at a time. It doesn't know where line boundaries are, and it has to avoid overshoot. It makes one read system call per character. This is part of why bash is far slower than something like awk for text processing. Pure-bash is great when you can modify the values of variables faster than the overhead of running a process, but it's not the way to go for bulk text.

Peter Cordes Over a year ago

@bmcristi: You already wrote a program to do what you need in pure bash, with just some text processing. Why would awk/perl/python/some other scripting language have any trouble doing the same thing, but with buffered I/O? (Besides shell read being fatally slow, you're closing/re-opening the output file every line, because the redirection is inside the loop).

Anthony Geoghegan Over a year ago

@PeterCordes Thanks for that info. I've never actually used read for reading from anything other than a terminal. I've always used sed, awk or Python for reading data from files so I didn't realise that read was so bad.

Peter Cordes Over a year ago

If you ever have to read a data stream in a shell loop, see mywiki.wooledge.org/BashFAQ/001 for how to not mess up your data. :P bmcristi has it right, other than leaving IFS at the default (which may be intentional, to trim leading and trailing whitespace). The OP's code does omit quoting filenames, and arguments to echo, though.


Walter A · Accepted Answer · 2015-09-03 20:11:34Z

0

You can try it with awk.
awk has associative arrays, so you can use something like -F: '{row[$1]=$2}' for normal rows. You can print/reset when you have a new set.

'/entry-id/ {print row["kmMsisdn"], " , ", row["serialNumber"], " , ", row["kmSubscriptionType"]}'

and when supported make the array empty by deleting it {delete row}.

It should be a lot faster than your current version.
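The same "fill an associative array, print and reset at each entry-id" outline translates almost directly to a Python dict. This is a sketch under the assumption that every record starts with a line containing entry-id; rows_by_entry_id and the fields argument are names invented for the example.

```python
def rows_by_entry_id(lines, fields):
    """Yield one list of column values per record.

    A line whose name contains 'entry-id' starts a new record: the previous
    record is emitted and the dict is replaced, like `delete row` in awk.
    Lines before the first entry-id are ignored.
    """
    row = None
    for raw in lines:
        name, sep, value = raw.partition(":")
        if not sep:
            continue  # blank line or anything without a ':'
        name, value = name.strip(), value.strip()
        if "entry-id" in name:
            if row is not None:
                yield [row.get(f, "") for f in fields]
            row = {"entry-id": value}
        elif row is not None:
            row[name] = value
    if row is not None:
        yield [row.get(f, "") for f in fields]  # emit the final record
```

Each yielded list can be joined with the chosen delimiter, or handed to a CSV writer, before being written out.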

EDIT

I looked at @Peter's answer and just edited his solution.
I added the other fields, used $ODSM_FILE and $p_newFileName, and changed the logic for finding a new record: a record now starts at each line containing entry-id.
Credit to Peter for the awk code and his explanation.

  awk -vOFS=' ; ' -F'\\s*:\\s*' ' BEGIN {
                a["entry-id"]="entry-id"; 
                a["kmMsisdn"]="kmMsisdn"; 
                a["serialNumber"]="serialNumber";
                a["kmSubscriptionType"]="kmSubscriptionType";
                a["kmSubscriptionType2"]="kmSubscriptionType2";
                a["kmVoiceTan"]="kmVoiceTan";                  
                a["kmDataTan"]="kmDataTan";                    
                a["kmPaymentMethod"]="kmPaymentMethod";        
                a["kmMccsDate"]="kmMccsDate";                  
                a["kmCustomerBlocked"]="kmCustomerBlocked";    
                a["kmNetworkOperatorBlocked"]="kmNetworkOperatorBlocked";
                a["kmBlockedNetwork"]="kmBlockedNetwork";                
                a["kmMmpNoStatus"]="kmMmpNoStatus";                      
                a["kmMmpM3cCreditLimit"]="kmMmpM3cCreditLimit";          
                a["kmMmpM3cStatus"]="kmMmpM3cStatus";                    
                a["kmMmpM3cStatusDate"]="kmMmpM3cStatusDate";            
                a["kmMmpM3cRegistrationDate"]="kmMmpM3cRegistrationDate";
                a["creatorsName"]="creatorsName";                        
                a["createTimestamp"]="createTimestamp";                  
                a["modifiersName"]="modifiersName";
                a["modifyTimestamp"]="modifyTimestamp";
                a["kmBrandName"]="kmBrandName";
                a["objectClass"]="objectClass";
                a["cn"]="cn";
                a["kmBlockedServices"]="kmBlockedServices";
                a["kmServiceProvider"]="kmServiceProvider";
        }
        function output_rec(){ print a["entry-id"],
                a["kmMsisdn"],
                a["serialNumber"],
                a["kmSubscriptionType"],
                a["kmSubscriptionType2"],
                a["kmVoiceTan"],
                a["kmDataTan"],
                a["kmPaymentMethod"],
                a["kmMccsDate"],
                a["kmCustomerBlocked"],
                a["kmNetworkOperatorBlocked"],
                a["kmBlockedNetwork"],
                a["kmMmpNoStatus"],
                a["kmMmpM3cCreditLimit"],
                a["kmMmpM3cStatus"],
                a["kmMmpM3cStatusDate"],
                a["kmMmpM3cRegistrationDate"],
                a["creatorsName"],
                a["createTimestamp"],
                a["modifiersName"],
                a["modifyTimestamp"],
                a["kmBrandName"],
                a["objectClass"],
                a["cn"],
                a["kmBlockedServices"],
                a["kmServiceProvider"] }
        END  { output_rec() }
        /^$/ { next }
        /entry-id/ {output_rec();delete a; a["entry-id"]=$2;next}
        {
           sub(/\s*$/, "", $2); # strip trailing whitespace
           if ($1 == "objectClass" && a[$1] != "") { a[$1]= (a[$1]"+"$2) } else { a[$1]=$2; }  # guard avoids a leading "+"
        }'  $ODSM_FILE > $p_newFileName

I tested it with 25.000 lines of data, and the awk code was 30 times faster than the original code. For an input file of 11 million lines, the awk solution needed 40s on my system.
@Peter: Good job!

edited Sep 3, 2015 at 20:11

answered Sep 2, 2015 at 20:20

Walter A


8 Comments

bmcristi Over a year ago

Will try that. Thanks!

bmcristi Over a year ago

Can you please give me the full syntax for using your suggestion?

Peter Cordes Over a year ago

Use awk -v OFS=' , ' so output fields are separated by commas without having to include them literally as args to print. @Walter: your code is a lot more readable than the OP's. An associative array makes it easy to see he's just putting fields in order, as well as onto one line.

Walter A Over a year ago

I can try to write it out in about 12 hours, or maybe someone else can do it now. I am curious about your original script: will it be a lot faster when you change ((i++)) into i=0 (saving memory and still skipping the first lines)? And (see the comment by @Cordes on another answer) move the redirection of the echo from the loop to the done statement: done >> ${p_newFileName} < ${ODSM_FILE}

Walter A Over a year ago

I was glad to see the Solution of Peter and made some small changes.
