mac unix script problem

Question

I'm trying to write a script that breaks up a VERY large file into smaller pieces, each of which is then handed to a script that runs in the background. The motivation is that if each piece is processed in the background, the pieces can be processed in parallel.

Here is my code. ./seq works just like the normal seq command (which the Mac doesn't have), and $1 is the huge file to be split.

echo "Splitting and Running Script"

for i in $(./seq 0 14000000 500000)
do
   awk ' { if (NR>='$i' && NR<'$(($i+500000))') { print $0 > "xPart'$i'" }  }' $1 
   python FastQ2Seq.py xPart$i &
done

wait

echo "Concatenating"

for k in *.out.seq
do
cat $k >> original.seq
done

for j in *.out.qul
do
cat $j >> original.qul
done

echo "Cleaning"
rm xPart*

My problem is that only xPart0 is made and it only has 499995 lines in it before the program hangs. I put some debugging echos in the script and I know the awk statement is what stops the script. I just can't figure out what's going wrong.

Instead of seq, OS X has jot. Or, in Bash, for ((i=0; i<=14000000; i+=500000))
– Dennis Williamson, Commented Feb 19, 2010 at 10:36
split is way too slow. My file is 3.6GB, split can't handle it.
– ACEnglish, Commented Feb 22, 2010 at 16:05

Steven Schlansker · Accepted Answer · 2010-02-19 07:14:30Z

1

Check out the split command --

  split -- split a file into pieces

  Output  fixed-size  pieces of INPUT to PREFIXaa, PREFIXab, ...; default
  size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or  when
  INPUT is -, read standard input.

It should be much faster, more reliable, and cleaner than running awk in a loop!
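As a minimal sketch of how split could feed the background jobs, assuming a POSIX shell: process_piece is a hypothetical stand-in for python FastQ2Seq.py (it only counts lines), and the function name split_and_run and its work-directory argument are made up for illustration. The xPart prefix follows the question.

```shell
# Hypothetical stand-in for `python FastQ2Seq.py "$piece"`: it only counts the
# lines of the piece and writes the count to "$piece.out".
process_piece() {
    wc -l < "$1" | tr -d ' ' > "$1.out"
}

# Split file $1 into pieces of $2 lines inside directory $3, start one
# background job per piece, and wait for all of them to finish.
split_and_run() {
    input=$1
    lines=$2
    workdir=$3
    split -l "$lines" "$input" "$workdir/xPart"
    # The glob is expanded once, before any job starts, so the .out files
    # created by the jobs are not picked up as pieces.
    for piece in "$workdir"/xPart*
    do
        process_piece "$piece" &
    done
    wait    # block until every background job has exited
}
```

Called as split_and_run "$1" 500000 some_dir, this would produce some_dir/xPartaa, some_dir/xPartab, and so on, each processed in parallel. Replacing the body of process_piece with the real script call is the only change the sketch should need.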

answered Feb 19, 2010 at 7:14

Steven Schlansker


1 Comment

Ignacio Vazquez-Abrams Over a year ago

Tch. At least point to the right man page :P developer.apple.com/Mac/library/documentation/Darwin/Reference/…

ghostdog74 · Accepted Answer · 2010-02-19 07:33:05Z

0

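echo "Splitting and Running Script"
# Split into smaller files of 50000 lines each, if I understand your problem correctly.
# Each finished piece is closed so awk does not run out of open files, and the
# output name is built in parentheses so every awk parses the redirection the same way.
awk 'NR%50000==1{if (out) close(out); out = ("xPart" (++c) ".txt")} {print > out}' file
# or use split -l 50000
for file in xPart*
do
    python FastQ2Seq.py "$file" &
done
# Wait for all background jobs to finish before concatenating their output
wait
echo "Concatenating"
cat *.out.seq >> original.seq
cat *.out.qul >> original.qul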
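The one-pass awk split can also be wrapped in a small function so the piece size is a parameter. This is only a sketch: the name split_by_lines and its prefix argument are illustrative, not part of the answer. It closes each finished piece before opening the next one, since a 14-million-line file yields many pieces and awk can only keep a limited number of files open at once.

```shell
# Split file $1 into pieces of $2 lines each, named ${3}1, ${3}2, ...
# (function name and prefix argument are illustrative).
split_by_lines() {
    awk -v n="$2" -v prefix="$3" '
        (NR - 1) % n == 0 {            # first line of a new piece
            if (out != "") close(out)  # release the finished piece
            out = prefix (++c)
        }
        { print > out }
    ' "$1"
}
```

For the question's file this would be called as split_by_lines "$1" 500000 xPart. The whole input is read exactly once, regardless of how many pieces are produced.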

edited Feb 19, 2010 at 7:33

answered Feb 19, 2010 at 7:27

ghostdog74


1 Comment

ACEnglish Over a year ago

This was really close. I ended up doing awk '{if (NR%500000==1){++c}{print $0 > "xPart"c}}' $1

R Samuel Klatchko · Accepted Answer · 2010-02-19 07:47:51Z

0

If your seq truly works like the standard seq, you're calling it wrong. The proper command line for seq is:

seq FIRST INCREMENT LAST

So you would need to change your seq command line to:

seq 0 500000 14000000
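To see why the argument order matters, here is a small sketch that mimics seq FIRST INCREMENT LAST for non-negative integers in plain shell. The function name count_by is made up for illustration, and the sketch assumes the custom ./seq follows the standard semantics. With the question's order, the increment is 14000000 and the upper bound is 500000, so only 0 is produced, which would be consistent with only xPart0 being created.

```shell
# Print FIRST, FIRST+INCREMENT, ... up to and including LAST,
# mirroring `seq FIRST INCREMENT LAST` for non-negative integers.
count_by() {
    i=$1
    while [ "$i" -le "$3" ]
    do
        echo "$i"
        i=$((i + $2))
    done
}

count_by 0 14000000 500000   # arguments in the question's order: prints only 0
count_by 0 500000 1500000    # increment in the middle: prints 0 500000 1000000 1500000, one per line
```

Since count_by needs no external command, it could also replace ./seq in the loop, as in for i in $(count_by 0 500000 14000000).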

answered Feb 19, 2010 at 7:47

R Samuel Klatchko

