
I'm trying to write a script that breaks up a VERY large file into smaller pieces, each of which is then passed to a script that runs in the background. The motivation is that if the script runs in the background, I can process the pieces in parallel.

Here is my code. ./seq works just like the normal seq command (which the Mac doesn't have), and $1 is the huge file to be split.

echo "Splitting and Running Script"

for i in $(./seq 0 14000000 500000)
do
   awk ' { if (NR>='$i' && NR<'$(($i+500000))') { print $0 > "xPart'$i'" }  }' $1 
   python FastQ2Seq.py xPart$i &
done

wait

echo "Concatenating"

for k in *.out.seq
do
cat $k >> original.seq
done

for j in *.out.qul
do
cat $j >> original.qul
done

echo "Cleaning"
rm xPart*

My problem is that only xPart0 is made and it only has 499995 lines in it before the program hangs. I put some debugging echos in the script and I know the awk statement is what stops the script. I just can't figure out what's going wrong.

  • Any reason you can't use split -l 500000? Commented Feb 19, 2010 at 7:13
  • Instead of seq, OS X has jot. Or, in Bash, for ((i=0; i<=14000000; i+=500000)) Commented Feb 19, 2010 at 10:36
  • split is way too slow. My file is 3.6GB, split can't handle it. Commented Feb 22, 2010 at 16:05

3 Answers


Check out the split command:

  split -- split a file into pieces

  Output  fixed-size  pieces of INPUT to PREFIXaa, PREFIXab, ...; default
  size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or  when
  INPUT is -, read standard input.

It should be much faster, more reliable, and cleaner than running awk in a loop!
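
As a minimal sketch of that suggestion, assuming line-based pieces and the xPart prefix from the question: split_into_pieces is a name made up here for illustration, not part of any tool. Because split names the pieces PREFIXaa, PREFIXab, and so on, a plain glob such as xPart* returns them in their original order.

```shell
#!/bin/sh
# split_into_pieces is a hypothetical helper name, used only for illustration.
# Usage: split_into_pieces INPUT_FILE LINES_PER_PIECE PREFIX
split_into_pieces() {
    # -l N puts N lines in each piece; the last piece holds the remainder.
    # Pieces are named PREFIXaa, PREFIXab, PREFIXac, ...
    split -l "$2" "$1" "$3"
}
```

For the file in the question, the call would look like split_into_pieces "$1" 500000 xPart.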


1 Comment

Tch. At least point to the right man page :P developer.apple.com/Mac/library/documentation/Darwin/Reference/…
0
echo "Splitting and Running Script"
# Split into smaller files of 50000 lines each, if I understand your problem correctly
awk 'NR%50000==1{++c}{print $0 > ("xPart" c ".txt")}' file
# or use split -l 50000
for file in xPart*
do
    python FastQ2Seq.py "$file" &
done
# Wait for all background jobs to finish before concatenating their output
wait
echo "Concatenating"
cat *.out.seq >> original.seq
cat *.out.qul >> original.qul
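
One thing to watch with numbered pieces is glob order: xPart10 sorts before xPart2, so a plain cat of the outputs may not follow the original line order. Here is a hedged sketch of one way around that, using zero-padded piece names. split_numbered and run_parallel are hypothetical helper names, the four-digit padding is an arbitrary choice, and the worker is whatever command you pass in (it is assumed to take a piece's filename as its only argument).

```shell
#!/bin/sh
# Hypothetical helper names, for illustration only.

# split_numbered INPUT LINES_PER_PIECE PREFIX
# Writes PREFIX0001, PREFIX0002, ... so that glob order matches file order.
split_numbered() {
    awk -v n="$2" -v prefix="$3" '
        (NR - 1) % n == 0 {                # first line of a new piece
            if (out != "") close(out)      # keep only one output file open
            out = sprintf("%s%04d", prefix, ++c)
        }
        { print > out }
    ' "$1"
}

# run_parallel WORKER PIECE...
# Starts WORKER once per piece in the background, then waits for all of them.
run_parallel() {
    worker=$1
    shift
    for piece in "$@"; do
        "$worker" "$piece" &
    done
    wait    # block until every background job has exited
}
```

To use it with the script from the question, wrap the Python call in a small function such as convert() { python FastQ2Seq.py "$1"; } and call run_parallel convert xPart[0-9][0-9][0-9][0-9] after splitting.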

1 Comment

This was really close. I ended up doing awk '{if (NR%500000==1){++c}{print $0 > "xPart"c}}' $1
0

If your seq truly works like the standard seq, you're calling it wrong. The proper command line for seq is:

seq FIRST INCREMENT LAST

So you would need to change your seq command line to:

seq 0 500000 14000000
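
To see why the argument order matters, here is a small sketch of a stand-in that follows the same FIRST INCREMENT LAST convention using only POSIX shell arithmetic, which may also be handy where seq is missing. print_offsets is a made-up name, and this version only handles positive increments.

```shell
#!/bin/sh
# print_offsets FIRST INCREMENT LAST
# Hypothetical stand-in for the three-argument form of seq
# (positive increments only).
print_offsets() {
    i=$1
    while [ "$i" -le "$3" ]; do
        echo "$i"
        i=$((i + $2))
    done
}
```

Called with the arguments in the order used in the question, print_offsets 0 14000000 500000, it prints a single 0, because the first step of 14000000 already overshoots the last value of 500000. That would explain why only xPart0 is created. With print_offsets 0 500000 14000000 it prints 29 offsets, from 0 to 14000000.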
