I'm trying to read a relatively large text file that contains columns of numbers interspersed with other text; I only want the columns of numbers. The file also contains a lot of other text, not shown here, that doesn't appear at such regular intervals.
The file format:
*** LOTS OF OTHER TEXT AND NUMBERS ***
iter continuity x-velocity y-velocity k epsilon vf-vapour_ph time/iter
111 3.4714e-08 5.3037e-10 6.0478e-10 1.6219e-15 1.8439e-13 0.0000e+00 0:00:01 14
112 3.2652e-08 5.0553e-10 5.6497e-10 1.3961e-15 1.5730e-13 0.0000e+00 0:00:01 13
113 3.1371e-08 4.6175e-10 5.0506e-10 1.2020e-15 1.3419e-13 0.0000e+00 0:00:01 12
114 3.0016e-08 4.4331e-10 4.7391e-10 1.0388e-15 1.1447e-13 0.0000e+00 0:00:01 11
115 2.8702e-08 4.2111e-10 4.4778e-10 8.9904e-16 9.7680e-14 0.0000e+00 0:00:01 10
116 2.7476e-08 4.1484e-10 4.2711e-10 7.7955e-16 8.3342e-14 0.0000e+00 0:00:01 9
117 2.6436e-08 3.9556e-10 4.0601e-10 6.7890e-16 7.1113e-14 0.0000e+00 0:00:01 8
118 2.5374e-08 3.8633e-10 3.8826e-10 5.9234e-16 6.0674e-14 0.0000e+00 0:00:00 7
119 2.4292e-08 3.7473e-10 3.7584e-10 5.1814e-16 5.1786e-14 0.0000e+00 0:00:00 6
120 2.3474e-08 3.5952e-10 3.5622e-10 4.5405e-16 4.4207e-14 0.0000e+00 0:00:00 5
121 2.2612e-08 3.4485e-10 3.4159e-10 3.9910e-16 3.7707e-14 0.0000e+00 0:00:00 4
iter continuity x-velocity y-velocity k epsilon vf-vapour_ph time/iter
122 2.1992e-08 3.4100e-10 3.2964e-10 3.5272e-16 3.2204e-14 0.0000e+00 0:00:00 3
123 2.1592e-08 3.2444e-10 3.0170e-10 3.1487e-16 2.7500e-14 0.0000e+00 0:00:00 2
124 2.1053e-08 3.3145e-10 2.9325e-10 2.8009e-16 2.3485e-14 0.0000e+00 0:00:00 1
125 2.0390e-08 3.1502e-10 2.7534e-10 2.5433e-16 2.0053e-14 0.0000e+00 0:00:00 0
step flow-time mfr_arm_inne mfr_arm_oute pressure_sta pressure_sta pressure_tot pressure_tot velocity_max velocity_min
1 5.0000e-07 -5.5662e-08 1.4217e-07 6.0015e+00 5.9998e+00 6.0015e+00 5.9998e+00 2.8934e-04 3.3491e-10
Flow time = 5e-07s, time step = 1
799 more time steps
Updating solution at time levels N and N-1.
done.
Writing data to output file.
Current time=0.000000 Position=-0.00000036409265555078 Velocity=0.000015 Net force=0.210322
Fluid force=-0.477050N, Stator force=0.200000N ,Spring force=-32.990534N ,Top force=0.000000N, Bottom force=33.007906N, External force=0.470000N
Next time=0.000001 Position=-0.00000036400170391852 Velocity=0.000182
Applying motion to dynamic zone.
*** CONTINUING TEXT AND NUMBERS ***
The lines I want are:
111 3.4714e-08 5.3037e-10 6.0478e-10 1.6219e-15 1.8439e-13 0.0000e+00 0:00:01 14
112 3.2652e-08 5.0553e-10 5.6497e-10 1.3961e-15 1.5730e-13 0.0000e+00 0:00:01 13
The script I have so far works, but takes about 80 s to process the whole file.
The parsing is made more awkward, I presume, by the colons in the time column, which appear in some of my files. Some files will have more or fewer columns containing different types of data, and some will have an additional block at the end of the main chunk, such as:
step flow-time mfr_arm_inne mfr_arm_oute pressure_sta pressure_sta pressure_tot pressure_tot velocity_max velocity_min
1 5.0000e-07 -5.5662e-08 1.4217e-07 6.0015e+00 5.9998e+00 6.0015e+00 5.9998e+00 2.8934e-04 3.3491e-10
I don't want this data, but it can have a very similar (sometimes identical) format to the lines I want.
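To illustrate the distinction, here is a minimal sketch of one way the two kinds of rows could be told apart: remember which header was seen most recently. It assumes that every wanted block is preceded by a header starting with "iter" and every unwanted block by one starting with "step", which may not hold for every file. The names inIterBlock, keep and wanted are made up for this example, and the lines are copied from the excerpt above.

```matlab
% Sample lines standing in for the file contents (copied from the excerpt)
raw = { ...
    'iter continuity x-velocity y-velocity k epsilon vf-vapour_ph time/iter'; ...
    '124 2.1053e-08 3.3145e-10 2.9325e-10 2.8009e-16 2.3485e-14 0.0000e+00 0:00:00 1'; ...
    '125 2.0390e-08 3.1502e-10 2.7534e-10 2.5433e-16 2.0053e-14 0.0000e+00 0:00:00 0'; ...
    'step flow-time mfr_arm_inne mfr_arm_oute pressure_sta pressure_sta pressure_tot pressure_tot velocity_max velocity_min'; ...
    '1 5.0000e-07 -5.5662e-08 1.4217e-07 6.0015e+00 5.9998e+00 6.0015e+00 5.9998e+00 2.8934e-04 3.3491e-10'; ...
    'Flow time = 5e-07s, time step = 1'};

inIterBlock = false;        % true while the latest header was an "iter" header (assumption)
keep = false(size(raw));    % logical mask of the lines to keep

for k = 1:numel(raw)
    thisLine = raw{k};
    if strncmp(thisLine, 'iter', 4)
        inIterBlock = true;     % a block of wanted rows starts
    elseif strncmp(thisLine, 'step', 4)
        inIterBlock = false;    % a block of unwanted rows starts
    elseif inIterBlock && ~isempty(regexp(thisLine, '^\s*\d+\s', 'once'))
        keep(k) = true;         % row beginning with an integer inside an "iter" block
    end
end

wanted = raw(keep);         % only the rows under "iter" headers
```

With the sample above, wanted holds the 124 and 125 rows, and the row under "step" is skipped even though it also starts with a number.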
The script below reads each line and checks whether the first few characters (as many as the current iteration number has digits) match the iteration number I expect next (1, 2, 3, ... n). I've done it this way to skip the lines under "step...", which I don't want. However, the file is about 180,000 lines long (and it's my shortest), so you can imagine this gets a little slow.
% read the raw data from the file
file = 'file.txt';
fid = fopen(file, 'r');
raw = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
raw = raw{1,1};
% expression used for splitting the columns up
colExpr = '[\d\.e:\-\+]+';
% beginning number
iterNum = 1;
% loop through lines
for lineIdx = 1:length(raw)
    % convert to string for comparison
    iterStr = num2str(iterNum);
    thisLine = raw{lineIdx, 1};
    % if the line is long enough and starts with the expected iteration number,
    if length(iterStr) <= length(thisLine) && ...
            strcmp(thisLine(1:length(iterStr)), iterStr)
        % split the string into its columns
        result(iterNum,:) = regexp(thisLine, colExpr, 'match');
        iterNum = iterNum + 1;
    end
end
% convert to matrix; str2double returns NaN for fields such as 0:00:01,
% which str2num would evaluate as a colon expression
residuals = str2double(result);
Using the profiler, I can see that the num2str() function is the slowest part (20 s), followed by int2str() (10 s), but I can't see a way of reading the data without it being part of the loop.
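For illustration, here is a minimal, untimed sketch of one way the conversion might be moved out of the per-line path: the string is rebuilt only when iterNum actually changes, using sprintf('%d', ...) in place of num2str(). The sample lines come from the excerpt above, the starting value of 111 is chosen only to fit that excerpt, and the variable matched is made up for this example.

```matlab
% Sample lines copied from the excerpt above
raw = { ...
    'iter continuity x-velocity y-velocity k epsilon vf-vapour_ph time/iter'; ...
    '111 3.4714e-08 5.3037e-10 6.0478e-10 1.6219e-15 1.8439e-13 0.0000e+00 0:00:01 14'; ...
    '112 3.2652e-08 5.0553e-10 5.6497e-10 1.3961e-15 1.5730e-13 0.0000e+00 0:00:01 13'; ...
    'step flow-time mfr_arm_inne mfr_arm_oute pressure_sta pressure_sta pressure_tot pressure_tot velocity_max velocity_min'; ...
    '1 5.0000e-07 -5.5662e-08 1.4217e-07 6.0015e+00 5.9998e+00 6.0015e+00 5.9998e+00 2.8934e-04 3.3491e-10'};

iterNum = 111;                          % assumption: first iteration in this excerpt
iterStr = sprintf('%d', iterNum);       % converted once, before the loop
matched = false(size(raw));             % marks the lines that start with the expected number

for k = 1:numel(raw)
    thisLine = raw{k};
    n = length(iterStr);
    if n <= length(thisLine) && strcmp(thisLine(1:n), iterStr)
        matched(k) = true;
        iterNum = iterNum + 1;
        iterStr = sprintf('%d', iterNum);   % refreshed only after a match
    end
end
```

The matching rule is unchanged from the script above; only the number of string conversions drops from one per line to one per matched line. Whether that recovers a meaningful share of the 80 s would need to be checked with the profiler.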
Is there something I'm missing that would help optimise this process?
EDIT:
I've included more of the lines that I don't want and a possible different format to try and help answers.
load -ascii file.txt