
I'm trying to read a relatively large text file containing columns of numbers interspersed with some other text, though really I just want the columns of numbers. There's a bunch of other text, not shown here, that doesn't appear at such regular intervals.

The file format:

*** LOTS OF OTHER TEXT AND NUMBERS ***

  iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph     time/iter
   111  3.4714e-08  5.3037e-10  6.0478e-10  1.6219e-15  1.8439e-13  0.0000e+00  0:00:01   14
   112  3.2652e-08  5.0553e-10  5.6497e-10  1.3961e-15  1.5730e-13  0.0000e+00  0:00:01   13
   113  3.1371e-08  4.6175e-10  5.0506e-10  1.2020e-15  1.3419e-13  0.0000e+00  0:00:01   12
   114  3.0016e-08  4.4331e-10  4.7391e-10  1.0388e-15  1.1447e-13  0.0000e+00  0:00:01   11
   115  2.8702e-08  4.2111e-10  4.4778e-10  8.9904e-16  9.7680e-14  0.0000e+00  0:00:01   10
   116  2.7476e-08  4.1484e-10  4.2711e-10  7.7955e-16  8.3342e-14  0.0000e+00  0:00:01    9
   117  2.6436e-08  3.9556e-10  4.0601e-10  6.7890e-16  7.1113e-14  0.0000e+00  0:00:01    8
   118  2.5374e-08  3.8633e-10  3.8826e-10  5.9234e-16  6.0674e-14  0.0000e+00  0:00:00    7
   119  2.4292e-08  3.7473e-10  3.7584e-10  5.1814e-16  5.1786e-14  0.0000e+00  0:00:00    6
   120  2.3474e-08  3.5952e-10  3.5622e-10  4.5405e-16  4.4207e-14  0.0000e+00  0:00:00    5
   121  2.2612e-08  3.4485e-10  3.4159e-10  3.9910e-16  3.7707e-14  0.0000e+00  0:00:00    4
  iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph     time/iter
   122  2.1992e-08  3.4100e-10  3.2964e-10  3.5272e-16  3.2204e-14  0.0000e+00  0:00:00    3
   123  2.1592e-08  3.2444e-10  3.0170e-10  3.1487e-16  2.7500e-14  0.0000e+00  0:00:00    2
   124  2.1053e-08  3.3145e-10  2.9325e-10  2.8009e-16  2.3485e-14  0.0000e+00  0:00:00    1
   125  2.0390e-08  3.1502e-10  2.7534e-10  2.5433e-16  2.0053e-14  0.0000e+00  0:00:00    0
  step  flow-time mfr_arm_inne mfr_arm_oute pressure_sta pressure_sta pressure_tot pressure_tot velocity_max velocity_min
     1  5.0000e-07 -5.5662e-08  1.4217e-07  6.0015e+00  5.9998e+00  6.0015e+00  5.9998e+00  2.8934e-04  3.3491e-10
Flow time = 5e-07s, time step = 1
799 more time steps

Updating solution at time levels N and N-1.
 done.


Writing data to output file.
Current time=0.000000  Position=-0.00000036409265555078  Velocity=0.000015  Net force=0.210322
Fluid force=-0.477050N, Stator force=0.200000N ,Spring force=-32.990534N ,Top force=0.000000N, Bottom force=33.007906N, External force=0.470000N

Next time=0.000001  Position=-0.00000036400170391852  Velocity=0.000182
Applying motion to dynamic zone.

*** CONTINUING TEXT AND NUMBERS ***

The lines I want are:

111  3.4714e-08  5.3037e-10  6.0478e-10  1.6219e-15  1.8439e-13  0.0000e+00  0:00:01   14
112  3.2652e-08  5.0553e-10  5.6497e-10  1.3961e-15  1.5730e-13  0.0000e+00  0:00:01   13

The script I have so far works, but takes about 80s to do the whole thing.

This is made more awkward, I presume, by the colons in the time column, which are there in some of my files. Some files will have more or fewer columns containing different types of data, and some will have an additional set at the end of the main chunk, such as:

  step  flow-time mfr_arm_inne mfr_arm_oute pressure_sta pressure_sta pressure_tot pressure_tot velocity_max velocity_min
     1  5.0000e-07 -5.5662e-08  1.4217e-07  6.0015e+00  5.9998e+00  6.0015e+00  5.9998e+00  2.8934e-04  3.3491e-10

I'm not looking to get this data, but it can have a very similar (sometimes identical) format to the lines I want.

The script essentially reads each line and checks whether the few characters at the front of the line (based on the length of the iteration number) match the ones I'd be expecting (starting with 1, 2, 3... n). The reason I've done it this way is to remove the lines under "step...", which I don't want. However, the file is about 180,000 lines long (and it's my shortest), so you can imagine this gets a little slow.

% read the raw data from the file
file = 'file.txt';
fid = fopen(file, 'r');
raw = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
raw = raw{1,1};

% expression used for splitting the columns up
colExpr = '[\d\.e:\-\+]+';

% beginning number
iterNum = 1;

% loop through lines
for lineIdx = 1:length(raw)

    % convert the expected iteration number to a string for comparison
    iterStr = num2str(iterNum);
    thisLine = raw{lineIdx, 1};

    % if the right length and the right string,
    if length(iterStr) <= length(thisLine) && ...
            strcmp(thisLine(1:length(iterStr)), iterStr)

        % split the string
        result(iterNum,:) = regexp(thisLine,colExpr, 'match');

        iterNum = iterNum + 1;

    end

end

% convert to matrix
residuals = cellfun(@str2num, result); 

Using the profiler, I realise that the num2str() function is the slowest part (20s), followed by int2str() (10s), though I can't see a way of reading the data without it being part of the loop.

Wondering if there's something I'm missing to try and optimise this process?

EDIT:

I've included more of the lines that I don't want and a possible different format to try and help answers.

  • Can you show us what some of that "other text and numbers" looks like? Is it at least a consistent number of lines of text to ignore? Commented Apr 13, 2016 at 15:25
  • None of it is particularly consistent, which is why I opted for looking for what I wanted, rather than ignoring what I didn't, so I'm not sure if it'd help? Commented Apr 13, 2016 at 15:26
  • You could also preprocess the file to remove the text lines, keeping only the numbers, before loading it into MATLAB (something like grep, sed, or awk can easily do this). You could then import the file into MATLAB very quickly with one line of code: load -ascii file.txt Commented Apr 13, 2016 at 18:57
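The preprocessing suggested in that comment could look something like this (a sketch; the file names are placeholders):

```shell
# Keep only lines whose first non-blank character is a digit; the result
# is purely numeric and can then be read in MATLAB with: load -ascii file2.txt
awk '/^[[:space:]]*[0-9]/' file.txt > file2.txt
```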

3 Answers


Here is a different approach: we first process the file externally, with something like:

# only keep lines starting with a digit
$ grep '^\s*[0-9]' file.txt > file2.txt

On Windows, you can use findstr as equivalent to grep:

C:\> findstr /R /c:"^[ \t]*[0-9]" file.txt > file2.txt

Now in MATLAB, it's easy to load the resulting numeric data as a matrix:

>> load -ascii file2.txt
>> t = array2table(file2, 'VariableNames',...
    {'iter','continuity','xvelocity','yvelocity','k','epsilon','vf_vapour_ph'})
t = 
    iter    continuity    xvelocity     yvelocity        k          epsilon      vf_vapour_ph
    ____    __________    __________    _________    __________    __________    ____________
     1             0      6.2376e-07            0     0.0018988        2708.2    0           
     2             0         0.21656      0.23499     0.0097531       0.13395    0           
     3             0         0.11755      0.12824     0.0032109        0.1146    0           
     4             0        0.068112     0.072691    0.00089801      0.062219    0           
     5             0        0.043498     0.045244    0.00020248      0.025923    0           
     6        0.1938        0.029107     0.029029    4.8399e-05     0.0099171    0           
     7       0.13594        0.020037     0.019577    1.5502e-05     0.0043624    0           
     8      0.097518        0.013805     0.013249    5.1736e-06     0.0023341    0           
     9      0.070467       0.0098312    0.0091925    1.8272e-06     0.0012615    0           
    10      0.051538       0.0071181    0.0064673    7.2446e-07     0.0007012    0           
    11      0.038065       0.0052115    0.0046128    4.2786e-07    0.00040619    0           
    12      0.028369       0.0038465    0.0033381    2.8256e-07    0.00025864    0           
    13      0.021326        0.002857    0.0024454    1.9279e-07    0.00016126    0           


Since you already have the entire thing loaded into a cell array (raw), you can call regexp directly on it to remove the bad rows.

%// Find lines that contain your data
matches = regexp(raw, '^\s*\d(.*?\de[+\-]\d){6}');

%// Empty matches (header lines) should be removed
toremove = cellfun(@isempty, matches);
raw = raw(~toremove);

Then you can convert the result into a numeric array using str2num combined with strjoin.

data = reshape(str2num(strjoin(raw)), 7, []).';

The benefit of this answer is that you avoid using any sort of looping or repeated function calls which are notorious for slowing MATLAB down.
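One caveat worth flagging (my own assumption, not part of the answer above): the wall-clock column (e.g. 0:00:01) and the trailing iteration counter are not plain numbers, so str2num may not digest them cleanly. A minimal sketch that strips those two trailing tokens first, leaving 7 plain numbers per line:

```matlab
% Hypothetical cleanup: drop the wall-clock token (e.g. 0:00:01) and the
% trailing countdown value from the end of each line, then convert
raw = regexprep(raw, '\s+\d+:\d+:\d+\s+\d+\s*$', '');
data = reshape(str2num(strjoin(raw)), 7, []).';
```

This assumes every remaining line ends with exactly those two extra tokens; lines without them pass through regexprep unchanged.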

Update

An alternate version of @Pursuit's answer would be something like:

numbers = cellfun(@(x)sscanf(x, '%f %f %f %f %f %f %f').', raw, 'uni', 0);
numbers = cat(1, numbers{:});

4 Comments

I think that was similar to my original plan, though there are other lines within the text file that begin with a number, are a different length, and aren't relevant, which could cause problems?
@LADransfield There are no loops, so it is much faster than your initial solution. Look at the last regex and sub that in and it should only detect the relevant rows
@LADransfield It would also help us provide a better answer if you show some of these rows. Maybe upload it to an external site so we can do some benchmarking?
I've updated to include some other possible number formats and a couple of extra lines that repeatedly come after the main block of numbers. The main blocks aren't always a set length or a set number of columns, though the numbers will always increment by one. The problem comes when the line after the main block has other information in a very similar format which, using the method above, is picked up.

I would try running sscanf on each line, and only using the lines with a good hit.

Note that if:

raw{11} = '11  3.8065e-02  5.2115e-03  4.6128e-03  4.2786e-07  4.0619e-04  0.0000e+00'
raw{12} = 'iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph'

Then

>> sscanf(raw{11},'%f')
ans =
                        11
                  0.038065
                 0.0052115
                 0.0046128
                4.2786e-07
                0.00040619
                         0

And:

>> sscanf(raw{12},'%f')
ans =
     []

To complete this thought, your code would look like this:

%% Read the file
file = 'dataFile.txt';
fid = fopen(file, 'r');
raw = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
raw = raw{1,1};

%% Parse the file into the "residuals" variable

nextLine = 1; %This is the index of next line to insert

%Go through each line, one at a time
for ix = 1:length(raw)    
    %Parse the line with sscanf
    numbers = sscanf(raw{ix},'%f');

    if ~isempty(numbers)  %Skip any row that did not parse, otherwise ...
        %If you know the number of columns, you could replace "~isempty()" with "length()== "

        if nextLine == 1
            %If this is the first line of numbers, then initialize the
            %"residuals" variable.
            residuals = zeros(length(raw), length(numbers));
        end

        %Store the data, and increment "nextLine"
        residuals(nextLine,:) = numbers;
        nextLine = nextLine + 1;
    end
end

%Now, trim the excess allocation from "residuals"
residuals = residuals(1:(nextLine-1),:)

(Please let me know how it compares in speed.)

