
I have several .csv files that I read in MATLAB using textscan, because csvread and xlsread cannot handle files of this size (200 MB to 600 MB).

I use this line to read it:

C = textscan(fileID,'%s%d%s%f%f%d%d%d%d%d%d%d','delimiter',',');

The problem is that some lines are not in this format, and textscan then stops reading at that line without raising any error.

So instead I read the file this way:

C = textscan(fileID,'%s%d%s%f%f%s%s%s%s%s%s%s%s%s%s%s','delimiter',',');

This way I can see that in 2 rows out of 3 million the format is different.

I want to read all the lines except the bad/different ones. In addition, is it possible to read only the lines whose first string is 'PAA'?

I have tried to load the file directly into MATLAB, but it is very slow and sometimes gets stuck. For the really big files it reports a memory problem.

Any recommendations?

  • Which data types does your file contain? :) Commented Dec 10, 2015 at 15:33
  • Super cool GPS data :) Commented Dec 17, 2015 at 10:59

2 Answers


For large files that are still small enough to fit in memory, reading all lines at once is typically the best choice.

f = fopen('data.txt');                   % open the file for reading
g = textscan(f,'%s','delimiter','\n');   % g{1} holds one string per line
fclose(f);

In the next step, identify the lines starting with PAA using strncmp.

Now that your data is filtered, apply your textscan expression above to each line. If it fails, try the other format string.
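The steps above can be sketched as follows. This is only a sketch under assumptions: the file name `data.txt` is the placeholder from the snippet above, and the variable names `allLines`, `paaLines`, and `goodLines` are hypothetical. Instead of calling textscan once per line, which would be slow for millions of lines, this variant counts commas to drop malformed lines and then parses the remaining text in a single call. The comma count is a heuristic: a line with the right number of fields but a non-numeric value in a numeric column could still make textscan stop early.

```matlab
% Read every line of the file as one string per cell.
f = fopen('data.txt');
g = textscan(f, '%s', 'delimiter', '\n');
fclose(f);
allLines = g{1};

% Keep only the lines that start with 'PAA'.
paaLines = allLines(strncmp(allLines, 'PAA', 3));

% The format in the question has 12 fields, i.e. exactly 11 commas per line.
nCommas   = cellfun(@(s) sum(s == ','), paaLines);
goodLines = paaLines(nCommas == 11);

% Join the surviving lines and parse them with the original format string.
txt = sprintf('%s\n', goodLines{:});
C = textscan(txt, '%s%d%s%f%f%d%d%d%d%d%d%d', 'delimiter', ',');
```

Note that joining the lines with sprintf makes a second copy of the filtered data, so this only suits files that comfortably fit in memory.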


1 Comment

It gives me the following warning: "Warning: The encoding 'windows-1255' is not supported. See the documentation for FOPEN." By the way, it is a csv file.

MATLAB is slow with this kind of thing because it needs to load everything into memory. I would suggest using grep/awk or similar command-line tools to reduce your file to the readable lines before processing them in MATLAB. On Linux you can use:

awk '/^PAA/' yourfile.csv > yourNewFile.csv   # This gives you a new file with all the lines that start with PAA (NOTE: case sensitive)

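To drop the lines that do not have the same format, you can use:

awk -F ',' 'NF == 12' yourfile.csv > yourNewFile.csv

This splits each line on "," and keeps only the lines with exactly 12 fields (11 commas), discarding any line with more or fewer. To list the malformed lines with their line numbers instead, use the pattern NF != 12 {print NR, $0}.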
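For readers without awk, the same filtering can be sketched in MATLAB by streaming the file line by line, so the whole file never has to sit in memory. This is a minimal sketch, not a tuned implementation: the file names are the placeholders used above, and the count of 12 fields (11 commas) is an assumption taken from the format string in the question.

```matlab
% Copy only the wanted lines from the input file to a new file.
fin  = fopen('yourfile.csv', 'r');
fout = fopen('yourNewFile.csv', 'w');

tline = fgetl(fin);                 % fgetl returns -1 (not a char) at end of file
while ischar(tline)
    % Keep lines that start with 'PAA' and contain exactly 11 commas.
    if strncmp(tline, 'PAA', 3) && sum(tline == ',') == 11
        fprintf(fout, '%s\n', tline);
    end
    tline = fgetl(fin);
end

fclose(fin);
fclose(fout);
```

The resulting file can then be read with the original textscan call.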

1 Comment

I have never worked with Linux. Do I need to download a program?
