Reading huge .csv files with MATLAB - file is not well organized

Question

I have several .csv files that I read with MATLAB using textscan, because csvread and xlsread cannot handle files of this size (200 MB to 600 MB).

I use this line to read it:

C = textscan(fileID,'%s%d%s%f%f%d%d%d%d%d%d%d','delimiter',',');

The problem I have found is that sometimes the data is not in this format, and then textscan stops reading at that line without raising any error.

So what I have done is to read it in this way

C = textscan(fileID,'%s%d%s%f%f%s%s%s%s%s%s%s%s%s%s%s','delimiter',',');

This way I can see that in 2 rows out of 3 million the format is different.

I want to read all the lines except the bad/different ones. In addition, is it possible to read only the lines whose first string is 'PAA'?

I have tried to load the file directly into MATLAB, but it is very slow and sometimes it gets stuck. For the really big files it reports a memory problem.

Any recommendations?

Which data types does your file contain? :)

– Mahesh Kumar Kodanda, Commented Dec 10, 2015 at 15:33
super cool GPS Data :)

– Nati Barchilon, Commented Dec 17, 2015 at 10:59

Daniel · Accepted Answer · 2015-12-10 15:35:41Z

3

For large files that are still small enough to fit in memory, reading all lines at once is typically the best choice.

f = fopen('data.txt');                  % open the file for reading
g = textscan(f,'%s','delimiter','\n');  % read each line as one string; the lines end up in g{1}
fclose(f);

In the next step, identify the lines starting with PAA using strncmp.

Now that your data is filtered, apply your textscan expression above to each line. If it fails, try the other one.
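The two steps above can be sketched roughly as follows. This is a minimal sketch, not a tuned implementation: the sample `lines` cell array is hypothetical and stands in for `g{1}`, and "textscan failed" is approximated by checking that every field of the format yielded exactly one value, which is an assumption about how you want to define a bad line.

```matlab
% Hypothetical sample standing in for g{1} from the textscan call above.
lines = { ...
    'PAA,1,abc,1.5,2.5,1,2,3,4,5,6,7'; ...
    'PBB,2,def,3.5,4.5,1,2,3,4,5,6,7'; ...
    'PAA,3,ghi,oops,4.5,1,2,3,4,5,6,7'};

% Keep only the lines whose first three characters are 'PAA'.
isPAA = strncmp(lines, 'PAA', 3);
paaLines = lines(isPAA);

fmt = '%s%d%s%f%f%d%d%d%d%d%d%d';   % format string from the question
good = cell(0, 1);                   % parsed results of well-formed lines
for k = 1:numel(paaLines)
    C = textscan(paaLines{k}, fmt, 'Delimiter', ',');
    % Assumption: a line is well formed only if every field of the
    % format produced exactly one value. textscan stops at the first
    % field it cannot convert, leaving the later fields empty.
    if all(cellfun(@numel, C) == 1)
        good{end+1, 1} = C; %#ok<AGROW>
    end
end
```

Calling textscan once per line is simple but can be slow for millions of lines; if only a handful of lines are malformed, it may be cheaper to drop them first and parse the remaining lines in a single call.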

answered Dec 10, 2015 at 15:35


1 Comment

Nati Barchilon Over a year ago

It gives me the following warning: "Warning: The encoding 'windows-1255' is not supported. See the documentation for FOPEN." By the way, it is a .csv file.

GameOfThrows · Accepted Answer · 2015-12-10 16:14:24Z

0

MATLAB is slow with this kind of task because the whole file has to be loaded into memory. I would suggest using command-line tools (grep, awk, bash or cmd) to reduce your file to the lines you need before processing them in MATLAB. On Linux you can use:

awk '/^PAA/' yourfile.csv > yourNewFile.csv

This gives you a new file containing only the lines that start with PAA (note: the match is case sensitive).

To keep only the lines that have the expected format, you can use:

awk -F ',' 'NF == 12' yourfile.csv > yourNewFile.csv

This counts the comma-separated fields in each line and discards any line that does not have exactly 12 fields (11 commas), which is the number of fields in your format string.
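For readers without access to awk, the same two filters (line starts with PAA, exactly 12 comma-separated fields) can be sketched in MATLAB by streaming the file one line at a time, so that the whole file never has to sit in memory. This is only a sketch: the file names are placeholders, the small sample file is created just so the example runs on its own, and it assumes no field contains a quoted comma.

```matlab
% Hypothetical file names; a tiny sample input is written first so the
% sketch is self-contained.
inName  = fullfile(tempdir, 'yourfile.csv');
outName = fullfile(tempdir, 'yourNewFile.csv');
fid = fopen(inName, 'w');
fprintf(fid, '%s\n', 'PAA,1,abc,1.5,2.5,1,2,3,4,5,6,7');
fprintf(fid, '%s\n', 'PBB,2,def,3.5,4.5,1,2,3,4,5,6,7');
fprintf(fid, '%s\n', 'PAA,3,ghi,5.5,6.5,1,2,3');
fclose(fid);

fin  = fopen(inName, 'r');
fout = fopen(outName, 'w');
kept = 0;                      % number of lines written to the output
tline = fgetl(fin);            % fgetl returns -1 (not char) at end of file
while ischar(tline)
    % 12 fields means exactly 11 commas (assumes no quoted commas).
    if strncmp(tline, 'PAA', 3) && sum(tline == ',') == 11
        fprintf(fout, '%s\n', tline);
        kept = kept + 1;
    end
    tline = fgetl(fin);
end
fclose(fin);
fclose(fout);
```

The filtered file can then be read with the original textscan format string in a single call.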

edited Dec 10, 2015 at 16:14

answered Dec 10, 2015 at 15:59


1 Comment

Nati Barchilon Over a year ago

I have never worked with Linux. Do I need to download a program?
