
I wrote the following (static) class to read files easily in C#. However, since it is a static class, I wonder whether there would be any gain in making it non-static (and therefore passing arguments such as the separator and the path as fields of the class).

Besides, the OrderedDictionary class may not be a good choice (it is not generic, but it does preserve the order of the columns), and I don't know whether there is a nice workaround for this.

using System;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.IO;

namespace MySolution.FileUtils
{
    /// <summary>
    /// Various helpers to read through the lines of a file.
    /// </summary>
    public static class LinesEnumerator
    {
        public static string ReadFirstLine(string path)
        {
            using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            using (StreamReader sr = new StreamReader(fs))
            {
                return sr.ReadLine();
            }
        }

        /// <summary>
        /// Enumerates the lines of a file.
        /// </summary>
        /// <param name="path">The path of the file.</param>
        /// <param name="maxLines">The maximum number of lines to read.</param>
        /// <returns>The lines of a file, as an IEnumerable</returns>
        public static IEnumerable<string> LinesReader(string path, int maxLines = Int32.MaxValue)
        {
            string line;
            int lineNumber = 0;
            using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            using (StreamReader sr = new StreamReader(fs))
            {
                while ((line = sr.ReadLine()) != null && lineNumber < maxLines)
                {
                    lineNumber++;
                    yield return line;
                }
            }
        }

        /// <summary>
        /// Reads the lines of a CSV file and enumerates them as ordered dictionaries keyed by the header row.
        /// </summary>
        /// <param name="path">Path to the CSV file</param>
        /// <param name="separator">Separator used by the CSV file</param>
        /// <returns>A collection of ordered dictionaries, corresponding to the lines of the file</returns>
        public static IEnumerable<OrderedDictionary> DictReaderCSV(string path, char separator)
        {
            using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            using (StreamReader sr = new StreamReader(fs))
            {
                string[] header = sr.ReadLine().Split(separator);
                string line = "";
                while ((line = sr.ReadLine()) != null)
                {
                    string[] splittedLine = line.Split(separator);
                    OrderedDictionary od = new OrderedDictionary();
                    for (int i = 0; i < header.Length; i++)
                        od.Add(header[i], splittedLine[i]);
                    yield return od;
                }
            }
        }
    }
}

4 Answers

Naming Items

You should avoid abbreviations like Dict in DictReaderCSV; spelling it out as Dictionary is cleaner.

Next, it's much more natural to use nouns for class names and verbs for methods. As such, I would name your class CSVDictionaryReader and your main method EnumerateEntries(). LinesEnumerator didn't give any hint that the class is also responsible for parsing CSV.

Unused code

ReadFirstLine doesn't seem to be used in the current code.

Streams

The StreamReader class already provides constructors that accept a file path, so there is no need to manually create a FileStream just to pass it to the StreamReader.

What would give much more flexibility, however, is for the class to allow the user to provide an already constructed Stream or TextReader. Perhaps the user wants to parse a 100 MB CSV as it's arriving over the wire, without writing it to disk.
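
For example, the path-based method could become a thin wrapper around a TextReader-based overload. A rough sketch only, reusing the question's method name and assuming it sits in the same class with the same usings:

public static IEnumerable<string> LinesReader(TextReader reader)
{
    // Works with any reader: a StreamReader over a file, a StringReader in a test,
    // or a reader wrapping a network stream.
    string line;
    while ((line = reader.ReadLine()) != null)
        yield return line;
}

public static IEnumerable<string> LinesReader(string path)
{
    using (var reader = new StreamReader(path))
        foreach (var line in LinesReader(reader))
            yield return line;
}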

Parsing CSV

CSV files allow for quotes to be used so that entries can contain the separator. A parser should be prepared to correctly deal with this.

Id,Name,Age
1,John,20
2,"Jack, the second", 21

If you need a classic CSV reader, there are plenty of good choices on NuGet; if you don't have special requirements, I would just use one of the common ones.

That said, I made a few changes to simplify the code and reduce duplication:

public static class CsvReader
{
    public static IEnumerable<string> GetLines(string path, int limit) => GetLines(path).Take(limit);

    public static IEnumerable<string> GetLines(string path)
    {
        using (var streamReader = new StreamReader(path))
        {
            string line;

            while ((line = streamReader.ReadLine()) != null)
                yield return line;
        }
    }

    public static IEnumerable<OrderedDictionary> GetCharSeparatedDictionaries(string path, char separator)
    {
        var lineEnumerator = GetLines(path).GetEnumerator();

        if(!lineEnumerator.MoveNext())
            yield break;

        var header = lineEnumerator.Current.Split(separator);

        if(header.Distinct().Count() != header.Length)
            // Bail out on duplicate headers; OrderedDictionary.Add would throw on a duplicate key later.
            yield break;

        while (lineEnumerator.MoveNext())
        {
            var splittedLine = lineEnumerator.Current.Split(separator);

            // Not sure why you would want a dictionary for each line (?)
            var orderedDictionary = new OrderedDictionary();

            // Assuming the file is valid and no line contains fewer fields than the header; if that assumption is wrong, add a check.
            for (var i = 0; i < header.Length; i++)
                orderedDictionary.Add(header[i], splittedLine[i]);

            yield return orderedDictionary;
        }
    }
}

I appreciate that you may have used File.Open to specify a particular sharing mode; if so, just put it back, unless there is a suitable overload of StreamReader.

There are a few weak points where exceptions could be thrown; I have highlighted some with comments and covered others with defensive code (an empty file when fetching the headers, duplicate headers). If you have already validated your input file, just remove the checks.

LINQ is a great code saver for the limit method: Take removes the duplication.

Name-wise, I removed CSV because your dictionary factory method takes any separator, not just a comma; speaking of which, I would also rename the class to something less specific.

If you want the class to be testable, you can inject the file reader so that you can mock it.
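
As a rough sketch of what I mean (CharSeparatedReader and the delegate are placeholder names, not from your code), the class could take a factory for the reader instead of opening the file itself:

public class CharSeparatedReader
{
    private readonly Func<TextReader> _openReader;

    // In production pass () => new StreamReader(path);
    // in a unit test pass () => new StringReader("a;b\n1;2") - no file system involved.
    public CharSeparatedReader(Func<TextReader> openReader)
    {
        _openReader = openReader;
    }

    public IEnumerable<string> GetLines()
    {
        using (var reader = _openReader())
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }
}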

Good comment from @D. Jurcau about the separator in the content; one more reason to use a good existing CSV parser and then add your logic for the dictionary.

  • oh no, a for and an if without {} - your code review needs a code review ;-) Commented Aug 20, 2016 at 19:47
  • @t3chb0t less distractions from the real code :) I used to write VB.NET, I don't like parenthesis! But drop lines for debugging are a must. Commented Aug 20, 2016 at 19:51

I suggest you use only the LinesReader to read lines and replace ReadFirstLine with a call to FirstOrDefault():

return LinesReader(path).FirstOrDefault();

This way you avoid code repetition.


In DictReaderCSV you can do the same and reuse the reader:

var lines = LinesReader(path);
var header = lines.FirstOrDefault();
foreach(var line in lines.Skip(1))
{
    ...
}

Making it non-static and providing an abstraction would enable you to use it via Dependency Injection.
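
A minimal sketch of such an abstraction (ILinesReader and FileLinesReader are names I'm inventing for illustration):

public interface ILinesReader
{
    IEnumerable<string> ReadLines(string path);
}

public class FileLinesReader : ILinesReader
{
    public IEnumerable<string> ReadLines(string path)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }
}

Consumers then depend on ILinesReader, so tests can supply a fake that yields canned lines while the DI container registers FileLinesReader as the real implementation.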

  • I would avoid calling FirstOrDefault and then Skip; this will enumerate the collection twice and therefore open the file twice and double up the creation of the FileStream/StreamReader instances. Commented Aug 20, 2016 at 19:28
  • @Uno oh, you're absolutely right, I meant to add .ToList() in the first line but somehow forgot to do it. Commented Aug 20, 2016 at 19:30
  • That would work, but it might defeat the purpose of using an iterator to read line by line as it will load the whole file into memory. Commented Aug 20, 2016 at 19:33
  • @Uno as the OP didn't say anything about the size of the files I assume they are digestible and only a few kilobytes or megabytes in size - in this case the YAGNI principle applies ;-) or premature optimization is the root of all evil. Commented Aug 20, 2016 at 19:35
  • True, but at this point I would remove the iterator and ToList() in the LinesReader as it will save the creation of the state machine. (Assuming LinesReader is not used somewhere else with a different intent, but that would be odd.) Commented Aug 20, 2016 at 19:40

I don't like the overhead of

IEnumerable<OrderedDictionary>

You are repeating the column names for every row, and there is no check for a row's field count not matching the header's.

Consider

IEnumerable<IEnumerable<string>> 

Or

IEnumerable<string[]>

Somehow report it if a line does not match the header.

Split accepts a char[]; might as well support that too.
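
Putting those points together, a sketch of what I mean (ReadRows and the use of InvalidDataException are just illustrative choices on my part):

public static IEnumerable<string[]> ReadRows(string path, params char[] separators)
{
    using (var reader = new StreamReader(path))
    {
        string headerLine = reader.ReadLine();
        if (headerLine == null)
            yield break; // empty file

        string[] header = headerLine.Split(separators);
        yield return header; // the column names appear only once, as the first row

        string line;
        int lineNumber = 1;
        while ((line = reader.ReadLine()) != null)
        {
            lineNumber++;
            string[] fields = line.Split(separators);
            if (fields.Length != header.Length)
                throw new InvalidDataException(
                    $"Line {lineNumber}: expected {header.Length} fields but found {fields.Length}.");

            yield return fields;
        }
    }
}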
