2

Update:

Thanks for helping out. I did end up using a CSV parser to get what I wanted; I am asking only because I want to know how a CSV parser works internally.


The data is part of a Google Analytics CSV report. I have found many libraries that can retrieve what I want, but I would really like to know the best way to extract the data in this particular case. It did not look hard at first, but it is driving me crazy...

The data looks like this as a string:

/page1/index.php,"795,852","620,499",00:03:25,"33,416",10.82%,66.43%,$0.00

The string /page1/index.php is the page's name. The first number, "795,852", is the page views; the second number, "620,499", is the unique page views; then comes the average duration.

Then I want to parse it to an object as:

{
  page: "/page1/index.php",
  pv: 795852,
  uv: 620499,
  avg_time: "00:03:25"
}

For various reasons, I only need to keep the first four fields of this string. Simple JavaScript code parsed it fine until I noticed that the rows look different when the page view numbers are small.

For instance, sometimes it looks like:

/page2/index.php,"795,852",620,00:03:25,"33,416",10.82%,66.43%,$0.00

Or:

/page3/index.php,852,"620,499",00:03:25,"33,416",10.82%,66.43%,$0.00

Or:

/page4/index.php,852,620,00:03:25,"33,416",10.82%,66.43%,$0.00

The rule is: when the number is a thousand or more, it is written as

"795,852"

But when the number is smaller, it's just

852

It has no quotes around it and, of course, no comma as the thousands separator. This makes it very hard to get the data with a regular expression alone.

This makes it very difficult to parse the string into the desired object, something like:

{
  page: "/page1/index.php",
  pv: 795852,
  uv: 620499,
  avg_time: "00:03:25"
}

Any good ideas on parsing this with JavaScript?

1
  • 1
    Regarding your "update": the CSV parser won't be using a regex either. It parses the CSV according to the standard, which means consuming the input character by character, deciding what each character means in the current context, and building the fields from that. This is generally implemented as a finite state machine. Commented Sep 15, 2015 at 19:50
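
To give a rough idea of the state machine approach described in the comment above, here is a minimal sketch. It is not how any particular library is implemented. The names parseCsvLine and toRecord are made up for illustration, and the code assumes one record per line with no line breaks inside quoted fields, which holds for the sample rows in the question.

```javascript
// Hypothetical helper: splits one CSV line into fields with a tiny state machine.
// The only state is whether we are currently inside a quoted field.
function parseCsvLine(line) {
  var fields = [];
  var field = '';
  var inQuotes = false;

  for (var i = 0; i < line.length; i++) {
    var ch = line[i];
    if (inQuotes) {
      if (ch === '"') {
        if (line[i + 1] === '"') {
          field += '"';      // a doubled quote is an escaped quote
          i++;
        } else {
          inQuotes = false;  // closing quote
        }
      } else {
        field += ch;         // commas are literal inside quotes
      }
    } else if (ch === '"') {
      inQuotes = true;       // opening quote
    } else if (ch === ',') {
      fields.push(field);    // an unquoted comma ends the field
      field = '';
    } else {
      field += ch;
    }
  }
  fields.push(field);        // the last field has no trailing comma
  return fields;
}

// Hypothetical helper: keeps the first four fields, as the question asks.
function toRecord(line) {
  var f = parseCsvLine(line);
  return {
    page: f[0],
    pv: parseInt(f[1].replace(/,/g, ''), 10),
    uv: parseInt(f[2].replace(/,/g, ''), 10),
    avg_time: f[3]
  };
}
```

Calling toRecord on each line of the report would give objects in the shape shown in the question, whether or not a given number is quoted.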

4 Answers

2

Use a CSV parser, not a regex. Try something like this: https://www.npmjs.com/package/csv

Regex is not a suitable tool for parsing CSV.


1

I agree with the arguments against using regex for such problems, in general, and it would probably be easier to use a proper parser; however, in this case, I think a regex will work:

^([^,]+),(("[^"]+")|([^,]+)),(("[^"]+")|([^,]+)),([^,]+),

That is:

  • the first field is everything up to the first comma
  • if the next field starts with a ", get everything up to the next "; otherwise, get everything up to the next comma
  • the same applies to the field after that
  • the last field is everything up to the next comma
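
As a sketch of how this expression might be used in JavaScript: because of the nested groups, the four values end up in capture groups 1, 2, 5 and 8. The names rowPattern, toNumber and parseRow are only illustrative, and the code assumes the page name itself never contains a comma.

```javascript
// The expression from above, unchanged.
var rowPattern = /^([^,]+),(("[^"]+")|([^,]+)),(("[^"]+")|([^,]+)),([^,]+),/;

// Hypothetical helper: drops the quotes and thousands separators, then parses.
function toNumber(text) {
  return parseInt(text.replace(/[",]/g, ''), 10);
}

// Hypothetical helper: returns null when the line does not match the pattern.
function parseRow(line) {
  var m = rowPattern.exec(line);
  if (!m) {
    return null;
  }
  return {
    page: m[1],
    pv: toNumber(m[2]),  // group 2 wraps both alternatives of the second field
    uv: toNumber(m[5]),  // group 5 does the same for the third field
    avg_time: m[8]
  };
}
```

Note that the pattern ends with a comma, so a line that stops after the fourth field would not match and parseRow would return null.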

2 Comments

@AwQituiGuo: Because of the nested grouping, you'd need to extract elements 1,2,5 and 8.
@AwQituiGuo: Nice try; your input is incomplete.
0

Try a CSV parser, such as Papa Parse.


0

How about:

var data = [
  '/page1/index.php,"795,852","620,499",00:03:25,"33,416",10.82%,66.43%,$0.00',
  '/page2/index.php,"795,852",620,00:03:25,"33,416",10.82%,66.43%,$0.00',
  '/page3/index.php,852,"620,499",00:03:25,"33,416",10.82%,66.43%,$0.00',
  '/page4/index.php,852,620,00:03:25,"33,416",10.82%,66.43%,$0.00'
];

// Strip the thousands separator from quoted numbers, then split on commas.
var result = data.map(function (item) {
  return item.replace(/"(\d+),(\d+)"/g, '$1$2');
}).map(function (item) {
  var a = item.split(',');
  return {
    page: a[0],
    pv: parseInt(a[1], 10),
    uv: parseInt(a[2], 10),
    avg_time: a[3]
  };
});

Which results in:

[
  {
    "page": "/page1/index.php",
    "pv": 795852,
    "uv": 620499,
    "avg_time": "00:03:25"
  },
  {
    "page": "/page2/index.php",
    "pv": 795852,
    "uv": 620,
    "avg_time": "00:03:25"
  },
  {
    "page": "/page3/index.php",
    "pv": 852,
    "uv": 620499,
    "avg_time": "00:03:25"
  },
  {
    "page": "/page4/index.php",
    "pv": 852,
    "uv": 620,
    "avg_time": "00:03:25"
  }
]

What's wrong with this?

  • It's fragile
  • The RegEx to replace the , in the numbers is weak: it only handles a single separator, so a quoted value such as "1,234,567" would not match

But...

  • It seems to work!
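
If numbers of a million or more can show up, the replace step could be made a little less weak along these lines. This is only a sketch: stripThousands is a made-up name, and the "1,234,567" case is an assumption about what larger values would look like, since none appear in the sample data.

```javascript
// Hypothetical helper: removes the quotes and every thousands separator from
// quoted numbers such as "795,852" or "1,234,567", leaving other text alone.
function stripThousands(line) {
  return line.replace(/"(\d{1,3}(?:,\d{3})+)"/g, function (match, digits) {
    return digits.replace(/,/g, '');
  });
}
```

It is still fragile in the same way as before: a page name that contains a comma or a quote would break the split(',') step that follows.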
