StringCases functionality

Question

I have a list of strings in which each element has this form:

{"created_at":"Thu Aug 08 20:53:26 +0000 2013","id":365576505679568896,"id_str":"365576505679568896","text":"Who wears it better? #TBT http:\/\/t.co\/vAXNgiRmYo","source":"web"

I am trying to extract some specific parts of each element. I prepared this function:

extract[string_] := StringCases[string, {
    "\"created_at\":\"" ~~ Shortest@x__ ~~ "\",\"id\":" :> x,
    "\",\"id\":" ~~ Shortest@b__ ~~ ",\"id_str\":\"" :> b,
    ",\"id_str\":\"" ~~ Shortest@c__ ~~ "\",\"text\":\"" :> c,
    "\",\"text\":\"" ~~ Shortest@d__ ~~ "\",\"source\":\"" :> d,
    "\",\"source\":\"" ~~ Shortest@e__ ~~ "\",\"truncated\":" :> e
   }
]

And then

extract/@listOFelements

But for the element above, for example, I get this result:

{"Thu Aug 08 20:53:26 +0000 2013", "365576505679568896", "web"}

Some elements, like the text flanked by "\",\"text\":\"" and "\",\"source\":\"", are not detected in the string. How can I make StringCases detect them?

Your first expression is not closed; it's missing the end part after web...
– Murta, Commented Aug 9, 2013 at 10:47
Even the first brace is part of the string! The string ends with no brace because I manipulated it beforehand.
– Morry, Commented Aug 9, 2013 at 10:51
I can't copy the expression into MMA; it gives an error. Where is the "\",\"truncated\":" part? Can you correct it?
– Murta, Commented Aug 9, 2013 at 10:53
If the whole of your "list of strings" is one string, you don't have a list of strings.
– m_goldberg, Commented Aug 9, 2013 at 10:54
I am just copying the text directly from MMA to the web! I have a list of elements, each with the same structure as the sample above, but the information flanked by the parts in the formula differs.
– Morry, Commented Aug 9, 2013 at 10:59

C. E. · Accepted Answer · 2013-08-09 12:46:48Z

4

You say you manipulated the string before and that's why it's missing a brace at the end. It looks like you may have deformed a JSON string, in which case you did yourself a big disservice as such lists can be imported by MMA.

Let's first repair your string:

str = "{\"created_at\":\"Thu Aug 08 20:53:26 +0000 \
2013\",\"id\":365576505679568896,\"id_str\":\"365576505679568896\",\"\
text\":\"Who wears it better? #TBT \
http:\\/\\/t.co\\/vAXNgiRmYo\",\"source\":\"web\"";
repaired = str <> "}"

Now, import the string:

rules = ImportString[repaired, "JSON"];

Extract the information you want:

{"created_at", "id", "source"} /. rules

{"Thu Aug 08 20:53:26 +0000 2013", 365576505679568896, "web"}

JSON is a very popular data format, so you would do well to remember it and recognize it where it pops up.
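To apply the same repair-and-import steps to every element of a list, something along these lines may work. It is only a sketch: `listOfElements`, `repair`, `fields`, and `parse` are names made up for illustration, the sample string is shortened, and it assumes each element is a JSON object whose only defect is a missing closing brace.

```mathematica
(* Stand-in for the asker's list; each element is a JSON object missing its closing brace *)
listOfElements = {
   "{\"created_at\":\"Thu Aug 08 20:53:26 +0000 2013\",\"id\":365576505679568896,\"text\":\"Who wears it better? #TBT\",\"source\":\"web\""
   };

(* Hypothetical helper: append "}" only when the string does not already end with one *)
repair[s_String] := If[StringMatchQ[s, ___ ~~ "}"], s, s <> "}"]

(* Keys to pull out of each element *)
fields = {"created_at", "id", "text", "source"};

(* Import one element as a list of rules, then look up the keys *)
parse[s_String] := fields /. ImportString[repair[s], "JSON"]

parse /@ listOfElements
```

A key that is absent from an element is left as the bare key name in the result, so missing fields are easy to spot. If an element was cut off somewhere other than just before the final brace, appending "}" will not be enough and the import will fail for that element.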

I also note that you've asked eight questions so far and have accepted no answer for any of those questions.


The reason I did not accept the solutions is that I realized none of them is a general solution, but rather, maybe, a solution like the other answer to this post, which is not a consistent one!
– Morry, Commented Aug 9, 2013 at 13:38
And I have another element which is not importable by this function!
– Morry, Commented Aug 9, 2013 at 13:52
@Morry Perhaps the problem is the way you're asking your questions, no offense. Provide a representative sample and I will be happy to try to find a general solution.
– C. E. ♦, Commented Aug 9, 2013 at 13:59
Please take a look at this case: dl.dropboxusercontent.com/u/76785824/test.nb
– Morry, Commented Aug 9, 2013 at 14:03
Maybe you are right.... You were not offensive at all. :)
– Morry, Commented Aug 9, 2013 at 14:03


Murta · Accepted Answer · 2013-08-09 11:28:16Z

1

Assuming that your first expression is text, I prefer the RegularExpression approach, as follows:

str = "{\"created_at\":\"Thu Aug 08 20:53:26 +0000 \
2013\",\"id\":365576505679568896,\"id_str\":\"365576505679568896\",\"\
text\":\"Who wears it better? #TBT \
http:\\/\\/t.co\\/vAXNgiRmYo\",\"source\":\"web\"";

(* shorthand for RegularExpression *)
re = RegularExpression;
StringCases[str, {
    re["\"created_at\":\"(.+?)\""] -> "$1",
    re["\"id\":(.+?),"] -> "$1",
    re["\"id_str\":\"(.+?)\""] -> "$1",
    re["\"text\":\"(.+?)\","] -> "$1"
  }
]

You get:

{"Thu Aug 08 20:53:26 +0000 2013","365576505679568896","365576505679568896","Who wears it better? #TBT http:\/\/t.co\/vAXNgiRmYo"}
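One likely reason the patterns in the question miss fields is that StringCases returns non-overlapping matches by default: each pattern there ends with the delimiter that the next pattern starts with, so once a match has consumed that delimiter the neighbouring pattern has nothing left to anchor on. If stricter delimiters are still wanted, a lookahead can check the following key without consuming it. The following is a sketch under that assumption; `sample` is a shortened stand-in string, and the key order is assumed to be fixed as in the question.

```mathematica
(* Shortened stand-in for one element of the asker's list *)
sample = "{\"created_at\":\"Thu Aug 08 20:53:26 +0000 2013\",\"id\":365576505679568896,\"id_str\":\"365576505679568896\",\"text\":\"Who wears it better? #TBT\",\"source\":\"web\"";

(* (?=...) is a lookahead: it requires the next key to follow but leaves it
   unconsumed, so the next pattern can still match starting at that key *)
StringCases[sample, {
   RegularExpression["\"created_at\":\"(.+?)\"(?=,\"id\":)"] -> "$1",
   RegularExpression["\"id\":(\\d+)(?=,\"id_str\":)"] -> "$1",
   RegularExpression["\"id_str\":\"(.+?)\"(?=,\"text\":)"] -> "$1",
   RegularExpression["\"text\":\"(.+?)\"(?=,\"source\":)"] -> "$1"
   }]
```

This still treats JSON as flat text, so a value that itself contains one of the delimiter sequences would be cut short. Importing the string as JSON avoids that class of problem.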


Something to consider: (25677)
– Mr.Wizard, Commented Aug 9, 2013 at 12:06
The fact is that my text is longer, and there would be some duplicates with this procedure. I would like to make more specific patterns like this: re["\"created_at\":\"(.+?)\",\"id\":"] -> "$1", re["\"id\":(.+?),\"id_str\":\""] -> "$1", re["\",\"text\":\"(.+?)\",\"source\":\""] -> "$1", re["\"id_str\":\"(.+?)\",\"text\":\""] -> "$1", which fails to produce the proper result!
– Morry, Commented Aug 9, 2013 at 12:18
Can you give us a better example in your question?
– Murta, Commented Aug 9, 2013 at 14:05
