-3
$\begingroup$

I have a list of strings in which each element has this form:

{"created_at":"Thu Aug 08 20:53:26 +0000 2013","id":365576505679568896,"id_str":"365576505679568896","text":"Who wears it better? #TBT http:\/\/t.co\/vAXNgiRmYo","source":"web"

I am trying to extract some specific parts of each element. I prepared this function:

extract[string_] := StringCases[string, {
    "\"created_at\":\"" ~~ Shortest@x__ ~~ "\",\"id\":" :> x,
    "\",\"id\":" ~~ Shortest@b__ ~~ ",\"id_str\":\"" :> b,
    ",\"id_str\":\"" ~~ Shortest@c__ ~~ "\",\"text\":\"" :> c,
    "\",\"text\":\"" ~~ Shortest@d__ ~~ "\",\"source\":\"" :> d,
    "\",\"source\":\"" ~~ Shortest@e__ ~~ "\",\"truncated\":" :> e
   }
]

And then

extract/@listOFelements

But for the element above, for example, I get this result:

{"Thu Aug 08 20:53:26 +0000 2013", "365576505679568896", "web"}

Some fields, such as the text between "\",\"text\":\"" and "\",\"source\":\"", are not extracted from the string. How can I make the function detect them?

$\endgroup$
  • $\begingroup$ Your first expression is not closed, it's missing the end part after web... $\endgroup$ Commented Aug 9, 2013 at 10:47
  • $\begingroup$ Even the first brace is a part of the string!! The string ends with no brace, as I have manipulated it before. $\endgroup$ Commented Aug 9, 2013 at 10:51
  • $\begingroup$ I can't copy the expression into MMA; I get an error. Where is the "\",\"truncated\":" part? Can you correct it? $\endgroup$ Commented Aug 9, 2013 at 10:53
  • 1
    $\begingroup$ If the whole of your "list of strings" is one string, you don't have a list of strings. $\endgroup$ Commented Aug 9, 2013 at 10:54
  • $\begingroup$ I am just copying the text directly from MMA to the web! I have a list of elements, each with the same structure as the sample above, but the information between the delimiters in the formula differs! $\endgroup$ Commented Aug 9, 2013 at 10:59

2 Answers

4
$\begingroup$

You say you manipulated the string earlier, which is why the closing brace is missing. It looks like you may have mangled a JSON string, in which case you did yourself a big disservice, because JSON can be imported directly by MMA.

Let's first repair your string:

str = "{\"created_at\":\"Thu Aug 08 20:53:26 +0000 \
2013\",\"id\":365576505679568896,\"id_str\":\"365576505679568896\",\"\
text\":\"Who wears it better? #TBT \
http:\\/\\/t.co\\/vAXNgiRmYo\",\"source\":\"web\"";
repaired = str <> "}"

Now, import the string:

rules = ImportString[repaired, "JSON"];

Extract the information you want:

{"created_at", "id", "source"} /. rules

{"Thu Aug 08 20:53:26 +0000 2013", 365576505679568896, "web"}
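To apply the same idea to every element of the list, something like the following may work. It is only a sketch, assuming each element is a JSON object that lacks nothing but its closing brace, as in the sample; the names `listOfElements` and `extractFields` are hypothetical and not from the question.

```mathematica
(* Hypothetical list holding just the sample element from the question;
   each element is assumed to be missing only the final "}" *)
listOfElements = {
   "{\"created_at\":\"Thu Aug 08 20:53:26 +0000 2013\",\"id\":365576505679568896,\"id_str\":\"365576505679568896\",\"text\":\"Who wears it better? #TBT http:\\/\\/t.co\\/vAXNgiRmYo\",\"source\":\"web\""
   };

(* Append the missing brace, parse as JSON, then look up the wanted keys *)
extractFields[s_String] :=
  {"created_at", "id", "text", "source"} /. ImportString[s <> "}", "JSON"]

extractFields /@ listOfElements
```

Because the lookup is by key, it does not depend on the order of the fields or on what surrounds them in the string. A key that is absent from an element is left as the bare key name, so that case would need separate handling.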

JSON is a very popular data format, so you would do well to remember it and recognize it where it pops up.

I also note that you've asked eight questions so far and have accepted no answer for any of those questions.

$\endgroup$
  • $\begingroup$ The reason I did not accept the solutions is that I realized none of them is a general solution, but rather a solution like the other answer to this post, which is not a consistent one!! $\endgroup$ Commented Aug 9, 2013 at 13:38
  • $\begingroup$ And I have another element which is not importable by this function!! $\endgroup$ Commented Aug 9, 2013 at 13:52
  • $\begingroup$ @Morry Perhaps the problem is the way you're asking your questions, no offense. Provide a representative sample and I will be happy to try to find a general solution. $\endgroup$ Commented Aug 9, 2013 at 13:59
  • $\begingroup$ Please take a look at this case: dl.dropboxusercontent.com/u/76785824/test.nb $\endgroup$ Commented Aug 9, 2013 at 14:03
  • $\begingroup$ Maybe you are right.... You were not offensive at all. :) $\endgroup$ Commented Aug 9, 2013 at 14:03
1
$\begingroup$

Assuming that your first expression is text, I prefer the RegularExpression approach, as follows:

str = "{\"created_at\":\"Thu Aug 08 20:53:26 +0000 \
2013\",\"id\":365576505679568896,\"id_str\":\"365576505679568896\",\"\
text\":\"Who wears it better? #TBT \
http:\\/\\/t.co\\/vAXNgiRmYo\",\"source\":\"web";

re = RegularExpression;
StringCases[str, {
   re["\"created_at\":\"(.+?)\""] -> "$1",
   re["\"id\":(.+?),"] -> "$1",
   re["\"id_str\":\"(.+?)\""] -> "$1",
   re["\"text\":\"(.+?)\","] -> "$1"
  }
]

You get:

{"Thu Aug 08 20:53:26 +0000 2013","365576505679568896","365576505679568896","Who wears it better? #TBT http:\/\/t.co\/vAXNgiRmYo"}
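
As for why the longer patterns in the question skip some fields: StringCases returns non-overlapping matches by default, so a delimiter such as "\",\"text\":\"" that ends one match has already been consumed and cannot also start the next one. One way around this is to capture all fields with a single pattern. The following is a minimal sketch, assuming the fields always appear in this order and the string includes the closing quote after web; the name `fields` is hypothetical.

```mathematica
str = "{\"created_at\":\"Thu Aug 08 20:53:26 +0000 2013\",\"id\":365576505679568896,\"id_str\":\"365576505679568896\",\"text\":\"Who wears it better? #TBT http:\\/\\/t.co\\/vAXNgiRmYo\",\"source\":\"web\"";

(* One pattern spans the whole record, so no delimiter has to be
   shared between two separate matches *)
fields = StringCases[str,
   "\"created_at\":\"" ~~ Shortest[a__] ~~
    "\",\"id\":" ~~ Shortest[b__] ~~
    ",\"id_str\":\"" ~~ Shortest[c__] ~~
    "\",\"text\":\"" ~~ Shortest[d__] ~~
    "\",\"source\":\"" ~~ Shortest[e__] ~~ "\"" :> {a, b, c, d, e}]
```

Each record then yields one sublist with the five captured values. The trade-off is that the pattern fails as a whole if any field is missing or reordered, which is one reason parsing the string as JSON tends to be more robust.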
$\endgroup$
  • $\begingroup$ Something to consider: (25677) $\endgroup$ Commented Aug 9, 2013 at 12:06
  • $\begingroup$ The fact is that my text is longer, and this procedure would produce some duplicates. I would like to use more specific patterns like these: re["\"created_at\":\"(.+?)\",\"id\":"] -> "$1", re["\"id\":(.+?),\"id_str\":\""] -> "$1", re["\",\"text\":\"(.+?)\",\"source\":\""] -> "$1", re["\"id_str\":\"(.+?)\",\"text\":\""] -> "$1", but they fail to give the proper result! $\endgroup$ Commented Aug 9, 2013 at 12:18
  • $\begingroup$ Can you give us a better example in your question? $\endgroup$ Commented Aug 9, 2013 at 14:05
