1
$\begingroup$

I have a very large text file, test.txt, in the following format:

(Same specific character or word like '^' or '*') Title 1 (end-of-line)

Text 1 (with paragraph markers, end-of-line markers, spaces, all kinds of stuff)

(Same specific character or word like '^' or '*') Title 2 (end-of-line)

Text 2 (with paragraph markers, spaces, all kinds of stuff)

(Same specific character or word like '^' or '*') Title 3 (end-of-line)

Text 3 (with paragraph markers, spaces, all kinds of stuff)

...

I'd like to make two arrays, one corresponding to the strings { "Title 1", "Title 2", "Title 3", ...}, and the other corresponding to the strings { "Text 1", "Text 2", "Text 3", ...}.
Is there a simple one-liner to do this?

Here's a specific test example.

In the example, each line starting with ^ is a "Title" (there are three), and the material between the ^ lines is the corresponding "Text" (three sections). Note that ^ appears only as the first character of a title line, that each title ends with an end-of-line, and that each "Text" section can span multiple lines.

$\endgroup$

2 Answers

1
$\begingroup$

How about using StringSplit?

text = Import["http://pastebin.com/raw.php?i=r3W9pN2L", "Text"];
StringTrim /@ 
  StringSplit[text, StartOfLine ~~ "^" ~~ Shortest[title__] ~~ EndOfLine :> title]
$\endgroup$
  • $\begingroup$ LOL, this was flagged as "low quality". Is it the case that short posts get flagged auto-magically? I've seen this a few times with short (but quality) answers. $\endgroup$ Commented Apr 4, 2014 at 7:42
  • $\begingroup$ You might want to put the comment mark back, i.e. :> "^" <> title; then you could easily separate the titles out with StringCases. $\endgroup$ Commented Apr 4, 2014 at 14:45
  • $\begingroup$ The OP wanted only the titles, but I hope he can adapt the example to his needs. In the current form the result is always {title, body, title, body, ...}, so Partition[#, 2] followed by Transpose separates titles and bodies easily. $\endgroup$ Commented Apr 4, 2014 at 15:10
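The Partition and Transpose step mentioned in the last comment could look roughly like the sketch below. The `sample` string is a made-up stand-in for the imported file, and `parts`, `titles` and `texts` are illustrative names, not part of the original answer. It assumes every title is followed by some text; otherwise the list has odd length and Partition silently drops the last title.

```mathematica
(* Hypothetical sample text standing in for the imported file *)
sample = "^Title 1\nline a\nline b\n^Title 2\nline c\n^Title 3\nline d\n\nline e\n";

(* Same split as in the answer: each delimiter is replaced by its title,
   giving the alternating list {title, body, title, body, ...} *)
parts = StringTrim /@ 
   StringSplit[sample, 
    StartOfLine ~~ "^" ~~ Shortest[title__] ~~ EndOfLine :> title];

(* Group into {title, body} pairs, then transpose into two lists *)
{titles, texts} = Transpose[Partition[parts, 2]];
```

Here `titles` holds the title strings and `texts` holds each body as a single string, with its internal line breaks preserved.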
1
$\begingroup$
input = StringSplit[#, "\n"] & /@ 
  StringSplit[Import["c:\\Users\\Rasher\\Documents\\testtext.txt"], "^"]

title = input[[All, 1]]
text = input[[All, 2 ;;]]

The titles are in title, with the indicator stripped. The text is in the corresponding element of text, as a list in which each line of text is one element.

$\endgroup$
  • $\begingroup$ So the "2" in the "[[All, 2 ;;]]" specifies that the title is only one line? $\endgroup$ Commented Apr 4, 2014 at 7:29
  • $\begingroup$ @CA30- disregard, this is broken, not sure how it slipped by my test. Edit - copied wrong code. Yes, the assumption here is that newline after "^" is end of title. $\endgroup$ Commented Apr 4, 2014 at 7:31
  • $\begingroup$ @CA30: If you need multi-line titles, you'll need some kind of demarcation obviously, since a newline will not suffice to determine the "end" of a title. $\endgroup$ Commented Apr 4, 2014 at 7:39
  • $\begingroup$ Hmm, this seems to break up the text section into separate elements for one body of text? I get: {{"moern#@$@$@#%**((&*)FGDFSDVEEDBDFBDVDFD", "FSDEcc", "SASSKAla", "rejffeioj%%$#$$##!"}, {"fjsiu", "oe"}, {"jncsdhibcvowu", "&&&&&&&", "()()()()()()()***)(())"}} $\endgroup$ Commented Apr 4, 2014 at 13:25
  • $\begingroup$ best do StringSplit[.., StartOfLine ~~ "^"] in case the comment token happens to occur elsewhere. $\endgroup$ Commented Apr 4, 2014 at 14:55
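Combining the last two comments, a variant that splits only on a ^ at the start of a line and keeps each body as one string might look like the sketch below. The `sample` string and the names `chunks`, `pairs`, `titles` and `texts` are hypothetical. It assumes single-line titles and at least one line of text after every title; a title with no body would leave a one-element pair and make the second Part extraction fail.

```mathematica
(* Hypothetical sample; note the stray ^ inside the first body *)
sample = "^Title 1\nline a ^ caret\nline b\n^Title 2\nline c\n";

(* Split only where ^ is the first character of a line *)
chunks = StringSplit[sample, StartOfLine ~~ "^"];

(* Split each chunk at its first newline only: {title, rest of chunk} *)
pairs = StringSplit[#, "\n", 2] & /@ chunks;

titles = pairs[[All, 1]];
texts = StringTrim /@ pairs[[All, 2]];
```

Limiting the inner StringSplit to two pieces is what keeps a multi-line body together instead of breaking it into one element per line.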
