Decomposing a large text file, with alternating headers and text, into an array of headers and an array of text segments

Question

I have a very large text file, test.txt, in the following format:

(Same specific character or word like '^' or '*') Title 1 (end-of-line)

Text 1 (with paragraph markers, end-of-line markers, spaces, all kinds of stuff)

(Same specific character or word like '^' or '*') Title 2 (end-of-line)

Text 2 (with paragraph markers, spaces, all kinds of stuff)

(Same specific character or word like '^' or '*') Title 3 (end-of-line)

Text 3 (with paragraph markers, spaces, all kinds of stuff)

...

I'd like to make two arrays, one corresponding to the strings { "Title 1", "Title 2", "Title 3", ...}, and the other corresponding to the strings { "Text 1", "Text 2", "Text 3", ...}.
Is there a simple one-liner to do this?

Here's a specific test example.

Here each line starting with ^ should be a "Title" (of which there are three), and the material between the ^ lines corresponds to "Text" (of which there are three sections). Note that ^ only appears as the first character of a "Title" line, that each title ends with an end-of-line, and that each "Text" section can span multiple lines.

halirutan · Accepted Answer · 2014-04-04 07:21:11Z

How about using StringSplit?

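text = Import["http://pastebin.com/raw.php?i=r3W9pN2L", "Text"];
(* Each "^Title" line is replaced by the captured title, so the result alternates {title, body, title, body, ...} *)
StringTrim /@ 
  StringSplit[text, StartOfLine ~~ "^" ~~ Shortest[title__] ~~ EndOfLine :> title]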

answered Apr 4, 2014 at 7:21

LOL, this was flagged as "low quality". Is it the case that short posts get flagged auto-magically? I've seen this a few times with short (but quality) answers.
– ciao, Apr 4, 2014 at 7:42

You might want to put the comment mark back, i.e. :> "^"<>title; then you could easily separate them out with StringCases.
– george2079, Apr 4, 2014 at 14:45

The OP wanted only the titles, but I hope he can adapt the example to his needs himself. In its current form the result is always {title, body, title, body, ...}, so a Partition[#, 2] followed by a Transpose separates titles and bodies easily.
– halirutan, Apr 4, 2014 at 15:10
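
The Partition and Transpose step mentioned in the comment above could look roughly like the following. This is only a sketch: the inline sample string is a made-up stand-in for the imported file, and the names parts, titles and bodies are illustrative. It assumes the file begins with a title line, so that the split result really does alternate title, body, title, body.

```mathematica
(* Hypothetical sample standing in for the imported file *)
text = "^Title 1\nText 1 line a\nText 1 line b\n^Title 2\nText 2\n^Title 3\nText 3\n";

(* Same split as in the answer: each "^..." line is replaced by the captured title *)
parts = StringTrim /@ 
   StringSplit[text, StartOfLine ~~ "^" ~~ Shortest[title__] ~~ EndOfLine :> title];

(* Pair up {title, body} and then separate the two columns *)
{titles, bodies} = Transpose[Partition[parts, 2]];
```

If the file contains text before the first title line, the alternation is shifted by one and the pairing would need to be adjusted first.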

ciao · Accepted Answer · 2014-04-04 07:35:06Z

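(* split the file at every "^", then split each section into lines *)
input = StringSplit[#, "\n"] & /@ 
  StringSplit[Import["c:\\Users\\Rasher\\Documents\\testtext.txt"], "^"]

(* the first line of each section is the title; the remaining lines are the text *)
title = input[[All, 1]]
text = input[[All, 2 ;;]]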

The titles end up in title, with the ^ indicator stripped. Each element of text holds the corresponding text section as a list, with one element per line of text.

edited Apr 4, 2014 at 7:35

answered Apr 4, 2014 at 7:26

So the "2" in the "[[All, 2 ;;]]" specifies that the title is only one line?
– CA30, Apr 4, 2014 at 7:29

@CA30: Disregard, this is broken; not sure how it slipped by my test. Edit: I had copied the wrong code. Yes, the assumption here is that the first newline after "^" marks the end of the title.
– ciao, Apr 4, 2014 at 7:31

@CA30: If you need multi-line titles, you'll need some kind of demarcation obviously, since a newline will not suffice to determine the "end" of a title.
– ciao, Apr 4, 2014 at 7:39

Hmm, this seems to break up the text section into separate elements for one body of text? I get: {{"moern#@$@$@#%**((&*)FGDFSDVEEDBDFBDVDFD", "FSDEcc", "SASSKAla", "rejffeioj%%$#$$##!"}, {"fjsiu", "oe"}, {"jncsdhibcvowu", "&&&&&&&", "()()()()()()()***)(())"}}
– CA30, Apr 4, 2014 at 13:25

Best to use StringSplit[.., StartOfLine ~~ "^"] in case the comment token happens to occur elsewhere.
– george2079, Apr 4, 2014 at 14:55
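
Taking the last two comments together, one possible variant splits only on a "^" at the start of a line and keeps each text section as a single string instead of a list of lines. This is a sketch rather than the answerer's code: the sample string and the names sections, pairs, titles and bodies are made up for illustration, and it assumes every title line is followed by a non-empty text section.

```mathematica
(* Hypothetical sample; note the "^" inside the second text section *)
text = "^Title 1\nText 1 line a\nText 1 line b\n^Title 2\nText 2 with a ^ inside\n^Title 3\nText 3\n";

(* Split only where "^" is the first character of a line *)
sections = StringSplit[text, StartOfLine ~~ "^"];

(* Split each section at its first newline only: {title, rest of the section} *)
pairs = StringSplit[#, "\n", 2] & /@ sections;

titles = pairs[[All, 1]];
bodies = StringTrim /@ pairs[[All, 2]];
```

The third argument of StringSplit caps the number of pieces, so the newlines inside a text section are left untouched and each body stays one string.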
