I am trying to extract every document from a text, with each document keeping its category (e.g. A, B, C).

A     <some text1> 

B     <some text2> 

C     <some text3> 

However, when I apply this regex:

ptrn='\n[A-z]*\t'     

pattern1= '(.*)'+ptrn      

f = re.findall(pattern1,test_doc)      

it gives me:

f[0] = A     <some text1> 

f[1] = <some text2> 

f[2] = <some text3> 

But I want:

f[0] = A     <some text1>

f[1] = B     <some text2>

f[2] = C     <some text3>

http://csmining.org/tl_files/Project_Datasets/r8%20r52/r8-test-all-terms.txt

This link has the raw text of many documents. Each document has the following pattern:

category<tab><sometext> \n 

Hence the whole corpus looks like this:

category<tab><sometext1> \n 

category<tab><sometext2> \n

.

.

I want:

doc[0] = category<tab><sometext1>

doc[1] = category<tab><sometext2>

.
.
and so on

Any answer/hint will be very helpful :)

  • Wait so you want to find all the text? Why do you need regex? Is there other text somewhere that you don't want? Commented Jan 27, 2018 at 3:44
  • Why not just use s.split('\n')? Commented Jan 27, 2018 at 3:49
  • @EvanNowak because <some text> can contain '\n' and it will split within <some text> . Commented Jan 27, 2018 at 3:51
  • Maybe you're looking for something like this, but it's hard to tell. Can you give us a more specific example input/output? Commented Jan 27, 2018 at 4:01
  • csmining.org/tl_files/Project_Datasets/r8%20r52/… this link has some raw text of many documents. each document has following pattern: category<tab><sometext1> \n category<tab><sometext2> \n . . i want doc[1] = category<tab><sometext1> doc[2] = category<tab><sometext2> and so on Commented Jan 27, 2018 at 4:10

1 Answer

Try the following pattern:

import re
pattern = r"(\w+)(\t)(.*)(\b)"

Explanation

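  • (\w+) matches one or more word characters (the category)
  • (\t) matches the tab character literally
  • (.*) matches the rest of the line (by default, . matches everything except line terminators)
  • (\b) asserts a word boundary, so the match ends at the last word character on the line; as a zero-width assertion, this group always captures an empty string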
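As a minimal sketch of how the pattern might be applied: the sample string test_doc below is made up to stand in for the real corpus, and the name docs is just for illustration. Because the pattern contains capture groups, re.findall returns tuples of groups rather than whole lines, so this sketch uses re.finditer and takes group(0) instead.

```python
import re

# Pattern from the answer: category, tab, rest of the line, word boundary.
pattern = r"(\w+)(\t)(.*)(\b)"

# Hypothetical sample input; the real corpus is assumed to look like this.
test_doc = "A\tsome text1\nB\tsome text2\nC\tsome text3\n"

# findall would return one tuple of groups per match, so use finditer
# and take group(0), the full text of each match.
docs = [m.group(0) for m in re.finditer(pattern, test_doc)]

for i, doc in enumerate(docs):
    print(i, repr(doc))
```

The original attempt loses the categories of the later documents because its trailing '\n[A-z]*\t' consumes the next document's category, so the following match can only start after it. Note also that [A-z] matches a few punctuation characters that sit between Z and a in ASCII; [A-Za-z] is probably what was meant.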

See a demo on regex101
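The pattern above works line by line. A comment on the question says that <some text> can itself contain '\n'; for that case, a different approach (not the one in this answer) is to split only at newlines that are followed by a category and a tab. The sketch below assumes categories consist of ASCII letters only and that the text never contains a line starting with letters followed by a tab; the sample string is invented.

```python
import re

# Hypothetical sample: document A spans two lines.
test_doc = "A\tfirst line\nsecond line of A\nB\tsome text2\nC\tsome text3\n"

# Split at a newline only when it is followed by letters and a tab.
# The lookahead (?=...) does not consume the category, so it stays
# attached to the document that follows.
docs = re.split(r"\n(?=[A-Za-z]+\t)", test_doc.strip())

for i, doc in enumerate(docs):
    print(i, repr(doc))
```

Here docs[0] keeps both lines of document A, because the newline inside it is not followed by a category and a tab.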
