0

I have the following data and need to extract the first string on each line, which is separated from the rest of the data by \t. I have tried split() and a regex, but each line takes more than 1 second to process. Is there any way to do this faster?

Data:

DT  0.00155095460731831934 0.00121897344629313064 0.00000391325536877105 0.09743272975663436197 0.00002271067721789807 0.00614528909266214615 0.00000445295550745487 0.70422975214810612510 0.00000042521183266708 0.00080380970031485965 0.00046229528280753270 0.00019894095277762626 0.00041012830368947716 0.00013156663380611624 0.00000001065986007929 0.00004244196517011733 0.00061444160944146384 0.02101761386512242258 0.00010328516871273944 0.00001128873771536226 0.00279163054567377073 0.00018903663417650421 0.00006490063677390687 0.00002151218889856898 0.00032824534915777535 0.00040349658620449016 0.00042393411014689220 0.00053643791028589382 0.00001032961180051124 0.00025743865541833909 0.00011497457801324625 0.00005359814320647386 0.00010336445810407512 0.00040942464084107332 0.00009098970100047888 0.00000091369931486168 0.00059479547081431436 0.00000009853464391239 0.00020303484015768289 0.00050594563648307127 0.15679657927655424321 0.00034115929559768240 0.00115490132012489345 0.00019823414624750937
PRP 0.00000131203717608417 0.99998368311809904263 0.00000002192874737415 0.00000073240710142655 0.00000000536610432900 0.00000195554704853124 0.00000000012203475361 0.00000017206852489982 0.00000040268728691384 0.00000034167449501884 0.00000077203219019333 0.00000003082351874675 0.00000052849070550174 0.00000319144710228690 0.00000000009512989203 0.00000002016363199180 0.00000005598551431381 0.00000129166108708107 0.00000004127954869435 0.00000099983230311242 0.00000032415702089502 0.00000010477525952469 0.00000000011045642123 0.00000006942075882668 0.00000017433924380308 0.00000028874823360049 0.00000048656924101513 0.00000017722073116061 0.00000037193481161874 0.00000000452174124394 0.00000081986547018432 0.00000001740977711224 0.00000000808377988046 0.00000001418892143074 0.00000045250939471023 0.00000000000050232556 0.00000043504206149021 0.00000011310292804313 0.00000000013241046549 0.00000015302998639348 0.00000002800056509608 0.00000038361859715043 0.00000000099713364069 0.00000001345362455494
VBD 0.00000002905639670475 0.00000000730896486886 0.00000000406530491040 0.00000009048972500851 0.00000000380338117015 0.00000000000390031394 0.00000000169948197867 0.00000000091890304843 0.00000000013856552537 0.00000191013917141413 0.00000002300239228881 0.00000003601993413087 0.00000004266629173115 0.00000000166497478879 0.00000000000079281873 0.00000180895378547175 0.00000000000159251758 0.00000000081310874277 0.00000000334322892919 0.99999591744268101490 0.00000000000454647012 0.00000000060884665646 0.00000000000010515727 0.00000000019245471748 0.00000000308524019147 0.00000001376847404364 0.00000001449670334202 0.00000001434634011983 0.00000000656887521298 0.00000000796791556475 0.00000000578334901413 0.00000000142124935798 0.00000000213053365838 0.00000000487780229311 0.00000001702409705978 0.00000000391793832836 0.00000001292779157438 0.00000000002447935587 0.00000000000435117453 0.00000000408872313468 0.00000000007201124397 0.00000000431736839121 0.00000000002970930698 0.00000000080852330796
RB  0.00000015663242474016 0.00000002464350694082 0.00000000095443410385 0.99998778106321006831 0.00000000021007124986 0.00000006156902517681 0.00000000277279124155 0.00000000301727284928 0.00000000030682776953 0.00000007379165980724 0.00000012399749754355 0.00000494600825959811 0.00000008488215978963 0.00000000897527112360 0.00000000000009257081 0.00000000223574222125 0.00000000371653801739 0.00000548300954899374 0.00000001802212638276 0.00000000022437343140 0.00000001084514551630 0.00000000328207000562 0.00000000672649111321 0.00000003640165688536 0.00000050812474700731 0.00000007422081603379 0.00000018000760320187 0.00000007733588104368 0.00000008890139839523 0.00000001494850369145 0.00000003233439691280 0.00000000299507821025 0.00000000501198681017 0.00000000271863832841 0.00000004782796496077 0.00000000000160157399 0.00000006968900381578 0.00000000003199719817 0.00000001234122837743 0.00000002204081342858 0.00000000038818632144 0.00000002327335651712 0.00000000016015202564 0.00000000435845392228
VBN 0.00222925562857408935 0.00055631931823257885 0.00000032474066230587 0.00333293927262896372 0.12594759350192680225 0.00142014631420757115 0.00008260266473343272 0.00001658664201138300 0.00000444848747905589 0.00025881226046863004 0.00176478222683846956 0.00226268536384150636 0.00120807701719786715 0.00016158429451364274 0.00000000200391980114 0.00012971908549403702 0.41488930515218963579 0.41237674095727266943 0.00025649814681915863 0.00001340291420511781 0.00067983726358035045 0.00001718712609473795 0.00009573412529081616 0.02342065200703593100 0.00010281749829896253 0.00243912549478067552 0.00111221146411718771 0.00110067534479759994 0.00048702441892562549 0.00014537544850052323 0.00046019613393571187 0.00004100416046505168 0.00001820421200359182 0.00013212194667244404 0.00112515351673182361 0.00000022002597310723 0.00099184191436586821 0.00000187809735682276 0.00000214888688830288 0.00031369371619907773 0.00000552482376141306 0.00033123576486582436 0.00000227934800338172 0.00006203126813779618

So, the bottom line is that I need to extract DT, PRP, VBD, ... from the above text really fast.

  • I can't imagine how splitting a relatively small line takes more than 1 second in Python. Can you provide your code? Commented Feb 24, 2015 at 15:05
  • Perhaps, you can accept an answer? If your problem remains unsolved, add some information to your post. Commented Mar 9, 2015 at 10:29

3 Answers

2

You can call split with the maxsplit argument and wrap it in a list comprehension.

result = [line.split('\t', 1)[0] for line in data] 

As you can see, passing 1 as maxsplit makes split stop after the first separator. I bet this is the fastest solution in Python.
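To make the effect of maxsplit concrete, here is a minimal sketch. It assumes a shortened, made-up sample line (`line` is illustrative, not the asker's real data) and also shows `str.partition`, a standard string method that likewise stops at the first separator:

```python
# Hypothetical, shortened sample line; the real lines carry many more floats.
line = "DT\t0.00155095460731831934 0.00121897344629313064 0.00000391325536877105"

# Bounded split: at most one cut, so the result has at most two items.
head_and_rest = line.split("\t", 1)
first_by_split = head_and_rest[0]

# str.partition also stops at the first separator and returns a 3-tuple
# (head, separator, tail).
first_by_partition, sep, tail = line.partition("\t")

print(first_by_split)       # DT
print(len(head_and_rest))   # 2
print(first_by_partition)   # DT
```

Either way, the long tail of numbers is never cut into separate strings, which is where a plain split() spends most of its time.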

A manual alternative.

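from itertools import takewhile


def my_split(line):
    # Keep characters up to (not including) the first tab. Raising
    # StopIteration inside a generator expression to stop early is a
    # RuntimeError on Python 3.7+ (PEP 479), so takewhile is used instead.
    return ''.join(takewhile(lambda char: char != '\t', line))


result = [my_split(line) for line in lines]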

Provided your data are in a file:

with open(file) as data:
    result = [my_split(line) for line in data]

This will be a lot slower than the first one.
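If you want to check the speed difference on your own machine, a rough timing sketch could look like the following. The sample line is synthetic, the function names are made up for illustration, and the character-by-character version is written with `itertools.takewhile` as a stand-in for the manual approach; no particular timings are claimed here.

```python
import timeit
from itertools import takewhile

# Synthetic line: a tag, a tab, then 44 space-separated numbers (illustrative only).
line = "VBD\t" + " ".join(["0.00000002905639670475"] * 44)


def first_by_split(text):
    # Bounded split: stop after the first tab.
    return text.split("\t", 1)[0]


def first_by_chars(text):
    # Walk the characters in Python until the first tab.
    return "".join(takewhile(lambda char: char != "\t", text))


# Both approaches must agree before their timings are worth comparing.
assert first_by_split(line) == first_by_chars(line) == "VBD"

split_time = timeit.timeit(lambda: first_by_split(line), number=100_000)
chars_time = timeit.timeit(lambda: first_by_chars(line), number=100_000)
print(f"split: {split_time:.3f}s  char-by-char: {chars_time:.3f}s")
```

The absolute numbers depend on the interpreter and hardware, so treat the output as a relative comparison only.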

1 Comment

Nice answer. I was thinking about just slicing the first 4 characters of each string, which would pull out the POS tags here, but it doesn't work because of the tab separation.
1

You can use split in a list comprehension:

>>> s="""DT  0.00155095460731831934 0.00121897344629313064 0.00000391325536877105 0.09743272975663436197 0.00002271067721789807 0.00614528909266214615 0.00000445295550745487 0.70422975214810612510 0.00000042521183266708 0.00080380970031485965 0.00046229528280753270 0.00019894095277762626 0.00041012830368947716 0.00013156663380611624 0.00000001065986007929 0.00004244196517011733 0.00061444160944146384 0.02101761386512242258 0.00010328516871273944 0.00001128873771536226 0.00279163054567377073 0.00018903663417650421 0.00006490063677390687 0.00002151218889856898 0.00032824534915777535 0.00040349658620449016 0.00042393411014689220 0.00053643791028589382 0.00001032961180051124 0.00025743865541833909 0.00011497457801324625 0.00005359814320647386 0.00010336445810407512 0.00040942464084107332 0.00009098970100047888 0.00000091369931486168 0.00059479547081431436 0.00000009853464391239 0.00020303484015768289 0.00050594563648307127 0.15679657927655424321 0.00034115929559768240 0.00115490132012489345 0.00019823414624750937
... PRP 0.00000131203717608417 0.99998368311809904263 0.00000002192874737415 0.00000073240710142655 0.00000000536610432900 0.00000195554704853124 0.00000000012203475361 0.00000017206852489982 0.00000040268728691384 0.00000034167449501884 0.00000077203219019333 0.00000003082351874675 0.00000052849070550174 0.00000319144710228690 0.00000000009512989203 0.00000002016363199180 0.00000005598551431381 0.00000129166108708107 0.00000004127954869435 0.00000099983230311242 0.00000032415702089502 0.00000010477525952469 0.00000000011045642123 0.00000006942075882668 0.00000017433924380308 0.00000028874823360049 0.00000048656924101513 0.00000017722073116061 0.00000037193481161874 0.00000000452174124394 0.00000081986547018432 0.00000001740977711224 0.00000000808377988046 0.00000001418892143074 0.00000045250939471023 0.00000000000050232556 0.00000043504206149021 0.00000011310292804313 0.00000000013241046549 0.00000015302998639348 0.00000002800056509608 0.00000038361859715043 0.00000000099713364069 0.00000001345362455494
... VBD 0.00000002905639670475 0.00000000730896486886 0.00000000406530491040 0.00000009048972500851 0.00000000380338117015 0.00000000000390031394 0.00000000169948197867 0.00000000091890304843 0.00000000013856552537 0.00000191013917141413 0.00000002300239228881 0.00000003601993413087 0.00000004266629173115 0.00000000166497478879 0.00000000000079281873 0.00000180895378547175 0.00000000000159251758 0.00000000081310874277 0.00000000334322892919 0.99999591744268101490 0.00000000000454647012 0.00000000060884665646 0.00000000000010515727 0.00000000019245471748 0.00000000308524019147 0.00000001376847404364 0.00000001449670334202 0.00000001434634011983 0.00000000656887521298 0.00000000796791556475 0.00000000578334901413 0.00000000142124935798 0.00000000213053365838 0.00000000487780229311 0.00000001702409705978 0.00000000391793832836 0.00000001292779157438 0.00000000002447935587 0.00000000000435117453 0.00000000408872313468 0.00000000007201124397 0.00000000431736839121 0.00000000002970930698 0.00000000080852330796
... RB  0.00000015663242474016 0.00000002464350694082 0.00000000095443410385 0.99998778106321006831 0.00000000021007124986 0.00000006156902517681 0.00000000277279124155 0.00000000301727284928 0.00000000030682776953 0.00000007379165980724 0.00000012399749754355 0.00000494600825959811 0.00000008488215978963 0.00000000897527112360 0.00000000000009257081 0.00000000223574222125 0.00000000371653801739 0.00000548300954899374 0.00000001802212638276 0.00000000022437343140 0.00000001084514551630 0.00000000328207000562 0.00000000672649111321 0.00000003640165688536 0.00000050812474700731 0.00000007422081603379 0.00000018000760320187 0.00000007733588104368 0.00000008890139839523 0.00000001494850369145 0.00000003233439691280 0.00000000299507821025 0.00000000501198681017 0.00000000271863832841 0.00000004782796496077 0.00000000000160157399 0.00000006968900381578 0.00000000003199719817 0.00000001234122837743 0.00000002204081342858 0.00000000038818632144 0.00000002327335651712 0.00000000016015202564 0.00000000435845392228
... VBN 0.00222925562857408935 0.00055631931823257885 0.00000032474066230587 0.00333293927262896372 0.12594759350192680225 0.00142014631420757115 0.00008260266473343272 0.00001658664201138300 0.00000444848747905589 0.00025881226046863004 0.00176478222683846956 0.00226268536384150636 0.00120807701719786715 0.00016158429451364274 0.00000000200391980114 0.00012971908549403702 0.41488930515218963579 0.41237674095727266943 0.00025649814681915863 0.00001340291420511781 0.00067983726358035045 0.00001718712609473795 0.00009573412529081616 0.02342065200703593100 0.00010281749829896253 0.00243912549478067552 0.00111221146411718771 0.00110067534479759994 0.00048702441892562549 0.00014537544850052323 0.00046019613393571187 0.00004100416046505168 0.00001820421200359182 0.00013212194667244404 0.00112515351673182361 0.00000022002597310723 0.00099184191436586821 0.00000187809735682276 0.00000214888688830288 0.00031369371619907773 0.00000552482376141306 0.00033123576486582436 0.00000227934800338172 0.00006203126813779618"""
>>> [i.split()[0] for i in s.split('\n')]
['DT', 'PRP', 'VBD', 'RB', 'VBN']
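Note that split() with no arguments cuts the whole line on every run of whitespace before the first item is picked. A possible variation, sketched here on a shortened, made-up sample (`s` below is not the full data), limits the work to a single cut with `split(None, 1)` and uses `splitlines()` to break the text into rows:

```python
# Hypothetical, shortened sample; one row uses a tab, the other two spaces.
s = "DT\t0.0015 0.0012\nRB  0.0000 0.9999"

# split(None, 1) splits on any run of whitespace but cuts only once.
tags = [row.split(None, 1)[0] for row in s.splitlines()]
print(tags)  # ['DT', 'RB']
```

This assumes every row is non-empty; an empty row would make `[0]` raise an IndexError, so filter those out first if they can occur.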

0
import re

# Match the leading run of non-whitespace characters on every line.
p = re.compile(r'^\S+', re.MULTILINE)

# test_str holds the whole multi-line text.
p.findall(test_str)

This gives you the list of strings you want.
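For a self-contained run, here is a small sketch with a shortened, made-up `test_str`. It also swaps `\S+` for a character class that stops only at a tab or newline, which is an assumption about the data: it matters only if the first field could itself contain spaces.

```python
import re

# Hypothetical, shortened sample standing in for test_str.
test_str = "DT\t0.0015 0.0012\nPRP\t0.0000 0.9999\nVBD\t0.0000 0.0000"

# With re.MULTILINE, ^ anchors at the start of every line;
# [^\t\n]+ takes everything up to the first tab on that line.
pattern = re.compile(r"^[^\t\n]+", re.MULTILINE)
print(pattern.findall(test_str))  # ['DT', 'PRP', 'VBD']
```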
