Extract string from text file

Question

I want to extract the string between two words (start, end) in a text file, but the extraction should begin after the 2nd occurrence of the start word and run until the end word.

For example, my text is

test.text <- c("During the year new factories at Haridwar for LV apparatus and at Bangalore for LV electric motors commenced production. Further increases in range and LV switchgear capacity augmentation are planned for  motors, HT motors, Drives and .")

I need to start extracting text after the second "LV" (ignoring the one that comes later), matching case-insensitively, up to "capacity".

Output should be like:

electric motors commenced production. Further increases in range and

Hi, welcome to SO. Can you please share the code you have tried so far?
– Hardik Gupta, Commented Oct 6, 2017 at 5:21
You said "ignore the one which comes later", but your expected output stops at the LV "that comes later". Shouldn't it be: electric motors commenced production. Further increases in range and LV switchgear?
– acylam, Commented Oct 6, 2017 at 13:25
Oh, sorry. I want the output to run up to "LV switchgear" and end before "capacity", like this: "electric motors commenced production. Further increases in range and LV switchgear". I just want any "LV" after the 2nd occurrence to be ignored; it should not affect the output.
– Jain0310, Commented Oct 7, 2017 at 10:36
Consider accepting the answer that helped you the most by clicking on the grey check mark under the downvote button.
– acylam, Commented Oct 9, 2017 at 12:38

akrun · Accepted Answer · 2017-10-06 05:34:36Z

2

We could locate the position and then do a substr

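library(stringr)
# end position of the 2nd "LV", plus 2 to skip the space that follows it
i1 <- str_locate_all(test.text, "LV")[[1]][2, "end"] + 2
# start position of "capacity", minus 2 to drop the space before it
i2 <- str_locate(test.text, "capacity")[1, "start"] - 2
# cut out the span, then remove everything from the next " LV" onwards
sub("\\sLV.*", "", substr(test.text, i1, i2))
#[1] "electric motors commenced production. Further increases in range and"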
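
For the case-insensitive matching the question asks for, and the output clarified in the question comments (keeping the later "LV switchgear"), a base R variant of the same locate-then-substr idea might look like the sketch below. The helper name extract_after_nth and its arguments are hypothetical, made up for this illustration; they are not part of stringr or of the answer above.

```r
test.text <- "During the year new factories at Haridwar for LV apparatus and at Bangalore for LV electric motors commenced production. Further increases in range and LV switchgear capacity augmentation are planned for  motors, HT motors, Drives and ."

# Hypothetical helper: text between the n-th match of `start`
# and the first match of `end` that follows it (case-insensitive).
extract_after_nth <- function(x, start, end, n = 2) {
  m <- gregexpr(start, x, ignore.case = TRUE, perl = TRUE)[[1]]
  if (m[1] == -1 || length(m) < n) return(NA_character_)
  # first character after the n-th start match
  from <- m[n] + attr(m, "match.length")[n]
  rest <- substr(x, from, nchar(x))
  e <- regexpr(end, rest, ignore.case = TRUE, perl = TRUE)
  if (e[1] == -1) return(NA_character_)
  # keep everything before the end word, without surrounding spaces
  trimws(substr(rest, 1, e[1] - 1))
}

extract_after_nth(test.text, "\\bLV\\b", "\\bcapacity\\b")
#[1] "electric motors commenced production. Further increases in range and LV switchgear"
```

The function returns NA when there are fewer than n start matches or no end match after the n-th one, which is an assumption about the desired behaviour rather than something specified in the question.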

answered Oct 6, 2017 at 5:34

akrun


3 Comments

Jain0310 Over a year ago

Thank You very much... It helped in big way :)

Jain0310 Over a year ago

Please explain the 2nd line of code, especially the (\\s ?! ) notation. I want to start extraction from the 2nd occurrence of any one of the starting words and stop at the 2nd occurrence of any one of the ending words. For example, with cities <- c("Sydney Banglore Mumbai Newyork banglore LA LS banglore London Chicago mumbai Miami"), start extraction from the 2nd occurrence of Banglore, chennai, or New South Wales (only one of them will be present, not all) and stop at the 2nd occurrence of mumbai, Michigan, or New Delhi (only one of them will be present, not all). The output should be "LA LS banglore London Chicago". Please help.

akrun Over a year ago

@JainArihant The \\s matches a whitespace character. In an R string literal the backslash must itself be escaped, so the regex \s is written as "\\s".
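
A quick way to see the escaping at work is the small sketch below; it uses only base R functions, and the sample strings are made up for illustration.

```r
# The R literal "\\s" holds two characters: a backslash and an "s".
nchar("\\s")
#[1] 2

# The regex engine therefore receives \s, which matches whitespace.
grepl("\\s", c("a b", "ab"))
#[1]  TRUE FALSE

# Same pattern as in the answer: drop everything from " LV" onwards.
sub("\\sLV.*", "", "range and LV switchgear")
#[1] "range and"
```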

acylam · Accepted Answer · 2017-10-06 13:34:16Z

1

A solution with strsplit:

strsplit(test.text, "\\sLV\\s")[[1]][3]    
# [1] "electric motors commenced production. Further increases in range and"

strsplit(test.text, "\\s(LV(?!\\sswitchgear)|capacity)\\s", perl = TRUE)[[1]][3]
# [1] "electric motors commenced production. Further increases in range and LV switchgear"

The first line gives OP's expected output. The second line gives what I think OP really meant.
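
To see why the third element is the one to pick, it may help to look at all the pieces the first split produces. This is only an illustrative sketch; the variable name parts is made up here.

```r
test.text <- "During the year new factories at Haridwar for LV apparatus and at Bangalore for LV electric motors commenced production. Further increases in range and LV switchgear capacity augmentation are planned for  motors, HT motors, Drives and ."

# Splitting on " LV " yields one piece before each "LV" and one after the last.
parts <- strsplit(test.text, "\\sLV\\s")[[1]]
length(parts)
#[1] 4

parts[2]  # between the 1st and 2nd "LV"
#[1] "apparatus and at Bangalore for"

parts[3]  # between the 2nd and 3rd "LV"
#[1] "electric motors commenced production. Further increases in range and"
```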

answered Oct 6, 2017 at 13:34

acylam


2 Comments

Jain0310 Over a year ago

Please explain the 2nd line of code, especially the (\\s ?! ) notation. I want to start extraction from the 2nd occurrence of any one of the starting words and stop at the 2nd occurrence of any one of the ending words. For example, with cities <- c("Sydney Banglore Mumbai Newyork banglore LA LS banglore London Chicago mumbai Miami"), start extraction from the 2nd occurrence of Banglore, chennai, or New South Wales (only one of them will be present, not all) and stop at the 2nd occurrence of mumbai, Michigan, or New Delhi (only one of them will be present, not all). The output should be "LA LS banglore London Chicago". Please help.

acylam Over a year ago

@JainArihant \\s stands for a whitespace character. (?!\\sswitchgear) is a negative lookahead, meaning "not followed by " switchgear"", so LV(?!\\sswitchgear) matches every "LV" that is not immediately followed by a space and "switchgear". For the new specification, either edit your question or ask a new question. Adding requirements like that in the comments is generally discouraged.
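
The effect of the lookahead can be checked by counting matches with and without it. The sketch below uses a shortened sample string made up for illustration; note that lookaheads need perl = TRUE in base R.

```r
x <- "for LV apparatus and for LV electric motors; range and LV switchgear capacity"

# Every "LV" in the string.
plain <- gregexpr("LV", x, perl = TRUE)[[1]]

# Only those "LV" not immediately followed by " switchgear".
guarded <- gregexpr("LV(?!\\sswitchgear)", x, perl = TRUE)[[1]]

length(plain)
#[1] 3
length(guarded)
#[1] 2

# The lookahead consumes no characters: the surviving matches
# start at the same positions as the first two plain matches.
identical(as.integer(guarded), as.integer(plain)[1:2])
#[1] TRUE
```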
