How to extract strings from a full text using R?

Question

I am stuck on a problem. I have more than 3,000 observations, each of which is a full text. For example:

text="Ganluo County People's Court of X Province。The plaintiff X, female, born on May, 1980, lives in X County, X Province。The defendant X, male, born on May, 1971, lives in X County, X Province。
It is a divorce dispute, according to 《marriage law》on June 21, 2016。"

Now I want to extract the information for the plaintiff and the defendant, and I also want to know whether the full text contains the phrase "《marriage law》" (T for yes, F for no).

Thus, I want to have the following results:

text	plaintiff	defendant	law
Ganluo County People's Court of X Province。The plaintiff X, female, born on May, 1980, lives in X County, X Province。The defendant X, male, born on May, 1971, lives in X County, X Province。It is a divorce dispute, according to 《marriage law》on June 21, 2016。	The plaintiff X, female, born on May, 1980, lives in X County, X Province。	The defendant X, male, born on May, 1971, lives in X County, X Province。	T

I tried several times, but it does not work. Many thanks for your kind help!

Follow up:

Thank you for your answers. However, the difficulty is that the whole text may have many sentences that start with "the plaintiff" and end with the punctuation "。". How can I extract only the first appearance of the sentence with the plaintiff's birth and residence information? The order is not fixed, but the punctuation is always used.

For example, the whole text may also have a sentence like "the plaintiff declares that he is wrong。" The pattern given in the previous answer will also extract this sentence, which I do not want.

This will depend on how consistent your data is. Does it always have one sentence about the plaintiff, one sentence about the defendant and one sentence about the nature of the complaint?
– G5W, Commented Dec 21, 2022 at 15:35
The order is not fixed; the "plaintiff" and "defendant" information may come in the second, third, fourth, ... sentence. However, there is always one sentence describing this information, and that is the one I need to extract.
– Xinyan LIU, Commented Dec 22, 2022 at 4:14

Andre Wildberg · Accepted Answer · 2022-12-22 10:47:51Z

1

An approach using str_extract and sub. The substitution removes any follow-up sentences, if they exist, so the detected plaintiff and defendant can only be one sentence long (。 is the separator).

library(dplyr)
library(stringr)

tibble(text) %>% 
  mutate(plaintiff = sub("(。).*", "\\1", str_extract(text, "The plaintiff.*。")), 
         defendant = sub("(。).*", "\\1", str_extract(text, "The defendant.*。")), 
         law = grepl("《marriage law》", text)) %>% 
  print(Inf)
# A tibble: 1 × 4
  text                                                     plain…¹ defen…² law  
  <chr>                                                    <chr>   <chr>   <lgl>
1 "Ganluo County People's Court of X Province。The plaint… The pl… The de… TRUE 
# … with abbreviated variable names ¹plaintiff, ²defendant

full output

# A tibble: 1 × 4
  text                                                                          
  <chr>                                                                         
1 "Ganluo County People's …
  plaintiff                                                                  
  <chr>                                                                      
1 The plaintiff X, female, born on May, 1980, lives in X County, X Province。
  defendant                                                                
  <chr>                                                                    
1 The defendant X, male, born on May, 1971, lives in X County, X Province。
  law  
  <lgl>
1 TRUE

extended data

text <- "Ganluo County People's Court of X Province。The plaintiff X, female, born on May, 1980, lives in X County, X Province。The defendant X, male, born on May, 1971, lives in X County, X Province。The plaintiff wuen weofioi woe fowie fowie fowei f。The defendant wuen weofioi woe fowie fowie fowei f。The plaintiff wuen weofioi woe fowie fowie fowei f。The defendant wuen weofioi woe fowie fowie fowei f。The plaintiff wuen weofioi woe fowie fowie fowei f。The plaintiff wuen weofioi woe fowie fowie fowei f。\nIt is a divorce dispute, according to 《marriage law》on June 21, 2016。"
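
As a variation on the same idea (a sketch, not part of the original answer), a negated character class can stop the match at the first 。 directly, which makes the sub() step unnecessary. The helper name first_sentence and the shortened demo string are made up for illustration.

```r
library(stringr)

# Hypothetical helper: return the first sentence that starts with `prefix`
# and ends at the next "。". [^。]* cannot cross a sentence boundary.
# Note that `prefix` is used as part of a regex, so it should not contain
# regex special characters.
first_sentence <- function(x, prefix) {
  str_extract(x, paste0(prefix, "[^。]*。"))
}

demo <- "Court of X Province。The plaintiff X, female, lives in X Province。The defendant X, male, lives in X Province。The plaintiff declares something。"

first_sentence(demo, "The plaintiff")
first_sentence(demo, "The defendant")
```

Because str_extract only returns the first match, later sentences that start with the same words are ignored. This still picks the first such sentence regardless of its content.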

edited Dec 22, 2022 at 10:47

answered Dec 21, 2022 at 15:49

Andre Wildberg


3 Comments

Xinyan LIU Over a year ago

Thank you for your answer. However, the difficulty is that the whole text may have many sentences with a similar pattern. How can I extract just the first appearance of the pattern?

Andre Wildberg Over a year ago

@XinyanLIU Gonna check...

Andre Wildberg Over a year ago

@XinyanLIU I extended the example and put in more "The plaintiff" and "The defendant" sentences, and it already prints just the first appearance, so it should work. If it doesn't work with your real data, please adjust the example so I can test it with that.

Kat · Accepted Answer · 2022-12-22 15:44:54Z

1

Update

With the additional information you've provided, see if this works for you.

This assumes that there is only one sentence each for the plaintiff and the defendant that mentions a province. I've added [^。]* after 'rovince' (as in Province), so that if Province is not the end of the sentence, the match still runs to the end of that sentence and no further. I left off the P so that inconsistent capitalization doesn't matter.

I've used [^。]+ to capture anything except the sentence-ending 。 so each group can only capture one sentence.

It still assumes that the sentence begins with "The plaintiff" (or defendant) and, because both sentences are matched by a single pattern, that the plaintiff's sentence comes before the defendant's.

If this does not work, you'll really need to provide several more examples of potential content.

library(tidyverse)

td3 <- data.frame(oText = text) %>% 
  extract(col = oText, into = c('plaintiff', 'defendant'), remove = FALSE,
          regex = "^.*(The plaintiff[^。]+rovince[^。]*。).*(The defendant[^。]+rovince[^。]*。).*") %>% 
  mutate(law = str_detect(oText, 'marriage law'))
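
If the two sentences can appear in either order, one alternative is to split the text into sentences first and then keep the first sentence that starts with the right words and also contains some marker words. This is only a sketch: the helper name first_matching_sentence and the marker words "born" and "lives in" are assumptions taken from the example text, not something the real data is known to guarantee.

```r
library(stringr)

# Hypothetical helper: first sentence that starts with `prefix` and contains
# every string in `keywords` (assumed markers of the personal-details sentence).
# Works on a single string; returns NA if no sentence qualifies.
first_matching_sentence <- function(x, prefix, keywords = c("born", "lives in")) {
  # Each sentence is a run of non-"。" characters followed by "。".
  sentences <- str_trim(str_extract_all(x, "[^。]+。")[[1]])
  is_hit <- startsWith(sentences, prefix)
  for (k in keywords) {
    is_hit <- is_hit & str_detect(sentences, fixed(k))
  }
  hits <- sentences[is_hit]
  if (length(hits) == 0) NA_character_ else hits[1]
}

demo <- "Court of X Province。The defendant X, male, born on May, 1971, lives in X County。The plaintiff declares that he is wrong。The plaintiff X, female, born on May, 1980, lives in X County。"

first_matching_sentence(demo, "The plaintiff")
first_matching_sentence(demo, "The defendant")
```

To apply it to a whole column, the function can be called once per row, for example with vapply(oText, first_matching_sentence, character(1), prefix = "The plaintiff", USE.NAMES = FALSE) inside mutate().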

Originally...

How tight are the patterns you've shown here? Is the plaintiff always in the second sentence? Does the defendant's description always follow the plaintiff? Is punctuation always used?

Here's a method that works with this data. This method does not assume any given order, but it does assume punctuation was used.

In the regex you see 'The plaintiff' (or defendant) followed by .*?. The .* means "followed by anything", and the ? makes that match lazy, so it stops at the first position where the lookahead succeeds. The lookahead, which marks where the regex should stop without including that character in the match, is written inside (?= ). Your sentences end with 。, the full-width ideographic full stop (I'm assuming the text was translated), rather than an ASCII period.

If your real data uses periods or another regex special character, you'll have to escape it. In this regex the period followed by the asterisk stands for "...and anything else...", so to match a literal period or asterisk you have to escape it (\\. or \\* in an R string) so that the regex engine treats the character literally.

library(tidyverse)
library(stringi)

tdf <- data.frame(oText = text) %>% 
  # .*? is lazy; the lookahead (?=。) stops before the first 。 without capturing it
  mutate(plaintiff = stri_extract_first_regex(oText, 'The plaintiff.*?(?=。)'),
         defendant = stri_extract_first_regex(oText, 'The defendant.*?(?=。)'),
         law = str_detect(oText, 'marriage law'))
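
To make the lazy-versus-greedy and escaping points concrete, here is a small sketch on made-up strings (demo and ascii_demo are not from the question):

```r
library(stringi)

demo <- "The plaintiff A。The plaintiff B。"

# Greedy: .* runs to the last position where 。 follows.
greedy <- stri_extract_first_regex(demo, "The plaintiff.*(?=。)")
# Lazy: .*? stops before the first 。.
lazy <- stri_extract_first_regex(demo, "The plaintiff.*?(?=。)")

ascii_demo <- "The plaintiff A. The plaintiff B."

# Unescaped: the . in the lookahead means "any character", so the lazy
# match stops immediately after "The plaintiff".
unescaped <- stri_extract_first_regex(ascii_demo, "The plaintiff.*?(?=.)")
# Escaped: \\. means a literal period.
escaped <- stri_extract_first_regex(ascii_demo, "The plaintiff.*?(?=\\.)")

greedy     # "The plaintiff A。The plaintiff B"
lazy       # "The plaintiff A"
unescaped  # "The plaintiff"
escaped    # "The plaintiff A"
```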

If the patterns are strict, you could probably use tidyr::separate to make this even easier.
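
A minimal sketch of that idea, assuming every text has exactly the layout court, plaintiff, defendant, everything else (the data frame strict and the column names are made up for illustration):

```r
library(tidyr)

strict <- data.frame(
  oText = "Court of X Province。The plaintiff X, female。The defendant X, male。It is a divorce dispute。"
)

# Split on 。 into a fixed set of columns; extra = "merge" keeps everything
# after the third separator together in the last column.
# The 。 used as a separator is dropped from the resulting pieces.
parts <- separate(
  strict, oText,
  into = c("court", "plaintiff", "defendant", "rest"),
  sep = "。", extra = "merge", remove = FALSE
)

parts$plaintiff
parts$defendant
```

This only works when the sentence positions never change, which the follow-up in the question says is not the case for the real data. Newer tidyr versions mark separate() as superseded in favour of separate_wider_delim(), but separate() is still available.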

edited Dec 22, 2022 at 15:44

answered Dec 21, 2022 at 15:41

Kat


3 Comments

Xinyan LIU Over a year ago

Thank you for your answer. However, the difficulty is that the whole text may have many sentences that start with "the plaintiff" and end with the punctuation "。". How can I extract only the first appearance of the sentence with the plaintiff's birth and residence information? The order is not fixed, but the punctuation is always used. Thanks!

Xinyan LIU Over a year ago

That is, the whole text may also have a sentence like "the plaintiff declares that he is wrong。" The pattern given in the previous answer will also extract this sentence, which I do not want.

Kat Over a year ago

I've added an update to my answer. See if this version is sufficient. If it is not, can you please provide more examples, particularly where and how it fails?
