R Text file data extraction

Question

I have around 100 text files (a number that is expected to grow) from which I need to extract data.

At the moment the text files come in two specific formats (these may change in the future), like this:

From:   sender name
Sent:   16 May 2017 15:54
To: receiver date
Subject:    Text

Task: task1
Date: 'APR-17'
Entity: '1234'
Account: '%'
Branch: '%'
CostCenter: '%'
Product: '%'
InterCo: '%'

or

From:   sender name
Sent:   16 May 2017 15:54
To: receiver date
Subject:    Text

Task: task2
Date: APR-17
Entity: ename

What is the best way to extract this data in R and convert it into a structured dataset for analysis?

Is there a specific library or function I could take advantage of? Are there any examples I could get started from?

You will have to parse the file. A good start would be using regular expressions.
– Carles Mitjans, Commented May 25, 2017 at 7:35
There are several ways to go about this, as is the R manner of things... Personally I'd probably use the tm package: load all the texts as a corpus, then create a transform to apply to the corpus that extracts the data from each page, and use that output to create a data.table. tm has a ton of configuration options, so you can tailor it to your needs.
– Phi, Commented May 25, 2017 at 7:55
You should consider the tidytext package, because it fits with the tidyverse and is much easier to understand than the tm package.
– lawyeR, Commented May 25, 2017 at 10:12
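
As a rough illustration of the regular-expression route suggested in the first comment, here is a minimal base R sketch. The helper names parse_message and combine_messages are hypothetical (they do not come from any package), the sample lines are abbreviated from the question, and the sketch assumes every useful line has the form "key: value" and that each file contains at least one such line.

```r
# Hypothetical helper: turn the lines of one message into a named
# character vector (names are the keys, elements are the values).
parse_message <- function(lines) {
  # capture everything before the first ':' as the key, the rest as the value
  m <- regmatches(lines, regexec("^([^:]+):(.*)$", lines))
  m <- m[lengths(m) == 3]  # drop lines that did not match (e.g. blank lines)
  keys <- trimws(vapply(m, `[`, character(1), 2))
  values <- trimws(vapply(m, `[`, character(1), 3))
  setNames(values, keys)
}

# Hypothetical helper: combine parsed messages into one data frame,
# filling keys that a message does not have with NA.
combine_messages <- function(records) {
  cols <- unique(unlist(lapply(records, names)))
  mat <- do.call(rbind, lapply(records, function(r) unname(r[cols])))
  colnames(mat) <- cols
  as.data.frame(mat, stringsAsFactors = FALSE)
}

# Abbreviated versions of the two sample formats from the question.
# With real files, something like lapply(filenames, readLines) would
# supply these vectors instead.
msg1 <- c("From:   sender name", "Sent:   16 May 2017 15:54", "",
          "Task: task1", "Date: 'APR-17'", "Entity: '1234'", "Account: '%'")
msg2 <- c("From:   sender name", "",
          "Task: task2", "Date: APR-17", "Entity: ename")

result <- combine_messages(lapply(list(msg1, msg2), parse_message))
print(result)
```

Because each message is parsed on its own and the columns are aligned only at the end, a new format with extra or missing keys should simply add columns or NA values rather than break the parse.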

Andrew Gustar · Accepted Answer · 2017-05-25 08:36:42Z


I would do something like this. You might need to modify it depending on your data.

library(stringr) # for splitting and trimming raw data
library(tidyr)   # for converting to wide format

# read files into a list of character vectors
# (assuming filenames is a vector of names of your text files)
datalist <- lapply(filenames, readLines)

# convert each element of the list into a data frame
datalist <- lapply(seq_along(datalist), function(i) data.frame(
                          caseno = i, # to identify the source file of each line
                          rawdata = datalist[[i]],
                          stringsAsFactors = FALSE))

# bind these into a single data frame
df <- do.call(rbind, datalist)

# drop lines without a ':' (e.g. blank lines); they hold no type/entry pair
# and repeated blank lines would make spread() fail on duplicate keys
df <- df[str_detect(df$rawdata, ":"), ]

# split the rawdata at the first ':' into type and entry, and trim spaces
parts <- str_split_fixed(df$rawdata, ":", 2)
df$type <- str_trim(parts[, 1])
df$entry <- str_trim(parts[, 2])

# convert from 'long' to 'wide' format - the types become column headings
df <- df[, c("caseno", "type", "entry")]
df <- spread(df, key = type, value = entry)

df should be a single data frame with one row per file: a case number (caseno) plus one column for each entry type. It will probably need a little tidying up afterwards, and stringr will be useful for that.
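
A minimal sketch of what that tidying up might look like, assuming the wide data frame has roughly the columns shown here. The data frame below is a hand-built stand-in using values from the question's samples, not real output, and the timestamp format is inferred from the single "Sent" example.

```r
library(stringr)

# Stand-in for the wide data frame (values copied from the sample files)
df <- data.frame(
  caseno = 1:2,
  Date = c("'APR-17'", "APR-17"),
  Entity = c("'1234'", "ename"),
  Sent = c("16 May 2017 15:54", "16 May 2017 15:54"),
  Task = c("task1", "task2"),
  stringsAsFactors = FALSE
)

# strip a leading and/or trailing single quote from every character column
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], str_replace_all,
                     pattern = "^'|'$", replacement = "")

# parse the timestamp; %b matches month names in the current locale,
# so this assumes an English (or C) locale
df$Sent <- as.POSIXct(df$Sent, format = "%d %b %Y %H:%M", tz = "UTC")

print(df)
```

Entity is left as character here because the samples mix numeric-looking values ('1234') with names (ename).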

edited May 25, 2017 at 8:36

answered May 25, 2017 at 8:29

Andrew Gustar


Selrac Over a year ago

Thanks Andrew. This gives me something to start with.
