I have around 100 text files (a number that is expected to grow) from which I need to extract data.

At the moment the files come in two specific formats (though these are expected to change in the future):

From:   sender name
Sent:   16 May 2017 15:54
To: receiver date
Subject:    Text

Task: task1
Date: 'APR-17'
Entity: '1234'
Account: '%'
Branch: '%'
CostCenter: '%'
Product: '%'
InterCo: '%'

or

From:   sender name
Sent:   16 May 2017 15:54
To: receiver date
Subject:    Text

Task: task2
Date: APR-17
Entity: ename

What is the best way to extract this data in R and convert it into a structured dataset for analysis?

Is there a specific library or function I could take advantage of? Are there any examples I could get started from?

  • You will have to parse the file. A good start would be using regular expressions Commented May 25, 2017 at 7:35
  • There are several ways to go about this, as is the R manner of things... Personally I'd probably use the tm package, load all the texts as a corpus, then create a transform to apply to the corpus that extracts the data from each page, and use that output to create a data.table. tm has a ton of configuration options, so you can tailor it to your needs. Commented May 25, 2017 at 7:55
  • You should consider the tidytext package, because it fits with the tidyverse and is much easier to understand than the tm package. Commented May 25, 2017 at 10:12

1 Answer

I would do something like this. You might need to modify it depending on your data.

library(stringr) #for splitting and trimming raw data
library(tidyr) #for converting to wide format

#read files into a list of vectors (assuming filenames is a vector of names of your text files)
datalist <- lapply(filenames,readLines)

#convert each element of the list into a data frame
datalist <- lapply(seq_along(datalist),function(i) data.frame( #seq_along is safe if the list is empty
                          caseno=i, #to identify source of each line
                          rawdata=datalist[[i]],
                          stringsAsFactors = FALSE))

#bind these into a single data frame
df <- do.call(rbind,datalist)

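#keep only lines containing a ':' - blank lines would otherwise become an empty type,
#and spread() can fail if a file has more than one of them
df <- df[str_detect(df$rawdata,":"),]

#split the rawdata at the first ':' into type and entry, and trim spaces
parts <- str_split_fixed(df$rawdata,":",2)
df$type <- str_trim(parts[,1])
df$entry <- str_trim(parts[,2])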

#convert from 'long' to 'wide' format - the types become column headings
df <- df[,c("caseno","type","entry")]
df <- spread(df,key=type,value=entry)

df should be a single data frame containing a case no, and the values of each entry type as columns. It will probably need a little tidying up afterwards - stringr will be useful for that.
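
To make the idea concrete, here is a minimal, self-contained sketch of the same split-at-the-first-':' approach. It swaps stringr and tidyr for base R so that it runs without extra packages. The temporary files, their names, and the helper parse_file are made up for the demo, and the sample contents are shortened versions of the two formats in the question. It also shows one possible tidy-up step: stripping the single quotes around values such as 'APR-17'.

```r
# Hypothetical demo files written to a temporary directory
demo_dir <- tempfile("mails")
dir.create(demo_dir)
writeLines(c("From:   sender name", "Sent:   16 May 2017 15:54",
             "To: receiver date", "Subject:    Text", "",
             "Task: task1", "Date: 'APR-17'", "Entity: '1234'", "Account: '%'"),
           file.path(demo_dir, "mail1.txt"))
writeLines(c("From:   sender name", "Sent:   16 May 2017 15:54",
             "To: receiver date", "Subject:    Text", "",
             "Task: task2", "Date: APR-17", "Entity: ename"),
           file.path(demo_dir, "mail2.txt"))

filenames <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# parse_file (hypothetical helper): turn one file into a named character vector.
# Names are the text before the first ':', values are the text after it.
parse_file <- function(path) {
  lines <- readLines(path)
  lines <- lines[grepl(":", lines, fixed = TRUE)]  # skip blank or unlabelled lines
  type  <- trimws(sub(":.*$", "", lines))          # everything before the first ':'
  entry <- trimws(sub("^[^:]*:", "", lines))       # everything after the first ':'
  entry <- gsub("^'|'$", "", entry)                # strip surrounding single quotes
  setNames(entry, type)
}

records <- lapply(filenames, parse_file)

# Union of all field names, so files with different layouts share one set of columns
fields <- unique(unlist(lapply(records, names)))

# One row per file; a field that is missing from a file becomes NA
rows <- lapply(records, function(r) unname(r[fields]))
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(df) <- fields
df$caseno <- seq_along(filenames)

print(df)
```

Because the columns are built from the union of field names, a new layout with extra fields should simply add columns rather than break the import. Splitting at the first ':' only also keeps values such as 15:54 intact.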

1 Comment

Thanks Andrew. This gives me something to start with.
