I have around 100 text files (a number that is expected to grow) that I need to extract data from.
At the moment the files come in two specific formats, shown below (though this is expected to change in the future):
From: sender name
Sent: 16 May 2017 15:54
To: receiver date
Subject: Text
Task: task1
Date: 'APR-17'
Entity: '1234'
Account: '%'
Branch: '%'
CostCenter: '%'
Product: '%'
InterCo: '%'
or
From: sender name
Sent: 16 May 2017 15:54
To: receiver date
Subject: Text
Task: task2
Date: APR-17
Entity: ename
What is the best way to extract this data in R and convert it into a structured dataset for analysis?
Is there a specific library or function I could take advantage of? Are there any examples I could get started from?
Use the tm package: load all the texts as a corpus, then create a transform to apply to the corpus that extracts the data from each file, and use that output to create a data.table. tm has a ton of configuration options, so you can tailor it to your needs.
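To make the extraction step concrete, here is a minimal sketch in base R, not a definitive implementation. It assumes every field sits on its own line as "Key: value" and that values may be wrapped in single quotes, as in the samples above. The names parse_message and records_to_df are made up for illustration; the same parsing function could serve as the per-document step if the files are loaded through a corpus instead.

```r
# Parse the "Key: value" lines of one message into a named character vector.
parse_message <- function(lines) {
  lines  <- trimws(lines)
  lines  <- lines[grepl("^[A-Za-z]+:", lines)]   # keep only "Key: value" lines
  keys   <- sub(":.*$", "", lines)               # text before the first colon
  values <- trimws(sub("^[^:]*:", "", lines))    # text after the first colon
  values <- gsub("^'|'$", "", values)            # drop surrounding single quotes
  setNames(values, keys)
}

# Combine records that have different fields into one data frame,
# filling fields that a record lacks with NA.
records_to_df <- function(records) {
  cols <- unique(unlist(lapply(records, names)))
  rows <- lapply(records, function(r) {
    row <- setNames(r[cols], cols)               # absent fields become NA
    as.data.frame(as.list(row), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Two shortened messages in the formats from the question.
msg1 <- c("From: sender name", "Sent: 16 May 2017 15:54", "Task: task1",
          "Date: 'APR-17'", "Entity: '1234'", "Account: '%'")
msg2 <- c("From: sender name", "Sent: 16 May 2017 15:54", "Task: task2",
          "Date: APR-17", "Entity: ename")

dataset <- records_to_df(list(parse_message(msg1), parse_message(msg2)))
print(dataset)
```

For real files, something like lapply(list.files("messages", full.names = TRUE), function(f) parse_message(readLines(f))) would produce the list of records, where "messages" is a placeholder folder name. Because the columns are taken from whatever keys appear, a new format with extra fields just adds columns rather than breaking the parser.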