How to Extract Data from a PDF File with R

[This article was first published on R-posts.com, and kindly contributed to R-bloggers].



In this post, taken from the book R Data Mining by Andrea Cirillo, we'll be looking at how to scrape PDF files using R. It's a relatively straightforward way to look at text mining – but it can be challenging if you don't know exactly what you're doing.

Until January 15th, every single eBook and video by Packt is just $5! Start exploring some of Packt's huge range of R titles here.

You may not be aware of this, but some organizations create something called a 'customer card' for every single customer they deal with. This is quite an informal document that contains some relevant information related to the customer, such as the industry and the date of foundation. Probably the most precious information contained within these cards is the comments they write down about the customers. Let me show you one of them:
My plan was the following: get the information from these cards and analyze it to discover whether some kind of common traits emerge.

As you may already know, at the moment this information is presented in an unstructured manner; that is, we are dealing with unstructured data. Before trying to analyze this data, we will have to gather it in our analysis environment and give it some kind of structure.

Technically, what we are going to do here is called text mining, which generally refers to the activity of gaining knowledge from texts. The techniques we are going to use are the following:

  • Sentiment analysis
  • Wordclouds
  • N-gram analysis
  • Network analysis

Getting a list of documents in a folder

First of all, we need to get a list of the customer cards we received from the commercial department. I have stored all of them within the 'data' folder in my workspace. Let's use list.files() to get them:
file_vector <- list.files(path = "data")
Nice! We can inspect this by looking at the head of it, using the following command:
file_vector %>% head()
[1] "banking.xls"           "Betasoloin.pdf"        "Burl Whirl.pdf"        "BUSINESSCENTER.pdf"
[5] "Buzzmaker.pdf"         "BuzzSaw Publicity.pdf"
Uhm… not exactly what we need. I can see there are also .xls files. We can remove them using the grepl() function, which performs partial matches on strings, returning TRUE if the required pattern is found, or FALSE if not. We are going to set up the following test here: give me TRUE if you find .pdf in the filename, and FALSE if not:
grepl(".pdf", file_vector)
  [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [18]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [35]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [52]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [69]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [86]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[103]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
[120]  TRUE
As you can see, the first match results in a FALSE, since it is related to the .xls file we saw earlier. We can now filter our list of files by simply passing these matching results to the list itself. More precisely, we will slice our list, selecting only those records where our grepl() call returns TRUE:
pdf_list <- file_vector[grepl(".pdf", file_vector)]
Did you understand [grepl(".pdf", file_vector)]? It is really a way to access one or more indexes within a vector, which in our case are the indexes corresponding to ".pdf" matches, exactly the same ones we printed out before. If you now look at the list, you will see that only PDF filenames are shown in it.
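The same slicing mechanism can be seen on a small made-up vector (a sketch, not from the book; the filenames are invented). One detail worth knowing: grepl() treats its pattern as a regular expression, in which "." matches any character; passing fixed = TRUE makes the match literal:

```r
files <- c("banking.xls", "Betasoloin.pdf", "notes.txt", "Buzzmaker.pdf")
is_pdf <- grepl(".pdf", files, fixed = TRUE)  # logical vector, one entry per file
files[is_pdf]                                  # keep only the TRUE positions
# [1] "Betasoloin.pdf" "Buzzmaker.pdf"
```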

Reading PDF files into R via pdf_text()

R comes with a really useful package that handles tasks related to PDFs. This is named pdftools, and besides the pdf_text function we are going to employ here, it also contains other relevant functions that are used to get different kinds of information related to the PDF file into R. For our purposes, it will be enough to get all of the textual information contained within each of the PDF files. First of all, let's try this on a single document; we will try to scale it later to the whole set of documents. The only required argument to make pdf_text work is the path to the document. The object resulting from this call will be a character vector of length 1:
pdf_text("data/BUSINESSCENTER.pdf")
[1] "BUSINESSCENTER business profile\nInformation below are provided under non disclosure agreement. date of enquery: 12.05.2017\ndate of foundation: 1993-05-18\nindustry: Non-profit\nshare holders: Helene Wurm ; Meryl Savant ; Sydney Wadley\ncomments\nThis is one of our worst customer. It really often miss payments even if for just a couple of days. We have\nproblems finding useful contact persons. The only person we have had occasion to deal with was the\nfinancial expert, since all other relevant person denied any kind of contact.\n                                                       1\n"
If you compare this with the original PDF document, you can easily see that all of the information is available, even if it is definitely not ready to be analyzed. What do you think is the next step needed to make our data more useful? We first need to split our string into lines in order to give our information a structure that is closer to the original one, that is, made of paragraphs. To split our string into separate records, we can use the strsplit() function, which is a base R function. It requires as arguments the string to be split and the token that decides where the string has to be split. If you now look at the string, you'll notice that where we found the end of a line in the original document, for instance after the words business profile, we now find the \n token. This is commonly employed in text formats to mark the end of a line. We will therefore use this token as the split argument:
pdf_text("data/BUSINESSCENTER.pdf") %>% strsplit(split = "\n")
[[1]]
 [1] "BUSINESSCENTER business profile"
 [2] "Information below are provided under non disclosure agreement. date of enquery: 12.05.2017"
 [3] "date of foundation: 1993-05-18"
 [4] "industry: Non-profit"
 [5] "share holders: Helene Wurm ; Meryl Savant ; Sydney Wadley"
 [6] "comments"
 [7] "This is one of our worst customer. It really often miss payments even if for just a couple of days. We have"
 [8] "problems finding useful contact persons. The only person we have had occasion to deal with was the"
 [9] "financial expert, since all other relevant person denied any kind of contact."
[10] "                                                       1"
strsplit() returns a list with an element for each element of the character vector passed as an argument; within each list element, there is a vector with the split string. Isn't that better? I definitely think it is. The last thing we need to do before actually doing text mining on our data is to apply those treatments to all of the PDF files and gather the results into a conveniently arranged data frame.
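Because strsplit() always wraps its result in a list, even when a single document is passed, the vector of lines can be pulled out with [[1]]; a quick sketch reusing the call above (the document_lines name is just an illustration):

```r
library(pdftools)
library(magrittr)

document_lines <- pdf_text("data/BUSINESSCENTER.pdf") %>%
  strsplit(split = "\n") %>%
  .[[1]]               # first (and only) list element: one string per line
```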

Iteratively extracting text from a set of documents with a for loop

What we want to do here is run through the list of files and, for each filename found there, run the pdf_text() function and then the strsplit() function to get an object similar to the one we have seen with our test. A convenient way to do this is by employing a 'for' loop. These structures basically do this to their content: repeat this instruction n times and then stop. Let me show you a typical 'for' loop:
for (i in 1:3){
  print(i)
}
If we run this, we obtain the following:
[1] 1
[1] 2
[1] 3
This means that the loop runs three times and therefore repeats the instructions included within the brackets three times. What is the only thing that seems to change every time? It is the value of i. This variable, which is usually called the counter, is basically what the loop employs to understand when to stop iterating. When the loop execution starts, the loop starts increasing the value of the counter, going from 1 to 3 in our example. The for loop repeats the instructions between the brackets for each element of the vector following the in clause in the for command. At each step, the variable before in (i in this case) takes one value of the sequence from the vector itself. The counter is also useful within the loop itself, and it is usually employed to iterate within an object in which some kind of manipulation is desired. Take, for instance, a vector defined like this:
vector <- c(1,ii,three)
Imagine we want to increase the value of every element of the vector by 1. We can do this by employing a loop such as this:
for (i in 1:3){
  vector[i] <- vector[i] + 1
}
If you look closely at the loop, you'll realize that the instruction needs to access the element of the vector with an index equal to i and increase this value by 1. The counter here is useful because it allows iteration over all vector elements from 1 to 3. Be aware that this is actually not a best practice, because loops tend to be quite computationally expensive, and they should be employed when no other valid alternative is available. For example, we can obtain the same result here by working directly on the whole vector, as follows:
vector_increased <- vector + 1
If you are interested in the topic of avoiding loops where they are not necessary, I can share with you some relevant material on this. For our purposes, we are going to use a loop to go through the pdf_list object, and apply the pdf_text function and subsequently the strsplit() function to each element of this list:
corpus_raw <- data.frame("company" = c(), "text" = c())

for (i in 1:length(pdf_list)){
  print(i)
  pdf_text(paste("data/", pdf_list[i], sep = "")) %>%
    strsplit("\n") -> document_text
  data.frame("company" = gsub(x = pdf_list[i], pattern = ".pdf", replacement = ""),
             "text" = document_text, stringsAsFactors = FALSE) -> document
  colnames(document) <- c("company", "text")
  corpus_raw <- rbind(corpus_raw, document)
}
Let's get closer to the loop: we first have a call to the pdf_text function, passing an element of pdf_list as an argument; it is referenced via its position i in the list. Once we have done this, we can move on to applying the strsplit() function to the resulting string. We then define the document object, which contains two columns, in this way:
  • company, which stores the name of the PDF without the .pdf token; this is the name of the company
  • text, which stores the text resulting from the extraction
This document object is then appended to the corpus_raw object, which we created previously, to store all of the text within the PDFs. Let's have a look at the resulting data frame:
corpus_raw %>% head()
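As an aside (this is not the book's code), the same corpus_raw object can be built without an explicit loop; a sketch using purrr's map_dfr, which row-binds the per-file data frames for us:

```r
library(pdftools)
library(dplyr)
library(purrr)

corpus_raw <- map_dfr(pdf_list, function(f) {
  tibble(
    company = gsub(".pdf", "", f, fixed = TRUE),
    text    = unlist(strsplit(pdf_text(file.path("data", f)), "\n"))
  )
})
```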

This is a well-structured object, ready for some text mining. However, if we look closely at our PDF customer cards, we can see that there are three different kinds of information, and they should be handled differently:
  • Repeated information, such as the confidentiality disclosure on the second line and the date of inquiry (12.05.2017)
  • Structured attributes, for instance, date of foundation or industry
  • Strictly unstructured information, which is in the comments paragraph
We are going to address these three kinds of data differently, removing the first group of irrelevant information; we therefore have to split our data frame appropriately into two smaller data frames. To do so, we will leverage the grepl() function once again, looking for the following tokens:
  • 12.05.2017: This denotes the line showing the non-disclosure agreement and the date of inquiry.
  • business profile: This denotes the title of the document, containing the name of the company. We already have this information stored within the company column.
  • comments: This is the name of the last paragraph.
  • 1: This represents the number of the page and is always the same on every card.
We can apply the filter function to our corpus_raw object here as follows:
corpus_raw %>%
  filter(!grepl("12.05.2017", text)) %>%
  filter(!grepl("business profile", text)) %>%
  filter(!grepl("comments", text)) %>%
  filter(!grepl("1", text)) -> corpus
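One caveat worth flagging as an aside: grepl() interprets these patterns as regular expressions, so the "." in "12.05.2017" matches any character, and "1" matches any line containing the digit 1, which could also catch dates such as 1993-05-18. A more defensive sketch of the same filter (not the book's code) uses literal matching and anchors the page number:

```r
library(dplyr)

corpus_raw %>%
  filter(!grepl("12.05.2017", text, fixed = TRUE)) %>%
  filter(!grepl("business profile", text, fixed = TRUE)) %>%
  filter(!grepl("comments", text, fixed = TRUE)) %>%
  filter(!grepl("^\\s*1\\s*$", text)) -> corpus   # drop only a lone "1" on a line
```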
Now that we have removed those useless lines, we can actually split the data frame into two sub-data frames based on what is returned by the grepl function when searching for the following tokens, which point to the structured attributes we discussed previously:
  • date of foundation
  • industry
  • share holders

We are going to create two different data frames here; one is called 'data' and the other is called 'comments':
corpus %>%
  filter(!grepl("date of foundation", text)) %>%
  filter(!grepl("industry", text)) %>%
  filter(!grepl("share holders", text)) -> comments

corpus %>%
  filter(grepl("date of foundation", text) |
         grepl("industry", text) |
         grepl("share holders", text)) -> data
As you can see, the two data treatments are nearly the opposite of each other, since the first looks for lines showing none of the three tokens while the other looks for records showing at least one of the tokens. Let's inspect the results by employing the head function:
data %>% head()
comments %>% head()
Great! We are almost done. We are now going to start analyzing the comments data frame, which reports all comments from our colleagues. The very last step needed to make this data frame ready for subsequent analyses is to tokenize it, which basically means separating it into different rows for all the words available within the text column. To do this, we are going to leverage the unnest_tokens function, which basically splits each line into words and produces a new row for each word, having taken care to repeat the corresponding value within the other columns of the original data frame. This function comes from the recent tidytext package by Julia Silge and David Robinson, which provides an organic framework of utilities for text mining tasks. It follows the tidyverse framework, which you should already know about if you are using the dplyr package. Let's see how to apply the unnest_tokens function to our data:
comments %>%
  unnest_tokens(word, text) -> comments_tidy
If we now look at the resulting data frame, we can see the following:

As you can see, we now have each word separated into a single record.
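With one word per row, a first exploration is a one-liner. As a small aside beyond this excerpt, the most frequent words can be counted after removing common English stop words (the stop_words data frame ships with tidytext):

```r
library(dplyr)
library(tidytext)

comments_tidy %>%
  anti_join(stop_words, by = "word") %>%  # drop "the", "and", other stop words
  count(word, sort = TRUE) %>%            # frequency table, most common first
  head(10)
```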

Thanks for reading! We hope you enjoyed this extract taken from R Data Mining.

Explore $5 R eBooks and videos from Packt until January 15th 2018.


Source: https://www.r-bloggers.com/2018/01/how-to-extract-data-from-a-pdf-file-with-r/
