Wordcloud

In this small project I wanted to generate a wordcloud from the keywords of my scientific publications, to visualize all my scientific work in one figure.

[Figure: wordcloud]

This project is divided into three parts:

  • Retrieve the complete metadata of all my scientific publications
  • Generate the wordcloud
  • Personalize the wordcloud

Retrieve metadata from scientific literature

I retrieved the metadata of all my scientific publications to date by searching for my name in the author field of Web of Science. The search results were exported in BibTeX format and saved as the file savedrecs.bib.

I have used the R package bibliometrix to access the metadata in the BibTeX file. To do that, first we load the necessary library:

library(bibliometrix)

Then we use the function convert2df to convert the BibTeX file into a data frame. We set the argument dbsource = 'wos' because the metadata was downloaded from Web of Science (WoS), and format = "bibtex" because the file was exported in that format.

D <- convert2df("Your-path-to-the-file-location/savedrecs.bib", dbsource = 'wos', format = "bibtex")

According to the vignette of the bibliometrix R package, which describes the structure of the data frame we just created, the keywords that the SCOPUS or ISI database associates with each publication are stored in the column D$ID. This column is the one of interest for this project.

Now we are ready to extract the content of the column D$ID. We create a new object keywords that holds all our keywords and save it as a .txt file. We set row.names = FALSE, col.names = FALSE and quote = FALSE so that the file contains only the keyword text, without row names, a header line or quotation marks.

keywords <- D$ID
write.table(keywords, "keywords.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

The file we just created is a text file with one line per publication. Each line holds the keywords of the corresponding publication.
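
Before moving on, it can help to look at how often each complete keyword phrase occurs. The following is a minimal base R sketch, assuming the keywords within a line are separated by ";". The vector keyword_lines is a hypothetical stand-in for the real file, not actual data.

```r
# Hypothetical stand-in for the contents of keywords.txt (one line per publication).
keyword_lines <- c(
  "SOIL MOISTURE; CLIMATE CHANGE; REMOTE SENSING",
  "CLIMATE CHANGE; DROUGHT",
  "REMOTE SENSING; CLIMATE CHANGE"
)

# Split each line on ";" (assumed separator) and trim the surrounding spaces.
phrases <- trimws(unlist(strsplit(keyword_lines, ";", fixed = TRUE)))

# Count how often each complete keyword phrase occurs, most frequent first.
phrase_counts <- sort(table(phrases), decreasing = TRUE)
print(phrase_counts)
```

Note that this counts whole phrases such as "CLIMATE CHANGE", whereas the text mining in the next section counts single words.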

At this point we are ready to generate our word cloud!

Generate the wordcloud

To generate the wordcloud we will need to do some text mining and highlight the most frequently used keywords in our text file.

First we need to load some libraries:

library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

Then we read the text in our keywords.txt file and load it as a corpus.

keyword_text <- readLines("keywords.txt")
docs <- Corpus(VectorSource(keyword_text))

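Afterwards we clean the text by converting it to lower case and removing numbers, punctuation, common English stopwords and superfluous whitespace. The lowercasing comes first because stopwords("english") is a lower-case list.

docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)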
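
One detail worth checking is letter case: stopwords("english") is all lower case, so removeWords has no effect on upper-case text unless it is lowercased first. The sketch below applies the tm cleaning functions directly to a single string to make this visible. The string x is a hypothetical keyword line, not real data.

```r
library(tm)

# Hypothetical keyword line; real lines come from keywords.txt.
x <- "EFFECTS OF CLIMATE-CHANGE; SOIL AND WATER 2020"

# Without lowercasing, the upper-case stopwords are left in place.
not_lowered <- removeWords(x, stopwords("english"))

# Lowercase first, then apply the same cleaning functions to the plain string.
cleaned <- tolower(x)
cleaned <- removeNumbers(cleaned)
cleaned <- removePunctuation(cleaned)
cleaned <- removeWords(cleaned, stopwords("english"))
cleaned <- trimws(stripWhitespace(cleaned))

print(not_lowered)
print(cleaned)
```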

Now we build a term-document matrix, a table of how often each word occurs in each document. Summing its rows gives the overall frequency of every word.

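# Rows are terms, columns are documents
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
# Total frequency of each term, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
# Show the ten most frequent words
head(d, 10)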
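
To see what such a matrix holds, here is a small base R sketch that builds one by hand and applies the same row-sum and sort steps as above. It does not use tm, and toy_docs is a hypothetical set of already cleaned documents, so treat it as an illustration of the idea only.

```r
# Hypothetical cleaned documents (one per publication); not real data.
toy_docs <- c("climate soil water", "climate drought", "soil climate")

# Tokenise on spaces and record which document each term came from.
tokens <- strsplit(toy_docs, " ", fixed = TRUE)
term <- unlist(tokens)
doc <- rep(seq_along(tokens), lengths(tokens))

# Rows are terms, columns are documents, cells are counts.
toy_m <- unclass(table(term, doc))

# Same two steps as above: sum each row, then sort by frequency.
toy_v <- sort(rowSums(toy_m), decreasing = TRUE)
toy_d <- data.frame(word = names(toy_v), freq = toy_v)
print(toy_d)
```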

Finally, we can generate our wordcloud:

set.seed(1234)  # make the word placement reproducible
jpeg("wordcloud.jpg", width = 7, height = 7, units = "in", res = 150)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
dev.off()

[Figure: the generated wordcloud]

Personalize the wordcloud

We can keep all the words horizontal to improve readability. The argument rot.per is the proportion of words drawn with a 90 degree rotation, so setting rot.per = 0 leaves every word unrotated, positioned the way text normally appears.

set.seed(1234)  # make the word placement reproducible
jpeg("wordcloud.jpg", width = 7, height = 7, units = "in", res = 150)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0,
          colors = brewer.pal(8, "Dark2"))
dev.off()

[Figure: the wordcloud with all words horizontal]
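
Other arguments of the wordcloud function can be adjusted in the same way. The following is only a sketch of a few of them. The data frame toy_d, the output file name and the chosen values are hypothetical, picked for illustration rather than taken from this project.

```r
library(wordcloud)
library(RColorBrewer)

# Hypothetical frequency table standing in for the data frame d.
toy_d <- data.frame(
  word = c("climate", "soil", "water", "drought", "carbon", "forest"),
  freq = c(12, 9, 7, 4, 3, 1)
)

# Write the figure to a temporary PNG file (hypothetical file name).
out_file <- file.path(tempdir(), "wordcloud_custom.png")
set.seed(1234)
png(out_file, width = 7, height = 7, units = "in", res = 150)
wordcloud(words = toy_d$word, freq = toy_d$freq,
          min.freq = 2,          # skip words with a frequency below 2
          scale = c(5, 1),       # range of the word sizes
          random.order = FALSE,  # plot words in decreasing frequency
          rot.per = 0.2,         # proportion of words rotated by 90 degrees
          colors = brewer.pal(9, "Blues")[4:9])  # drop the palest blues
dev.off()
```

Changing the seed in set.seed gives a different layout for the same settings.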

Credits

Credit for making this project feasible goes to the authors of bibliometrix and to STHDA, an incredibly well-made training website with many tutorials on data analysis and visualization using R and its packages.