Files
A comma-separated values (CSV) file is a delimited text file that generally uses a comma to separate values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by the delimiter. CSV is a common data exchange format that is widely supported by consumer, business, and scientific applications. R makes it easy to export and import data in CSV format.
Local Files
Export data to a csv file
Import data from a csv file
## X mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Remote Files
Some data providers offer data in csv format on their website. The STOXX website, a financial index provider, is one of these. Open this link for the EURO STOXX 50 Index: tab Data -> Historical Data provides some open source files for histroical prices. Clicking on EUR Price will open this link. The read.csv()
function can read this file directly from the internet.
# read.csv is very flexible. For the full list of arguments type ?read.csv
x <- read.csv('https://www.stoxx.com/document/Indices/Current/HistoricalData/h_3msx5e.txt', sep = ';')
head(x)
## Date Symbol Indexvalue X
## 1 24.02.2020 SX5E 3647.98 NA
## 2 25.02.2020 SX5E 3572.51 NA
## 3 26.02.2020 SX5E 3577.68 NA
## 4 27.02.2020 SX5E 3455.92 NA
## 5 28.02.2020 SX5E 3329.49 NA
## 6 02.03.2020 SX5E 3338.83 NA
rownames(x) <- as.Date(x[,1], format = '%d.%m.%Y') # assign rownames
x[,c(1,ncol(x))] <- NULL # drop the first and last column
head(x) # print data
## Symbol Indexvalue
## 2020-02-24 SX5E 3647.98
## 2020-02-25 SX5E 3572.51
## 2020-02-26 SX5E 3577.68
## 2020-02-27 SX5E 3455.92
## 2020-02-28 SX5E 3329.49
## 2020-03-02 SX5E 3338.83
R Packages
The ‘quantmod’ Package
The quantmod
package provides a very suitable function for downloading financial data from the web. This function is called getSymbols
. The function works with a variety of sources.
For stocks and shares, the yahoo
source is used. Symbols can be found here.
# retrieve Facebook quotes
x <- getSymbols(Symbols = 'FB', src = 'yahoo', auto.assign = FALSE)
tail(x)
## FB.Open FB.High FB.Low FB.Close FB.Volume FB.Adjusted
## 2020-05-15 205.27 211.34 204.12 210.88 19383200 210.88
## 2020-05-18 212.15 214.64 210.94 213.19 20167400 213.19
## 2020-05-19 213.27 220.49 212.83 216.88 31843200 216.88
## 2020-05-20 223.50 231.34 223.19 229.97 50162900 229.97
## 2020-05-21 234.72 237.20 231.20 231.39 47782600 231.39
## 2020-05-22 231.51 235.99 228.74 234.91 33913800 234.91
For currencies and metals, the oanda
source is used. Symbols are the instruments’ ISO codes separated by /
. ISO codes can be found here.
# retrieve the historical euro/dollar exchange rate
x <- getSymbols(Symbols = 'EUR/USD', src = 'oanda', auto.assign = FALSE)
tail(x)
## EUR.USD
## 2020-05-17 1.081962
## 2020-05-18 1.085636
## 2020-05-19 1.093088
## 2020-05-20 1.096158
## 2020-05-21 1.096402
## 2020-05-22 1.091122
For economics series, the FRED
source is used. Symbols can be found here.
# retrieve the historical Gross Domestic Product for Japan
x <- getSymbols(Symbols = 'JPNNGDP', src = 'FRED', auto.assign = FALSE)
tail(x)
## JPNNGDP
## 2018-10-01 546279.3
## 2019-01-01 552466.3
## 2019-04-01 555880.5
## 2019-07-01 558132.4
## 2019-10-01 549495.1
## 2020-01-01 545214.6
RESTful APIs
An Application Program Interface (API) is basically a messenger that takes a request, tells a system what you want to do and then returns the response back to you. A RESTful API is an API that uses HTTP requests to GET, PUT, POST and DELETE data. The httr
R package is a useful tool for working with HTTP. Each API has its very specific usage and documentation.
CRAN downloads
The API of the CRAN downloads database. Documentation available here
Example. Which was the most downloaded package of the last month?
baseurl <- 'https://cranlogs.r-pkg.org/' # API base url. See documentation
endpoint <- 'top/' # API endpoint. See documentation
period <- 'last-month/' # API parameter. See documentation
count <- 1 # API parameter. See documentation
url <- paste0(baseurl, endpoint, period, count) # build full url
x <- GET(url) # retrieve url
data <- content(x) # extract data
data # print data
## $start
## [1] "2020-04-22T00:00:00.000Z"
##
## $end
## [1] "2020-05-21T00:00:00.000Z"
##
## $downloads
## $downloads[[1]]
## $downloads[[1]]$package
## [1] "magrittr"
##
## $downloads[[1]]$downloads
## [1] "3306920"
The most downloaded package between 2020-04-22 and 2020-05-21 was magrittr with a total of 3306920 downloads.
KuCoin API
The API of KuCoin, cryptocurrency exchange. Documentation available here
Example. Retrieve and plot Bitcoin price every minute in the last 24 hours.
# set GMT timezone. See documentation
Sys.setenv(TZ='GMT')
# API base url. See documentation
baseurl <- 'https://api.kucoin.com'
# API endpoint. See documentation
endpoint <- '/api/v1/market/candles'
# today and yesterday in seconds
today <- as.integer(as.numeric(Sys.time()))
yesterday <- today - 24*60*60
# API parameters. See documentation
param <- c(symbol = 'BTC-USDT', type = '1min', startAt = yesterday, endAt = today)
# build full url. See documentation
url <- paste0(baseurl, endpoint, '?', paste(names(param), param, sep = '=', collapse = '&'))
# retrieve url
x <- GET(url)
# extract data
x <- content(x)
data <- x$data
# formatting
data <- sapply(1:length(data), function(i) {
# extract single candle
candle <- as.numeric(data[[i]])
# formatting. See documentation
return( c(time = candle[1], open = candle[2], close = candle[3], high = candle[4], low = candle[5]) )
})
# convert to xts
datetime <- as.POSIXct(data[1,], origin = '1970-01-01')
data <- xts(t(data[-1,]), order.by = datetime)
# plot closing values
plot(data$close, main = 'Bitcoin price in dollars')
Web Scraping
Web scraping is a technique for converting the data present in unstructured format (HTML tags) over the web to the structured format which can easily be accessed and used. The rvest
package is a useful tool to scrape information from web pages.
Example. Write a function to retrieve articles from Google Scholar given a generic query string q
.
getArticles <- function(q){
# build url
url <- paste0('https://scholar.google.com/scholar?hl=en&q=', q)
# sanitize url
url <- URLencode(url)
# get results
res <- read_html(url) %>% # get url
html_nodes('div.gs_ri h3 a') %>% # select titles by css selector
html_text() # extract text
# return results
return(res)
}
## [1] "Automated data collection with R: A practical guide to web scraping and text mining"
## [2] "Web Scraping With R"
## [3] "RCrawler: An R package for parallel web crawling and scraping"
## [4] "Web scraping with Python: Collecting more data from the modern web"
## [5] "Web scraping and Naïve Bayes classification for job search engine"
## [6] "Web scraping with Python"
## [7] "A primer on theory-driven web scraping: Automatic extraction of big data from the Internet for use in psychological research."
## [8] "The use of web-scraping software in searching for grey literature"
## [9] "R in Action"
## [10] "Web scraping techniques to collect data on consumer electronics and airfares for Italian HICP compilation"