Similar to common currencies, cryptocurrencies can be used to buy products and services. However, unlike other currencies, cryptocurrencies are digital and use cryptography to secure online transactions. Much of the interest in these unregulated currencies lies in trading them for profit, even though cryptocurrencies can also be used for regular purchases. Speculators have at times driven the prices of cryptocurrencies skyward.
In this tutorial, we will cover the entire data science pipeline: data curation, parsing, and management; exploratory data analysis; and hypothesis testing and machine learning. We will use R to carry out various techniques for predicting financial market movements, specifically the volume and high price of trades in different cryptocurrencies.
Cryptocurrency is a form of payment that can be exchanged online for products and services. Many companies have issued their own cryptocurrencies, often called tokens, which can be used specifically to purchase the products and services provided by the company. Cryptocurrency can be considered similar to arcade tokens or casino chips, which need to be purchased or exchanged with real currency.
Cryptocurrencies use a technology called blockchain, which is a decentralized technology spread across many computers that manages and records transactions. One of the appealing characteristics of cryptocurrencies is its security.
According to CoinMarketCap.com, a market research website, more than 2,200 different cryptocurrencies are traded publicly. The total value of all cryptocurrencies as of June 2019 was estimated to be $246 billion. Since the cryptocurrency market is still fairly young, current prices of cryptocurrencies can vary quite a bit. We hope to do some analysis on the following cryptocurrencies to see if we can predict some current market trends: Bitcoin, Litecoin, and Ethereum.
For more information on how cryptocurrencies work: https://blockgeeks.com/guides/what-is-cryptocurrency/
For more information on cryptocurrency prices: https://cointelegraph.com/explained/how-cryptocurrency-prices-work-explained
Created in 2009 by Satoshi Nakamoto, Bitcoin was the first widely adopted cryptocurrency. Bitcoin not only uses peer-to-peer technology to operate with no central authority or banks, but also collectively manages transactions and the issuance of bitcoins. Bitcoin is open-source; its design is public, nobody owns or controls it, and everyone can take part. Nowadays, Bitcoin has become synonymous with cryptocurrency; however, it is not the only type of cryptocurrency available. (https://medium.com/decryptionary/what-is-bitcoin-for-dummies-a-guide-for-beginners-8b3d9c0a8065)
Litecoin is one of the most prominent alternatives to Bitcoin and works upon the same fundamental principles. The transaction time of Litecoin is roughly two and a half minutes, compared to about ten minutes for Bitcoin. With four times as many Litecoins in circulation, it theoretically offers smaller divisions of coins, making smaller transaction values more feasible. Litecoin also uses a hashing algorithm known as scrypt, which is intended to keep Litecoin mining realistic for desktop users. Bitcoin uses the standard SHA256d algorithm, which becomes more time- and power-intensive as time goes on. (https://www.forbes.com/sites/quora/2018/02/08/what-is-litecoin/#6b7bf4c333f7) (https://www.wired.com/2013/08/litecoin/)
Ethereum is a cryptocurrency that took the technology behind Bitcoin and expanded its capabilities. Ethereum is a decentralized network with its own internet browser, coding language, and payment system. Ethereum utilizes a peer-to-peer approach, where nodes download the Ethereum blockchain and enforce all the rules of the system, which allows the network to stay honest and the nodes to receive rewards. Ethereum seems to be a complex mode of cryptocurrency exchange, and it will be interesting to compare it with other cryptocurrencies. (https://cointelegraph.com/ethereum-for-beginners/what-is-ethereum)
Below are the libraries we use to conduct our tutorial through the data science pipeline.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(lubridate)
library(caret)
library(broom)
library(pracma)
library(rpart)
library(rpart.plot)
library(plotROC)
We are using datasets from https://www.kaggle.com/sudalairajkumar/cryptocurrencypricehistory#bitcoin_price.csv. This dataset has multiple .csv files, one for each type of cryptocurrency. In this tutorial we will look at three of these .csv files, corresponding to three different cryptocurrencies: Bitcoin, Ethereum, and Litecoin. We chose these three in particular because they were the ones we had heard of and knew to be fairly popular. CoinMarketCap, a website that tracks the value and performance of various cryptocurrencies, ranks all three of our chosen cryptocurrencies within its top 10.
After downloading the dataset from Kaggle, we put each csv into its own table using the function read_csv.
bitcoin_tab <- read_csv("datasets/bitcoin_price.csv")
litecoin_tab <- read_csv("datasets/litecoin_price.csv")
ethereum_tab <- read_csv("datasets/ethereum_price.csv")
We then give each entity an identifier of its type, meaning every entity in the Litecoin table will have an attribute labeling it as Litecoin. This allows us to determine the type of coin once we combine them all into one table.
bitcoin_tab <- mutate(bitcoin_tab, coin_type="bitcoin")
litecoin_tab <- mutate(litecoin_tab, coin_type="litecoin")
ethereum_tab <- mutate(ethereum_tab, coin_type="ethereum")
Next, we combined all three tables into one for easier data manipulation. All three datasets had the same column names, making it easy to combine the tables. We then used a lubridate function to parse the Date column into a proper date type for comparison, and sorted by date.
crypto_tab <- bitcoin_tab %>%
rbind(litecoin_tab) %>%
rbind(ethereum_tab) %>%
mutate(Date=mdy(Date)) %>%
arrange(desc(Date))
head(crypto_tab)
## # A tibble: 6 x 8
## Date Open High Low Close Volume `Market Cap` coin_type
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 2018-02-20 11232. 11958. 11232. 11404. 9926540000 1.89536e+11 bitcoin
## 2 2018-02-20 223. 254. 223. 233. 1739670000 12335100000 litecoin
## 3 2018-02-20 944. 965. 893. 895. 2545260000 92,206,500,000 ethereum
## 4 2018-02-19 10553. 11274. 10513. 11225. 7652090000 1.78055e+11 bitcoin
## 5 2018-02-19 215. 227. 215. 223. 767597000 11907900000 litecoin
## 6 2018-02-19 922. 958. 922. 944. 2169020000 90,047,700,000 ethereum
In the end, this results in a data table with all values from three different cryptocurrency tables, where each entity can be uniquely identified by the Date and the coin_type.
Currently, our dataset has the following attributes: Date, Open, High, Low, Close, Volume, Market Cap, and coin_type.
We want to add the following new attributes for our data analysis later on: close ratio ((Close − Low)/(High − Low)), spread (High − Low), difference (Close − Open), and volume multiplier (Volume/spread).
crypto_tab = crypto_tab %>%
mutate(close_ratio=(Close-Low)/(High-Low)) %>%
mutate(spread=High-Low) %>%
mutate(vol_mult = Volume/spread) %>%
mutate(diff=Close-Open) %>%
drop_na
head(crypto_tab)
## # A tibble: 6 x 12
## Date Open High Low Close Volume `Market Cap` coin_type
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 2018-02-20 11232. 11958. 11232. 11404. 9.93e9 1.89536e+11 bitcoin
## 2 2018-02-20 223. 254. 223. 233. 1.74e9 12335100000 litecoin
## 3 2018-02-20 944. 965. 893. 895. 2.55e9 92,206,500,~ ethereum
## 4 2018-02-19 10553. 11274. 10513. 11225. 7.65e9 1.78055e+11 bitcoin
## 5 2018-02-19 215. 227. 215. 223. 7.68e8 11907900000 litecoin
## 6 2018-02-19 922. 958. 922. 944. 2.17e9 90,047,700,~ ethereum
## # ... with 4 more variables: close_ratio <dbl>, spread <dbl>, vol_mult <dbl>,
## # diff <dbl>
Now that we have our data all nice and sorted, let’s begin some exploratory data analysis! We’re going to want to analyze the trends and directions in the market values for our cryptocurrencies.
A great way to start analyzing our data is to look at the minimum and maximum values of volume and spread so we know where our bounds are. Mean and median are also important measures of central tendency which tell us about the averages and where the majority of the data lie, while variance tells us about the spread of our data.
# Bitcoin Analysis
bitcoin = crypto_tab %>%
filter(coin_type=='bitcoin')
cat(" Volume Min - ", min(bitcoin$Volume), "\n",
"Volume Max - ", max(bitcoin$Volume), "\n",
"Spread Min - ", min(bitcoin$spread), "\n",
"Spread Max - ", max(bitcoin$spread), "\n",
"Close Ratio Mean - ", mean(bitcoin$close_ratio), "\n",
"Close Ratio Median - ", median(bitcoin$close_ratio), "\n",
"Close Ratio Variance - ", var(bitcoin$close_ratio))
## Volume Min - 2857830
## Volume Max - 23840900000
## Spread Min - 1.03
## Spread Max - 4110.4
## Close Ratio Mean - 0.5419637
## Close Ratio Median - 0.5685072
## Close Ratio Variance - 0.08920444
# Ethereum Analysis
ethereum = crypto_tab %>%
filter(coin_type=='ethereum')
cat(" Volume Min - ", min(ethereum$Volume), "\n",
"Volume Max - ", max(ethereum$Volume), "\n",
"Spread Min - ", min(ethereum$spread), "\n",
"Spread Max - ", max(ethereum$spread), "\n",
"Close Ratio Mean - ", mean(ethereum$close_ratio), "\n",
"Close Ratio Median - ", median(ethereum$close_ratio), "\n",
"Close Ratio Variance - ", var(ethereum$close_ratio))
## Volume Min - 102128
## Volume Max - 9214950000
## Spread Min - 0.018745
## Spread Max - 417.09
## Close Ratio Mean - 0.5254825
## Close Ratio Median - 0.5183181
## Close Ratio Variance - 0.08526698
# Litecoin Analysis
litecoin = crypto_tab %>%
filter(coin_type=='litecoin')
cat(" Volume Min - ", min(litecoin$Volume), "\n",
"Volume Max - ", max(litecoin$Volume), "\n",
"Spread Min - ", min(litecoin$spread), "\n",
"Spread Max - ", max(litecoin$spread), "\n",
"Close Ratio Mean - ", mean(litecoin$close_ratio), "\n",
"Close Ratio Median - ", median(litecoin$close_ratio), "\n",
"Close Ratio Variance - ", var(litecoin$close_ratio))
## Volume Min - 481714
## Volume Max - 6961680000
## Spread Min - 0.01
## Spread Max - 132.82
## Close Ratio Mean - 0.4936874
## Close Ratio Median - 0.5
## Close Ratio Variance - 0.09009817
After gathering our data, we can clearly see there is a large range in volume and spread. The close ratio seems to have little variance, meaning the data points are close to the mean. However, this is still a little difficult to interpret so let’s look at some visualizations.
crypto_tab %>%
ggplot(aes(x=Date, y=Open, color=coin_type)) +
geom_line() +
labs(title = "Opening Price over Time",
x = "Date",
y = "Open")
Here, we see that the opening prices of cryptocurrencies increase over time, which makes sense due to rising popularity. Bitcoin appears to have a significantly higher opening price compared to the other cryptocurrencies (altcoins). We also notice that in late 2017 the opening price for Bitcoin dropped significantly, with a slight revival in early 2018. The overall trend is positive and roughly linear for Ethereum and Litecoin.
crypto_tab %>%
ggplot(aes(x=Date, y=Close, color=coin_type)) +
geom_line() +
labs(title = "Closing Price over Time",
x = "Date",
y = "Closing")
Looking at the closing prices, we see roughly the same trends as in the opening prices.
Now, we are going to plot the difference between opening and closing prices to see the best and worst days for each cryptocurrency.
crypto_tab %>%
group_by(coin_type) %>%
arrange(desc(diff)) %>%
ggplot(aes(x=Date, y=diff, color=coin_type)) +
geom_line() +
labs(title = "Difference in Closing Price vs Opening Price over Time",
x = "Date",
y = "Difference")
Based on the chart, the end of 2017 saw a huge boom, with the best daily differences occurring in December 2017. In early 2018, however, all three cryptocurrencies experienced their worst daily losses, marking the 2018 cryptocurrency crash, which occurred from January 6 to February 6, 2018. In this period Bitcoin fell by about 65%, after an unprecedented boom in 2017. (en.wikipedia.org/wiki/2018_cryptocurrency_crash)
Next, let’s look at spread, which is the difference between the high and low for that date. This will tell us more about the volatility of each cryptocurrency.
crypto_tab %>%
group_by(coin_type) %>%
arrange(desc(spread)) %>%
ggplot(aes(x=Date, y=spread, color=coin_type)) +
geom_line() +
labs(title = "Spread over Time",
x = "Date",
y = "Spread")
Now, we are going to look at close ratio over time.
crypto_tab %>%
ggplot(aes(x=Date, y=close_ratio, color=coin_type)) +
geom_point() +
geom_smooth(method='lm') +
labs(title = "Close Ratio over Time",
x = "Date",
y = "Close Ratio")
Looking at this plot, the close ratio seems to have a slight positive linear trend over time, but this doesn’t really tell us much.
Let’s try looking at volume next!
crypto_tab %>%
ggplot(aes(x=Date, y=Volume, color=coin_type)) +
geom_line() +
labs(title = "Volume over Time",
x = "Date",
y = "Volume")
Volume is heavily concentrated in the later dates, which makes sense since trading of cryptocurrencies has increased over time. This also tells us that from 2014 to early 2016 there was little trading compared to post-2017.
From now on, we will filter our data to be post-2017, since a major cryptocurrency exchange went bankrupt in 2014 and the market was still being reestablished through late 2016.
crypto_tab = crypto_tab %>%
filter(Date >= "2017-01-01")
head(crypto_tab)
## # A tibble: 6 x 12
## Date Open High Low Close Volume `Market Cap` coin_type
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 2018-02-20 11232. 11958. 11232. 11404. 9.93e9 1.89536e+11 bitcoin
## 2 2018-02-20 223. 254. 223. 233. 1.74e9 12335100000 litecoin
## 3 2018-02-20 944. 965. 893. 895. 2.55e9 92,206,500,~ ethereum
## 4 2018-02-19 10553. 11274. 10513. 11225. 7.65e9 1.78055e+11 bitcoin
## 5 2018-02-19 215. 227. 215. 223. 7.68e8 11907900000 litecoin
## 6 2018-02-19 922. 958. 922. 944. 2.17e9 90,047,700,~ ethereum
## # ... with 4 more variables: close_ratio <dbl>, spread <dbl>, vol_mult <dbl>,
## # diff <dbl>
Let’s take another look at volume and spread.
crypto_tab %>%
ggplot(aes(x=Date, y=Volume, color=coin_type)) +
geom_line() +
geom_smooth() +
labs(title = "Volume over Time",
x = "Date",
y = "Volume")
crypto_tab %>%
group_by(coin_type) %>%
arrange(desc(spread)) %>%
ggplot(aes(x=Date, y=spread, color=coin_type)) +
geom_line() +
geom_smooth() +
labs(title = "Spread over Time",
x = "Date",
y = "Spread")
Looking at these two attributes, volume and spread, we can see they both have a positive, roughly linear increasing trend. Let’s see if spread and volume have a linear relationship with each other as well.
crypto_tab %>%
group_by(coin_type) %>%
ggplot(aes(x=spread, y=Volume, color=coin_type)) +
geom_line() +
geom_smooth(method='lm') +
labs(title = "Spread vs. Volume",
x = "Spread",
y = "Volume")
This leads us to the null hypothesis: there is no relationship between spread and volume. Our alternative hypothesis is that there is a linear relationship between spread and volume.
reg_model <- lm(Volume~spread, data=crypto_tab)
reg_model_stats <- reg_model %>%
tidy()
reg_model_stats
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 608325080. 45738280. 13.3 7.69e-38
## 2 spread 6841360. 100895. 67.8 0.
We reject the null hypothesis since the p-value (0.00) is less than our alpha of .05. The p-value is the probability of observing our sample results given that the null hypothesis (no relationship between volume and spread) is true.
Now, we are going to take a look at the volume multiplier.
crypto_tab %>%
group_by(coin_type) %>%
ggplot(aes(x=spread, y=vol_mult, color=coin_type)) +
geom_point() +
geom_smooth(method='lm') +
labs(title = "Spread vs. Volume Multiplier",
x = "Spread",
y = "Volume Multiplier")
Looking at this graph, our hypothesis that we can predict volume based on spread appears to hold. Since the volume multiplier is roughly constant, we can use that number to predict volume from spread, but this only seems to work for Bitcoin.
We start our exploration into machine learning with Decision Trees. Decision trees are quite simple: you traverse the tree starting at the top, making a decision at each fork based on a boolean statement. If your parameters return true at a fork, you traverse down the left branch; otherwise, you go down the right. You continue this pattern of decisions until you reach a leaf, which holds the predicted value.
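The traversal described above can be sketched by hand. This toy function uses hypothetical splits and leaf values (not taken from any fitted model) just to show how each fork routes an observation down to a leaf:

```r
# Toy illustration of traversing a decision tree by hand. Each if/else is
# one fork: TRUE sends us down the left branch, FALSE down the right, until
# we reach a leaf holding a predicted High value. The thresholds and leaf
# values here are made up for illustration only.
predict_high_toy <- function(coin_type, year) {
  if (coin_type %in% c("litecoin", "ethereum")) { # first fork
    if (year < 2017) {
      10    # leaf: low predicted High
    } else {
      500   # leaf
    }
  } else {                                        # bitcoin branch
    if (year < 2017) {
      600   # leaf
    } else {
      9000  # leaf
    }
  }
}

predict_high_toy("litecoin", 2016) # left, then left -> 10
predict_high_toy("bitcoin", 2018)  # right, then right -> 9000
```

A real fitted tree works the same way; rpart simply learns the forks and leaf values from the data for us.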
We will now show the decision tree in action. In our exploratory data analysis, we noticed a relationship between High and time, along with Volume and spread. We explored the relationship of volume and spread in our hypothesis test and used a linear model because the relationship was more or less linear. For the relationship between High and time, we found that the relationship would be better modeled not by a linear regression model, but instead by a decision tree model. In this example, we want to see if we can predict the High value of a cryptocurrency based on its coin type and date using a decision tree.
First, we will make and show a decision tree to give an overall idea of the general structure, then we will see if we can make a decision tree model that can accurately make predictions.
First, let’s make a tree. We start by calling rpart, which “grows” a tree given the formula of parameters we wish to use, the data upon which the tree is built, and a control function. Here, we use a built-in control function that stops computation once a cp value is reached, but this control function can be replaced with one that you write or another built-in one. The purpose of limiting the complexity is to make a model that is complex enough to be accurate, but not so complex that it only works with this dataset.
tree <- rpart(High~coin_type+Date, data = crypto_tab, control = rpart.control(cp = 0.0001))
printcp(tree)
##
## Regression tree:
## rpart(formula = High ~ coin_type + Date, data = crypto_tab, control = rpart.control(cp = 1e-04))
##
## Variables actually used in tree construction:
## [1] coin_type Date
##
## Root node error: 1.6537e+10/1248 = 13251117
##
## n= 1248
##
## CP nsplit rel error xerror xstd
## 1 0.43230594 0 1.0000000 1.000788 0.0888335
## 2 0.04608862 2 0.1353881 0.137361 0.0109043
## 3 0.02609128 3 0.0892995 0.089706 0.0095183
## 4 0.00687412 5 0.0371170 0.042734 0.0039619
## 5 0.00643944 6 0.0302428 0.038965 0.0038510
## 6 0.00526295 7 0.0238034 0.029050 0.0037583
## 7 0.00239510 8 0.0185404 0.026841 0.0034125
## 8 0.00230675 9 0.0161453 0.023069 0.0025785
## 9 0.00173961 10 0.0138386 0.021017 0.0024249
## 10 0.00161309 11 0.0120990 0.019536 0.0026873
## 11 0.00119971 12 0.0104859 0.018050 0.0027065
## 12 0.00092275 13 0.0092862 0.016755 0.0026714
## 13 0.00085547 14 0.0083634 0.016502 0.0026662
## 14 0.00033304 15 0.0075080 0.015603 0.0026578
## 15 0.00032685 16 0.0071749 0.015184 0.0026385
## 16 0.00017249 18 0.0065212 0.014469 0.0026407
## 17 0.00015466 19 0.0063487 0.014571 0.0026438
## 18 0.00011384 20 0.0061941 0.014509 0.0026437
## 19 0.00010000 21 0.0060802 0.014342 0.0026429
The chart produced shows the error and standard deviation as the number of splits increases, until we reach the cp value designated in our control function.
Next, we find the best cp, which is the cp value with the smallest cross-validated error (xerror). We then prune the tree to simplify it and remove unnecessary forks and complexity. Finally, we print our tree using prp, which prints the tree in a pretty format.
bestcp <- tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"]
tree.pruned <- prune(tree, cp = bestcp)
prp(tree.pruned, faclen = 0, cex = 0.45, extra = 1)
As you can see from this tree, the first fork splits into two branches based on whether the coin type is Litecoin or Ethereum, or not. In our chart of High over time, Litecoin’s and Ethereum’s High values are relatively close to each other compared to Bitcoin’s, so it makes sense for this to be the initial split the tree makes. Further down on the left, the tree does eventually split Ethereum and Litecoin, but only after several more splits, signifying the similarity of their High values.
Now, we will apply decision trees to a model for prediction. To accomplish this, we split our data into a partition we train the model on and a partition we test it on. We also remove unnecessary attributes that are not needed for predicting the High value from Date and coin type. After that, we create the model on our training set using the same function as in the example above, then run our test data through the model to predict the High for each entity. We calculate the squared error for each value, which is \((predicted - actual)^2\), average these squared errors, and take the square root. We judge the accuracy of the model by this root mean squared error (RMSE); we want a small RMSE relative to the range of values.
set.seed(123)
prediction_set <- crypto_tab %>%
select(High,coin_type,Date)
index <- createDataPartition(y=prediction_set$High, p=0.8, list=FALSE)
train_set <- prediction_set[index,]
test_set <- prediction_set[-index,]
decision_tree <- rpart(High~coin_type+Date, data=train_set)
predictions_decision <- predict(decision_tree, test_set)
cbind(test_set, predictions_decision) %>%
mutate(se = ((High - predictions_decision)^2)/n()) %>%
summarize(rmse = sqrt(sum(se)))
## rmse
## 1 692.2562
range(crypto_tab$High)
## [1] 3.77 20089.00
Our model has an RMSE of 692.2562. This isn’t a terrible value, but it certainly isn’t great: on average, our prediction of High is off by about 692. Given the range of 3.77 to 20089.00, this value is relatively OK, but if we were investors in cryptocurrencies, we wouldn’t want our predictions to have an error of 692, so let’s see if we can make it smaller.
Another method for creating and evaluating a model is cross-validation. In this example, we will use 10-fold cross-validation: we split our data into 10 partitions, use 9 partitions to train the model and one partition to test it, and repeat this 10 times until every partition has been used for testing. We do this to measure the predictive performance of our model.
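The partitioning described above can be sketched in a few lines of base R. This is conceptual only (caret handles it for us below), using a made-up dataset of 100 rows:

```r
# Minimal sketch of 10-fold cross-validation partitioning. We randomly
# assign each row to one of 10 folds; each fold then takes a turn as the
# held-out test set while the other 9 folds are used for training.
set.seed(123)
n <- 100                                   # pretend we have 100 rows
folds <- sample(rep(1:10, length.out = n)) # random fold assignment, 10 rows each

for (k in 1:10) {
  test_idx  <- which(folds == k)           # rows held out this round
  train_idx <- which(folds != k)           # rows used to fit the model
  # ...fit the model on train_idx, compute RMSE on test_idx...
}
# Averaging the 10 per-fold RMSE values gives the cross-validated estimate.

table(folds)                               # each of the 10 folds has 10 rows
```

caret’s trainControl/train functions perform exactly this bookkeeping for us, and additionally repeat it over a grid of tuning parameter values.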
To train a fit for the model, we use the train function from the caret package. This function first takes a formula for prediction, which in our case predicts High from coin type and date. It then takes the data we are using, along with the method used for the model; since we use a decision tree, we set the method to “rpart”. Next, the train function takes a control function, which is where the 10-fold cross-validation comes in: we use the trainControl function with the parameters method=“cv” and number=10. We also give the train function a tuneLength to try different default values for the main tuning parameter.
set.seed(123)
train_control <- trainControl(method="cv", number=10)
tree_fit <- train(High~coin_type+Date, data=crypto_tab, method="rpart", trControl=train_control, tuneLength=10)
tree_fit
## CART
##
## 1248 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1123, 1124, 1124, 1124, 1124, 1124, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.0005056511 416.0516 0.9873664 202.9845
## 0.0009227508 423.1795 0.9870263 206.9056
## 0.0011584303 427.8349 0.9867351 209.9502
## 0.0026840414 509.3695 0.9812445 241.5726
## 0.0033670391 529.7257 0.9799195 248.5183
## 0.0069974006 657.4560 0.9666881 346.5704
## 0.0078739980 682.6939 0.9646934 373.5105
## 0.0250721937 811.7775 0.9505191 426.3682
## 0.0482244202 1410.7660 0.8265895 852.8178
## 0.2568900090 2791.1213 0.7250260 1806.3586
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.0005056511.
As you can see, most of the models created in cross-validation produce an RMSE lower than our model above. An RMSE of 416.0516 is better than our previously computed 692.26, but is still not great. This shows that the decision tree model is not the most accurate model for prediction. It also shows the complexity of predicting the price of cryptocurrencies over time: date and coin type are not the only factors affecting the high value. There are many outside factors that affect the price that are not present in this dataset and probably couldn’t even be measured to put into a dataset.
Throughout this tutorial, we have gone through the entire data science pipeline exploring our cryptocurrency dataset. We found that there is a linear relationship between spread and volume (for Bitcoin), which will allow us to predict the volume of trade based on the range in trading price for the day. This tells us that depending on the volatility of the cryptocurrency market, we are able to determine the volume of trade for that day. In the future, this may be useful to predict future market positions.
Through the use of decision trees, we explored the possibility of predicting high trading prices depending on the cryptocurrency and date. While this was somewhat successful, the correlation is highly reliant on current news, which impacts cryptocurrency prices. Just last week, Bitcoin crossed $10,000 again as a result of an upcoming “halving event”. Consequently, past prices cannot predict future prices, because news events have different magnitudes of effect on prices and happen at arbitrary dates.
Possible future solutions include monitoring social media platforms for trends, predicting major news events and their magnitudes, or even using sentiment analysis on current news to predict high prices. While these techniques are not covered in this particular tutorial, we have still learned a lot about the cryptocurrency market using our data science techniques, and may one day venture into actual trading. Maybe we could even create an algorithm to do this for us.
https://www.cnbc.com/2020/05/08/bitcoin-btc-cryptocurrency-prices-rise-as-halving-approaches.html
https://towardsdatascience.com/https-towardsdatascience-com-algorithmic-trading-using-sentiment-analysis-on-news-articles-83db77966704
https://nlp.stanford.edu/courses/cs224n/2011/reports/nccohen-aatreya-jameszjj.pdf