Replicating CRSP Volatility Decile Portfolios in R

 

Introduction

In this post, I provide R code that enables the replication of the Center for Research in Security Prices (CRSP) Volatiliy Deciles using Yahoo! Finance data. This post is related to my last blog post in that it will generate the CRSP low volatility decile portfolio, thereby facilitating the replication of the associated EMA trading strategy.

There are a few caveats to this replication:

  • There will be survivorship bias since Yahoo! Finance does not contain de-listed stocks
  • This code takes a long time to run (+2 hrs on my machine, excluding downloading the symbols), and so it should not all be executed at once
  • The parallel processing libraries in this code run in linux
    • For Windows, “library(snow)”” should be used instead of “library(parallel)” and “library(doSNOW)” should be used instead of “library(doParallel)”
    • A SOCK cluster will likely need to be used instead of a FORK cluster
    • Consequently, objects, functions and libraries will have to explicitly be exported to the cores, thereby consuming more RAM
  • There are +5,000 securities to download from Yahoo! Finance, and any one of them can fail randomly, thereby yielding different results each time the code is executed
    • It is up to the individual user to decide if they want to download the failed symbols
  • The quality control [QC] for this code is not exhaustive, i.e. my goal was to quickly get data for a time-sensitive paper-replication project, thus there is potential for more QC, which I will indicate in the post

This code is related to a paper replication project for a course I took over the summer of 2016 (Advanced Trading System Design). I will be sharing the project in a subsequent post.

To run the following script, one of my packages, titled tsconv (which stands for “Time Series Convenience”) needs to be installed. The following line of code will install the package, provided that the devtools library is already installed.

devtools::install_github("erolbicero/tsconv")

Download Security Tickers

In order to determine the universe of stocks to include, a list of security tickers that trade on the NYSE and NASDAQ is required. This was located by reading the relevant post by Louis Marascio, and as per the discussion, the list of stock tickers that trade on the NYSE and AMEX was then obtained from http://www.nasdaqtrader.com/trader.aspx?id=symboldirdefs on July 8th, 2016.

The file format is a pipe-delimited text file. It contains preferred shares and ETFs, which should be excluded from the replication. Therefore, the file was imported into Microsoft Excel, and then filtered for tickers that were not ETFs, and where the “ACT Symbol” excluded the “$” character, as these were predominantly preferred shares. Lastly, the exchange needed to be part of the NYSE, or NASDAQ, and the only exchange code listed on NASDAQ which is not a part of these exchanges is “Z”, which is the label for BATS Global Markets; therefore any stock with an exchange code “Z” was excluded. After these exclusions, 6,157 ticker symbols remained in the file. The symbols from this cleaned list were exported to a Comma-Separated-Value [CSV] file.

For convenience, I am providing a clean CSV file at this link containing all the tickers I used from the July 8th, 2016 extract.

Load the Tickers and Download the Data

Next, the tickers are loaded into the global environment, and then downloaded one by one from Yahoo! Finance to avoid getting blocked by the website. It may be possible to download more symbols at a time, thereby speeding up the process, however I did not test that to ensure I was not cut off. These are stored in a separate environment titled “stockPriceEnv”. I have added code to save and load the downloaded tickers, since it is time-consuming, taking almost 2 hrs on my machine.

library(quantmod)
library(xts)
library(PerformanceAnalytics)
library(tsconv)
library(stringr)
library(foreach)
library(doParallel)
library(timeDate)

stockSymbols <- read.csv(file = paste0(getwd(),"/StockList2016-07-08.csv"), stringsAsFactors = FALSE, header = FALSE)
stockSymbols <- as.vector(stockSymbols[,1])

stockPriceEnv <- new.env()

successSymbols <- NULL
failSymbols <- NULL

#Use tryCatch since some symbols will cause the function to fail
#Also run painfully slowly so you don't get cut off from Yahoo!
startTime <- Sys.time()
for(stock in  stockSymbols)
{
  tryCatch(expr = {
              getSymbols.yahoo(Symbols = stock
                                , env = stockPriceEnv
                                , from = "1963-07-01" # July 1st, 1963
                                , to = "2016-07-11" # July 11th, 2016

                              )
              successSymbols <- c(successSymbols, stock)

              print(paste(stock, "Succeeds"))

                  }
           , error = function(err){
             print(paste(stock, "Fails"))
             failSymbols <<- c(failSymbols, stock)
           }
  )
}
endTime <- Sys.time()
endTime - startTime
# Time difference of 1.93681 hours

#list all stocks
ls(envir = stockPriceEnv)

#save the downloaded tickers
saveRDS(stockPriceEnv, file = "stockPriceEnv-2016-12-01.RData")

################################################################
# Use this commented code to load stock symbols
#stockPriceEnv <- readRDS("stockPriceEnv-2016-12-01.RData")
#failSymbols <- stockSymbols[!(stockSymbols %in% ls(envir = stockPriceEnv))]
################################################################

The following symbols failed during the download. This is important, since the failed symbols appear to change each time this code is run, which ultimately yields slightly different volatility decile performance. This means that to replicate the numbers in this post, the exact same symbols will need to fail/be present.

Failed Download Symbols

##   [1] "AA"     "AGM.A"  "AKO.A"  "AKO.B"  "APHB"   "ARP"    "ASB.W"
##   [8] "AST.W"  "BAC.A"  "BAC.B"  "BF.A"   "BF.B"   "BGX"    "BIO.B"
##  [15] "BIOA.W" "BNK"    "BRK.A"  "BRK.B"  "BTX.W"  "BWL.A"  "C.A"
##  [22] "CBO"    "CJES"   "CMA.W"  "COF.W"  "CRD.A"  "CRD.B"  "CVM.W"
##  [29] "DEG"    "DWRE"   "DXI"    "DYN.W"  "FCE.A"  "FCE.B"  "FPP.W"
##  [36] "FUR"    "GEF.B"  "GM.B"   "GSI"    "GTN.A"  "HBM.W"  "HEI.A"
##  [43] "HIVE"   "HLS.W"  "HTS"    "HVT.A"  "IBO"    "IHS"    "JW.A"
##  [50] "JW.B"   "KEG"    "KKD"    "KODK.A" "KODK.W" "LEN.B"  "LNC.W"
##  [57] "MKC.V"  "MOG.A"  "MOG.B"  "MSK"    "MTB.W"  "NPD"    "ONE"
##  [64] "PBR.A"  "PVCT.W" "QIHU"   "RDS.A"  "RDS.B"  "SBNB"   "SCQ"
##  [71] "STI.A"  "STI.B"  "STZ.B"  "TAL"    "TCB.W"  "TI.A"   "TUMI"
##  [78] "VALE.P" "VLY.W"  "WSO.B"  "ABEOW"  "ADXSW"  "AGFSW"  "ANDAR"
##  [85] "ANDAU"  "ANDAW"  "ARWAR"  "ARWAU"  "ARWAW"  "AUMAW"  "BBCN"
##  [92] "BHACR"  "BHACU"  "BHACW"  "BIND"   "BLVDW"  "BNTCW"  "BVXVW"
##  [99] "CADT"   "CADTW"  "CATYW"  "CERCW"  "CERCZ"  "CERE"   "CHEKW"
## [106] "CLACW"  "CLRBZ"  "CNYD"   "COWNL"  "COYNW"  "CPXX"   "CYHHZ"
## [113] "CYRXW"  "DRIOW"  "DRWIW"  "EACQU"  "EACQW"  "EAGLW"  "ECACR"
## [120] "ELECU"  "ELECW"  "EPRS"   "EYEGW"  "FNFG"   "FNTC"   "FNTCU"
## [127] "FNTCW"  "GFNSL"  "GGAC"   "GGACR"  "GIGA"   "GPACU"  "GPACW"
## [134] "GPIAW"  "GRSHW"  "HCACU"  "HCACW"  "HCAPL"  "HDRAR"  "HDRAU"
## [141] "HDRAW"  "HMPR"   "HNSN"   "HRMNW"  "IMOS"   "IRCP"   "JASNW"
## [148] "JSYNR"  "JSYNW"  "KLREW"  "KTOVW"  "KUTV"   "LBTYA"  "LDRH"
## [155] "LINDW"  "LMFAW"  "LPSB"   "MDVXW"  "MFLX"   "MIIIU"  "MRKT"
## [162] "NUROW"  "NXEOW"  "NXTDW"  "OACQR"  "OACQU"  "OACQW"  "OCLSW"
## [169] "ONSIW"  "ONSIZ"  "OPGNW"  "OXBRW"  "PAACR"  "PAACW"  "PACEW"
## [176] "PAVMU"  "PBIP"   "QDEL"   "QPACU"  "QPACW"  "RNVAW"  "SHLDW"
## [183] "SNFCA"  "SNHNL"  "SNI"    "SNTA"   "SOHOM"  "SQI"    "SRAQW"
## [190] "SRTSU"  "TACOW"  "TALL"   "TFSC"   "TFSCR"  "TFSCU"  "TFSCW"
## [197] "TRTLW"  "VKTXW"  "WIBC"   "WVVIP"  "WYIGW"  "ZAZZT"  "ZBZZT"
## [204] "ZCZZT"  "ZIONZ"  "ZJZZT"  "ZNWAA"  "ZVZZC"  "ZVZZT"  "ZWZZT"
## [211] "ZXYZ.A" "ZXZZT"

Quality Control

Here, there are 2 known issues I have come across while working on this code. One, is that there’s a symbol with the name “TRUE”, which not surprisingly, causes issues. The fix is to rename it to “TRUE.”.

The second issue is that the ticker OHGI appears to have “garbage” data, thus it is excluded completely from the replication.

OHGI Adjusted Closing Prices

It is possible that there are more anomalies within the data, however, from my perspective, there was a time-sensitive deliverable, and thus given that the numbers were sensible, insofar as they were comparable to results generated by the actual CRSP volatility decile indices, they were used for the replication project. Any additional investigations and QC should be performed at this stage.

#*********IMPORTANT
#clean-up known error-causing stocks
stockPriceEnv$TRUE. <-stockPriceEnv$`TRUE`
rm(`TRUE`, envir = stockPriceEnv)
rm(OHGI, envir = stockPriceEnv)

#This is the time for quality control, since it gets more challenging to fix any issues later on
#**********

This next block of code uses roughly 17 GB of RAM on my machine using a fork cluster with 8 cores, therefore, please ensure that there is adequate memory to run the code. The code checks that each stock has at least 80% of trading days populated within a year (as per CRSP’s requirements), and then calculates the annualized volatility for that stock for each year. Fortunately, the code only takes 1.5 minutes to run using parallel processing.

# Adjusted prices
# stock needs valid returns for eighty percent of trading days
# CRSP uses linear returns

#*****Warning, lots of RAM required

annualizationScaling <- 252
percentOfYear <- 0.8

startTime <- Sys.time()

cl <- makeForkCluster(nnodes = detectCores())
registerDoParallel(cl = cl)

stockVolatility <-
  foreach(stockName = ls(envir = stockPriceEnv), .errorhandling = "pass") %dopar%
  {

    stock <- get(x = stockName, envir = stockPriceEnv)

    typeIndex <- grep("Adjusted",colnames(stock))

    adjustedPrices <- stock[,typeIndex]

    #Compute linear returns
    stockLinearReturns <- retVectorL(adjustedPrices)

    #extract time index and slice into years
    yearsAvail <- unique(str_sub(string = index(stockLinearReturns), end=4))

    #check that stock has 80% of trading days
    validYearTable <-
      lapply(yearsAvail
             , function(year){
               numDaysInYear <- nrow(stockLinearReturns[year])                #need >= 80%
               validYearFlag <- ifelse(numDaysInYear >= floor(annualizationScaling*percentOfYear),1,0)
               return(c(as.numeric(year),validYearFlag))
             }
      )

    validYearTable <- do.call(rbind, validYearTable)

    #Take valid years
    validYears <- as.character(validYearTable[validYearTable[,2]==1,1])

    volatilityAnnual <-
      lapply(validYears
             , function(year){
               timeSeries <- stockLinearReturns[year]

               volatilityAnn <- sd(coredata(timeSeries))*sqrt(annualizationScaling)

               volYear <- as.yearmon(paste0(year,"-","12"))

               volatilityAnn <- xts(x = volatilityAnn, order.by = volYear)

               return(volatilityAnn)
             }
      )
    volatilityAnnual <- do.call(rbind, volatilityAnnual)

    return(volatilityAnnual)
  }

stopCluster(cl = cl)
cl <- NULL
gc()

endTime <- Sys.time()
endTime - startTime
#Time difference of 1.408692 mins
#On 8 cores

#name the list
names(stockVolatility) <- ls(envir = stockPriceEnv)
# http://www.crsp.com/products/documentation/stock-file-indexes-0

Volatiliy Decile Replication

Next, the data is checked for continuity, i.e. that there are no missing years.

#Grab min and max year from each stock
stockVolatilityYears <-
lapply(stockVolatility
       ,function(stockVolVector){
         years <- index(stockVolVector)
         years <- str_sub(string = years, start = 5)
         yearRange <- c(min(years)
                        ,max(years)
                        )
         return(yearRange)
       }

)
#merge min and max into a table
stockVolatilityYears <- do.call(rbind, stockVolatilityYears)

The plot below indicates a break before 1973, indicating that there are gaps in the data prior to that year.

##  [1] "1964" "1970" "1973" "1974" "1975" "1976" "1977" "1978" "1979" "1980"
## [11] "1981" "1982" "1983" "1984" "1985" "1986" "1987" "1988" "1989" "1990"
## [21] "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999" "2000"
## [31] "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009" "2010"
## [41] "2011" "2012" "2013" "2014" "2015"

Thus, the data is filtered to ensure that the time series begins in 1973.

stockVolatilityYearsSeq <- sort(unique(stockVolatilityYears[,1]))
#exclude first two entries, we want a continuous time series
stockVolatilityYearsSeq <- stockVolatilityYearsSeq[3:length(stockVolatilityYearsSeq)]

Next, stocks are assigned to deciles. In order to accomplish this, the code below extracts all volatilities for the stocks for each year…

#now build a list of volatility vectors

#loop through the years
cl <- makeForkCluster(nnodes = detectCores())
startTime <- Sys.time()

stockVolatilityTimeSeriesList <-
   parLapply(cl
  , stockVolatilityYearsSeq
  ,function(year){
    #here we'll loop through the stocks, and pull out all the years
    volList <-
    lapply(stockVolatility
           , function(stockVolVector){
             return(stockVolVector[year])
           }
    )

    volVector <- do.call(rbind, volList)

    return(volVector)

  }
)

stopCluster(cl = cl)
cl <- NULL
gc()

#set names
names(stockVolatilityTimeSeriesList) <- stockVolatilityYearsSeq

endTime <- Sys.time()
endTime - startTime
# Time difference of 53.64748 sec
# On 8 cores</code></pre>
…and then assigns the volatilities to deciles for each year.
<pre class="r"><code>decilesByYear <-
lapply(stockVolatilityTimeSeriesList
       , function(stockVolatilityTimeSeries){
                  quantile(x = stockVolatilityTimeSeries
                          , probs = seq(0,1,0.1)
                          , na.rm = TRUE
                          )
                  }
      )
decilesByYear <- do.call(rbind,decilesByYear)

years <- rownames(decilesByYear)
deciles <- colnames(decilesByYear)

By summarizing the results, outliers can clearly be observed in the more recent years. This indicates that the high volatility stocks can very likely also benefit from additional QC, however to keep this post brief, no additional QC will be conducted.

Volatility Decile by Year (1995 and onward)

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
1995 0.04 0.16 0.19 0.22 0.27 0.32 0.38 0.46 0.57 0.75 5.28
1996 0.10 0.16 0.20 0.24 0.28 0.33 0.40 0.50 0.63 0.79 31.40
1997 0.00 0.18 0.24 0.27 0.31 0.37 0.43 0.51 0.62 0.80 6.27
1998 0.00 0.22 0.30 0.35 0.40 0.45 0.53 0.61 0.73 0.95 23.28
1999 0.00 0.21 0.28 0.34 0.39 0.44 0.52 0.61 0.74 1.01 14.21
2000 0.08 0.23 0.33 0.40 0.46 0.53 0.63 0.76 0.97 1.29 92.44
2001 0.07 0.22 0.29 0.35 0.41 0.48 0.57 0.69 0.85 1.11 4.98
2002 0.06 0.21 0.28 0.34 0.38 0.45 0.52 0.62 0.79 1.02 8.00
2003 0.06 0.16 0.21 0.26 0.30 0.34 0.40 0.49 0.62 0.85 12.00
2004 0.07 0.15 0.20 0.23 0.27 0.31 0.36 0.44 0.54 0.71 36.57
2005 0.04 0.14 0.19 0.22 0.26 0.30 0.34 0.39 0.47 0.63 39.75
2006 0.03 0.13 0.18 0.22 0.26 0.30 0.35 0.41 0.47 0.61 4.79
2007 0.06 0.18 0.24 0.28 0.31 0.35 0.39 0.44 0.51 0.65 11434.96
2008 0.07 0.41 0.50 0.57 0.63 0.70 0.77 0.85 0.96 1.15 83.85
2009 0.05 0.29 0.38 0.45 0.52 0.59 0.67 0.77 0.91 1.18 103.81
2010 0.00 0.19 0.24 0.28 0.33 0.37 0.42 0.48 0.56 0.70 13.77
2011 0.04 0.21 0.27 0.33 0.37 0.42 0.47 0.53 0.62 0.76 188.01
2012 0.04 0.14 0.19 0.23 0.28 0.32 0.37 0.43 0.53 0.69 501.00
2013 0.01 0.14 0.18 0.21 0.24 0.28 0.32 0.38 0.48 0.66 38.60
2014 0.01 0.12 0.17 0.21 0.25 0.29 0.34 0.41 0.51 0.68 424.01
2015 0.03 0.15 0.21 0.24 0.28 0.33 0.39 0.47 0.58 0.75 1493.12

Now that volatility has been assigned to deciles for each year, stocks can be assigned deciles based on their volatilities for each year. This takes roughly 37 minutes on my machine using all 8 cores.

#now assign the deciles to each of the stocks by year
cl <- makeForkCluster(nnodes = detectCores())
startTime <- Sys.time()
registerDoParallel(cl)

stockDeciles <- foreach(stockVolVector = stockVolatility, .errorhandling = "pass") %dopar%
       {

         if(prod(is.na(stockVolVector))){return(NA)} else {

           stockDecileVector <- stockVolVector
           stockDecileVector <- xts(x=rep(NA,nrow(stockDecileVector)),order.by = index(stockDecileVector))

                  for(dec in deciles){
                      for(yr in years){
                                      stockDecileVector[yr] <- ifelse(stockVolVector[yr] <= decilesByYear[yr,dec] & is.na(stockDecileVector[yr]),dec,stockDecileVector[yr])
                                }
                  }
                  return(stockDecileVector)
         }
       }

stopCluster(cl = cl)
cl <- NULL
gc()

#assign names
names(stockDeciles) <- names(stockVolatility)

endTime <- Sys.time()
endTime - startTime
# Time difference of 37.39486 mins
# on 8 cores

So far, we have lost 212 stocks from a failure to download, and after running the code block above, we have also lost an additional 392 symbols due to NA or NULL values, which ultimately results in a loss of 9.81% of symbols.

Based on the results in the previous block of code, decile portfolios can be constructed for each year. The following code takes roughly 24 minutes to run on my machine (using 8 cores).

#assign stocks to decile portfolios
cl <- makeForkCluster(nnodes = detectCores())
startTime <- Sys.time()

decilePortfolioWeights <-
parLapply(cl
       , deciles[-1]
       , function(dec){

         matrixWeights <-
         lapply(years
                ,function(yr){

                            rowVectorWeights <-
                            lapply(names(stockDeciles)
                                   , function(stock){

                                     assignedStockDecile <- stockDeciles[[stock]][yr]

                                     #There will be an error if the stock previously didn't meet a condition to get assigned a volatility (too few obs)
                                     if(length(assignedStockDecile)==0 | class(stockDeciles[[stock]])[1] == "simpleError" | prod(is.na(stockDeciles[[stock]]))){      dummyResult <- xts(x = 0
                                                                                            , order.by = as.yearmon(paste0(yr,"-12"))
                                                                                            )
                                                                              colnames(dummyResult) <- stock
                                                                              return(dummyResult)

                                                                               } else {
                                      assignedWeight <- ifelse(assignedStockDecile == dec,1,0)
                                     }
                                     colnames(assignedWeight) <- stock
                                     return(assignedWeight)
                                   }

                            )
                            rowVectorWeights <- do.call(cbind, rowVectorWeights)
                            return(rowVectorWeights)
                }
         )

         matrixWeights <- do.call(rbind, matrixWeights)
         return(matrixWeights)

       }
       )

stopCluster(cl = cl)
cl <- NULL
gc()

#assign names
names(decilePortfolioWeights) <- deciles[-1]

endTime <- Sys.time()
endTime - startTime

# Time difference of 24.0372 mins
# on 8 cores

The code above generates a matrix of 1/0 flags, where the rows are the years, and the columns are the stocks. Two things now need to happen to proceed. First, the date needs to be converted to an end-of-year date, since CRSP assigns the portfolio weights in the subsequent year (later, PerformanceAnalytics::Return.portfolio() will then assign the weights to the subsequent year). Second, the 1/0 flags need to be converted to weights, so that they sum to 1 for each year (each row).

#convert to date at end of year
decilePortfolioWeights <- lapply(decilePortfolioWeights
                                 , function(decilePortfolio){
                                   decilePortfolioIDX <- index(decilePortfolio)
                                   decilePortfolioIDX <- as.Date(timeLastDayInMonth(as.Date(decilePortfolioIDX)))
                                   decilePortfolio <- xts(x = coredata(decilePortfolio)
                                                          , order.by = decilePortfolioIDX
                                                          )
                                   return(decilePortfolio)
                                 }
                                 )

#Now convert them from flags to weights
decilePortfolioWeights <- lapply(decilePortfolioWeights
                                 , function(decilePortfolio){
                                   decilePortfolioIDX <- index(decilePortfolio)
                                   decilePortfolioMatrix <- coredata(decilePortfolio)

                                   sumVector <- apply(decilePortfolioMatrix,1,sum)

                                   scaledDecilePortfolioMatrix <- lapply(1:length(sumVector)
                                                                         ,function(rowNum){
                                                                           scaledVector <- decilePortfolioMatrix[rowNum,]/sumVector[rowNum]
                                                                           return(scaledVector)
                                                                         }
                                                                         )
                                   scaledDecilePortfolioMatrix <- do.call(rbind, scaledDecilePortfolioMatrix)

                                   decilePortfolio <- xts(x = coredata(scaledDecilePortfolioMatrix)
                                                          , order.by = decilePortfolioIDX
                                   )
                                   return(decilePortfolio)
                                 }
)

Some checks are performed to ensure that the weights sum to 1 across all years.

Verification of Weight-Sums Across Deciles and Years

#check, should see 1's across the board
xts(
  do.call(cbind,
lapply(decilePortfolioWeights
      , function(x){apply(coredata(x),1,sum)}
))
, index(decilePortfolioWeights[[1]])
)
##            10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1973-12-31   1   1   1   1   1   1   1   1   1    1
## 1974-12-31   1   1   1   1   1   1   1   1   1    1
## 1975-12-31   1   1   1   1   1   1   1   1   1    1
## 1976-12-31   1   1   1   1   1   1   1   1   1    1
## 1977-12-31   1   1   1   1   1   1   1   1   1    1
## 1978-12-31   1   1   1   1   1   1   1   1   1    1
## 1979-12-31   1   1   1   1   1   1   1   1   1    1
## 1980-12-31   1   1   1   1   1   1   1   1   1    1
## 1981-12-31   1   1   1   1   1   1   1   1   1    1
## 1982-12-31   1   1   1   1   1   1   1   1   1    1
## 1983-12-31   1   1   1   1   1   1   1   1   1    1
## 1984-12-31   1   1   1   1   1   1   1   1   1    1
## 1985-12-31   1   1   1   1   1   1   1   1   1    1
## 1986-12-31   1   1   1   1   1   1   1   1   1    1
## 1987-12-31   1   1   1   1   1   1   1   1   1    1
## 1988-12-31   1   1   1   1   1   1   1   1   1    1
## 1989-12-31   1   1   1   1   1   1   1   1   1    1
## 1990-12-31   1   1   1   1   1   1   1   1   1    1
## 1991-12-31   1   1   1   1   1   1   1   1   1    1
## 1992-12-31   1   1   1   1   1   1   1   1   1    1
## 1993-12-31   1   1   1   1   1   1   1   1   1    1
## 1994-12-31   1   1   1   1   1   1   1   1   1    1
## 1995-12-31   1   1   1   1   1   1   1   1   1    1
## 1996-12-31   1   1   1   1   1   1   1   1   1    1
## 1997-12-31   1   1   1   1   1   1   1   1   1    1
## 1998-12-31   1   1   1   1   1   1   1   1   1    1
## 1999-12-31   1   1   1   1   1   1   1   1   1    1
## 2000-12-31   1   1   1   1   1   1   1   1   1    1
## 2001-12-31   1   1   1   1   1   1   1   1   1    1
## 2002-12-31   1   1   1   1   1   1   1   1   1    1
## 2003-12-31   1   1   1   1   1   1   1   1   1    1
## 2004-12-31   1   1   1   1   1   1   1   1   1    1
## 2005-12-31   1   1   1   1   1   1   1   1   1    1
## 2006-12-31   1   1   1   1   1   1   1   1   1    1
## 2007-12-31   1   1   1   1   1   1   1   1   1    1
## 2008-12-31   1   1   1   1   1   1   1   1   1    1
## 2009-12-31   1   1   1   1   1   1   1   1   1    1
## 2010-12-31   1   1   1   1   1   1   1   1   1    1
## 2011-12-31   1   1   1   1   1   1   1   1   1    1
## 2012-12-31   1   1   1   1   1   1   1   1   1    1
## 2013-12-31   1   1   1   1   1   1   1   1   1    1
## 2014-12-31   1   1   1   1   1   1   1   1   1    1
## 2015-12-31   1   1   1   1   1   1   1   1   1    1

Here is another check, where the number of stocks for each decile by year is computed. What we should observe, and do ultimately see, is an approximately equal distribution of stocks across all deciles within a year. Or in other words, the number of stocks across each row (year) should nearly all be equal.

Count of Stocks Across Deciles and Years

#Number of stocks in each portfolio by year
do.call(cbind,
lapply(decilePortfolioWeights
       , function(x){vector <- apply(coredata(x),1,function(y){length(which(y>0))})
                      names(vector) <- years
                      return(vector)
         }
))
##      10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1973   4   5   5   5   5   4   5   5   5    5
## 1974   7   8   7   8   7   8   7   8   7    8
## 1975   7   8   7   8   8   7   8   7   8    8
## 1976   7   8   8   7   8   8   7   8   8    8
## 1977   8   9   9   8   9   9   8   9   9    9
## 1978   8   9   9   9   9   9   9   9   9    9
## 1979   9   9   9   9   9   9   9   9   9    9
## 1980  10  10  10  10  10  10  10  10  10   10
## 1981  18  19  19  19  19  19  19  19  19   19
## 1982  22  22  22  22  23  22  22  22  22   23
## 1983  22  23  23  23  23  22  23  23  23   23
## 1984  26  27  27  26  27  27  26  27  27   27
## 1985  35  35  35  35  36  35  35  35  35   36
## 1986  38  40  40  39  40  40  39  40  40   40
## 1987  44  44  44  44  45  44  44  44  44   45
## 1988  62  62  62  62  62  62  62  62  62   62
## 1989  64  65  65  65  65  65  65  65  65   65
## 1990  72  72  72  73  72  72  73  72  72   73
## 1991  95  95  95  95  95  95  95  95  95   95
## 1992 115 115 115 116 115 115 116 115 115  116
## 1993 129 129 130 129 130 129 129 130 129  130
## 1994 146 146 146 147 146 146 147 146 146  147
## 1995 161 162 162 161 162 162 161 162 162  162
## 1996 182 182 182 182 183 182 182 182 182  183
## 1997 201 201 201 201 202 201 201 201 201  202
## 1998 215 216 216 216 216 215 216 216 216  216
## 1999 231 232 231 232 232 231 232 231 232  232
## 2000 250 251 250 251 251 250 251 250 251  251
## 2001 264 265 264 265 264 265 264 265 264  265
## 2002 272 272 273 272 273 272 272 273 272  273
## 2003 284 285 285 285 285 284 285 285 285  285
## 2004 301 301 302 301 302 301 301 302 301  302
## 2005 320 320 320 321 320 320 321 320 320  321
## 2006 339 339 339 339 340 339 339 339 339  340
## 2007 358 358 358 358 358 358 358 358 358  359
## 2008 383 383 383 383 384 383 383 383 383  384
## 2009 395 396 395 396 396 395 396 395 396  396
## 2010 406 406 407 406 407 406 406 407 406  407
## 2011 429 430 429 430 430 429 430 429 430  430
## 2012 451 452 451 452 452 451 452 451 452  452
## 2013 477 477 477 478 477 477 478 477 477  478
## 2014 512 513 513 512 513 513 512 513 513  513
## 2015 554 555 555 555 555 555 555 555 555  555

We are almost ready to compute the volatility decile portfolios. The column names across all volatility deciles are verified to ensure that they are equal across all years. This will be used to match weights to the stock returns when computing a portfolio.

#Make sure they match decilePortfolioWeights names

#Should return empty values (since it is looking for where it is NOT TRUE)
which(
!apply(
  do.call(rbind,lapply(decilePortfolioWeights
       , function(x){colnames(x)}
))
,2
,function(y){isTRUE(all.equal( max(y) ,min(y)) )
  #This function taken from here:
  #http://stackoverflow.com/questions/4752275/test-for-equality-among-all-elements-of-a-single-vector
  }
))
## integer(0)

It returns an empty result, indicating that any one of the list item’s names can be used as a reference.

#So we can take the first since we know they're all equal
stockNames <- colnames(decilePortfolioWeights[[1]])

Next, the “stockNames” variable contents are matched to the stock price names so that matrix multiplication can be performed later (i.e. the order of the weights match the order of the stock returns).

stockEnvIndex <-
do.call(c,
        lapply(stockNames
               ,function(stock){which(stock == names(as.list(stockPriceEnv)))})
)

Now, a very large matrix of prices, in the same order as the stock weights, will be created. Ideally, the xts::merge() function would be used. However, since there is a large volume of data, my machine cannot handle it. The way around it, is to loop through all the stocks and join them together, very slowly. The code below runs in sequence (one core) and takes just over 30 minutes on my machine.

#CRSP uses adjusted prices (for dividends)
#http://www.crsp.com/files/data_descriptions_guide_0.pdf #p.102
adjustedColumn <- grep(".Adjusted",colnames(as.list(stockPriceEnv)[[stockEnvIndex[1]]]))

#First one...
stockPriceMatrix <- as.list(stockPriceEnv)[[stockEnvIndex[1]]][,adjustedColumn]
colnames(stockPriceMatrix) <- gsub(".Adjusted","",colnames(stockPriceMatrix))
##head(stockPriceMatrix)

startTime <- Sys.time()
#the rest
for(stockIDX in 2:length(stockEnvIndex)){
  stockVector <- as.list(stockPriceEnv)[[stockEnvIndex[stockIDX]]][,adjustedColumn]
  colnames(stockVector) <- gsub(".Adjusted","",colnames(stockVector))
  #head(stockVector)
  stockPriceMatrix <- merge(stockPriceMatrix,stockVector)
}
rm(stockVector)
endTime <- Sys.time()
endTime - startTime
# Time difference of 30.50445 mins
# on 1 core

So we have a matrix of prices, in the same order as the weights that were calculated. The code below verifies that the order is the same across all decile portfolios.

#check that names and order of names match
do.call(c,
lapply(decilePortfolioWeights
       ,function(col){
identical(
colnames(stockPriceMatrix)
, colnames(col)
)
       })
)
##  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Returns then need to be calculated from the matrix of prices.

#Now we can calculate linear returns
stockLRetMatrix <-retMatrixL(stockPriceMatrix)

Finally, the decile portfolio returns can be calculated, using the stock returns and weights that were calculated previously.

decilePortfolioMatrix <-
  do.call(merge,
          lapply(decilePortfolioWeights
                 ,function(portfolioWeights){
                   Return.rebalancing(R = stockLRetMatrix
                                      , weights = portfolioWeights
                   )

                 }
          ))
names(decilePortfolioMatrix) <- names(decilePortfolioWeights)

Below, the cumulative returns are plotted on a log-scale, which facilitates a visual comparison of the decile performance. We can see that the higher volatility deciles exhibit higher cumulative returns, which is what is expected as the volatility increases.

Cumulative Returns of Volatility Decile Portfolios (Log Scale)

Numeric results can be generated using the following line of code.

computeTimeSeriesStatistics(linearReturnMatrix = decilePortfolioMatrix)

The Sharpe Ratio declines as the volatility increases, which is consistent with the low-volatility anomaly. However, the peculiar result is that the Sharpe ratio then increases at the high volatility portfolios. One reason for this could be the embedded survivorship bias, since riskier stocks in the high volatility portfolios would be excluded, thereby increasing risk-adjusted returns. However, if we compare the Sharpe-Ratio-trend against actual CRSP data results, seen in the table below, they exhibit the initial decline in Sharpe Ratio, and subsequent rise in Sharpe Ratio for the higher volatility deciles, which conflicts with the low-volatility anomaly results. Overall, replicated volatility decile returns, risk, and tail risk increase as the volatility increases, which is consistent with expectations, and which is observed in the actual CRSP volatility decile performance.

Replicated Volatility Decile Results

10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Annualized Return 9.77 11.92 11.85 12.60 13.51 14.48 15.22 16.69 18.85 33.46
Annualized Standard Deviation 8.62 11.48 14.13 14.93 15.80 16.80 17.83 19.08 20.15 25.30
Semi-Deviation 6.28 8.43 10.20 10.84 11.52 12.18 12.93 13.91 14.79 17.65
Annualized Sharpe Ratio (Rf=0%) 1.13 1.04 0.84 0.84 0.86 0.86 0.85 0.87 0.94 1.32
Adjusted Sharpe ratio (Risk free = 0) -3.10 -0.30 0.22 0.35 0.43 0.51 0.53 0.57 0.57 0.33
Average Drawdown 1.13 1.40 1.78 2.04 2.18 2.56 2.84 3.01 3.33 4.05
Average Recovery 11.73 10.07 10.50 10.75 10.63 11.08 12.39 12.60 13.50 13.81
VaR 95 -0.68 -1.01 -1.27 -1.39 -1.43 -1.57 -1.66 -1.83 -1.97 -2.29
VaR 99 -1.51 -2.05 -2.39 -2.47 -2.70 -2.86 -2.97 -3.14 -3.47 -4.08
ETL 95 -1.26 -1.71 -2.05 -2.16 -2.31 -2.45 -2.60 -2.78 -2.98 -3.56
ETL 99 -2.44 -3.07 -3.68 -3.80 -4.07 -4.22 -4.41 -4.67 -4.98 -6.03
Worst Loss -11.13 -13.77 -15.23 -15.98 -15.02 -15.04 -15.54 -15.44 -16.08 -12.95
Skewness -0.02 -0.77 -0.28 -0.61 -0.58 -0.44 -0.42 -0.46 -0.54 0.45
Excess Kurtosis 71.10 25.61 22.85 16.06 13.08 10.96 10.31 8.76 8.26 12.49

Actual Volatility Decile Results

Table extracted from:

Han, Yufeng, Ke Yang, and Guofu Zhou. 2013. “A New Anomaly: The Cross-Sectional Profitability of Technical Analysis.” Journal of Financial and Quantitative Analysis 48 (5). Cambridge University Press: 1433–61. http://dx.doi.org/10.1017/S0022109013000586.

Conclusion

This post presents R code which uses free Yahoo! Finance data to replicate the performance of CRSP volatility deciles. Though the replicated portfolios contain survivorship bias, the overall risk and return characteristics match those exhibited by actual CRSP volatility decile portfolios.

© 2016 Erol Biceroglu

An EMA Trading Strategy for a Low Volatility Portfolio

Introduction

The process I’m going to follow is based on content from the University of Washington’s CFRM561 course Advanced Trading System Design. “Hypothesis driven development” is the core principle of this course, where each step in the development process involves hypothesizing testable ideas, and verifying these ideas before proceeding to the next stage. The stages involve identifying one or more market indicators, testing that the indicators are actually measuring the intended market phenomena, hypothesizing entry and exit signals based on the market indicator(s), confirming if the signals have predictive power, and then setting entry and exit rules based on the signals.

I am personally a fan of low volatility investing, insofar as it provides enhanced risk-adjusted returns vs. traditional cap-weighted market indices. Thus, I have reconstructed the CRSP lowest volatility decile portfolio using data from Yahoo! Finance, and will use this index as the baseline for evaluating the trading strategy. The goal here is to verify if a Exponentially-Weighed Moving Average [EMA] trading strategy can further enhance risk-adjusted returns by either increasing returns without increasing risk, decreasing risk without impacting returns, or both.

This post is a significantly condensed version of the full report. The full-version report, which is the actual assignment I submitted, can be found here. It contains all the details for each step, results of the hypothesis tests, including the signal strength, as well as the interim parameter optimization steps.

Indicator

The first question related to the indicator that we need to ask ourselves is, “what do we think we’re measuring?”

It can be argued that the EMA is measuring the “true de-noised price level”, where the thinking is that the moving average “averages” the noise away, thereby revealing the true level.

It can also be argued that, assuming the EMA is measuring the “true” level, it is also measuring the price trend. The rationale here is that, if there’s a true level revealed, then the trend can be inferred from the time series of “true levels”.

In order to test that EMA is measuring the true level, the correlation between the low-vol index level and EMA indicator is tested for strength and statistical significance. To test the trend, it is compared against a measure of “true” slope by testing for high and statistically significant correlation and cointegration.

Signal

Prior to defining any signal, the main prediction that might be possible with an EMA is stating whether or not the future index level will be greater than or less than the current index level, instead of generating a point-forecast. For this tool, confusion matrices are used to measure if the future-index-level-forecasts are correctly classified as higher or lower than current index levels. The statistical significance of the confusion matrices are also verified to ensure that there is some information content.

In terms of the actual signal, the thought behind it is that if the current index level exceeds the EMA level, this will increase the subsequent EMA value (which was verified as a measure of “true” price level), which will ultimately increase the EMA trend. This would then be interpreted as a potential signal that the index level will rise.

The rationale for the converse stands, in that, if the current index level is less than the EMA level, this will decrease the subsequent EMA value (which is a measure of “true” price level), thereby indicating that the index level will potentially decrease.

Rule

The rules are simple: if it is likely that the price will increase based on the relationship between indicator and index, (i.e. the index level exceeds the EMA level) then a long position should be initiated. Conversely, if it is likely that the price will decrease based on the same relationship, then any long position(s) should be closed.

Benchmarks and Objectives of Strategy

Objectives

The main goal is to verify if the risk-adjusted returns for the low volatility index can be enhanced. Thus, maximizing a measure of risk-adjusted performance is the objective of this strategy, since leverage can always be applied to increase absolute returns.

MAX(return)

\; \; \; \; s.t. MIN(risk)

Given that the objective is to maximize risk-adjusted returns, the following statistics can be used to quantify risk-adjusted returns:

  • Sharpe Ratio [SR]
  • Profit-to-Max Drawdown [PMD]

During the trading simulations, PMD will be used to assess risk-adjusted returns as it is more sensitive to the tails than the SR. However, in order to evaluate the viability of the trading strategy, the SR will be used.

Benchmarks

Results of the EMA trading strategy will be compared against a 100% long version of the low volatility index. This portfolio is the ‘buy-and-hold’ [BH] version of the index, whereas the trading strategy will be referred to as the Moving-Average Portfolio [MAPortfolio]. The BH portfolio is the “no effort” portfolio with which to measure if the trading signals are adding value. Success is not defined on absolute returns; it may be possible to observe absolute lower returns, however exhibit higher risk-adjusted returns.

Results

Signal Strength Results

The Heidke Skill Score [HSS] is used to quantify signal strength, as it is more unforgiving than the “Probability of Detection”.

HSS for the up signal indicates that the skill level is relatively low for lag days of 15 or greater. Moreover, the skill level is highest for 1 day-forward with a 5 day lag, and drops off relatively rapidly for a 5-day lag when the number of forward days exceed 3 days.

The HSSs are much higher for the down signal, and appear to persist for 1 – 5 forward days, and lag days up to 60 days, assuming that less than 1% skill is not enough to make useful predictions. To clarify, the 60-day-persistence isn’t visible on the contour plot, however numerically the HSSs are more persistent for the down signal vs. the up signal. So far, this implies that the down signal contains more predictive power than the up signal.

All skill scores determined to be “strong” are also statistically significant, both for up and down signals.

Trading Strategy Results

After testing various lag-lengths exhibiting high HSSs for both up and down signals, as well as performing a parameter optimization, the conclusion detailed in the full report was that it is likely more reliable to set the lag length based only on the HSS strength instead of the backtest results due to poor out-of-sample performance [OOS]. The OOS results, as well as the code, for the high HSS lag lengths (5 days for the up signal and 5 days for the down signal) are displayed below.

decilePortfolioMatrixLevel <- do.call(merge,
                                      lapply(decilePortfolioMatrix
                                             ,retToLevel))



filterRange <- "1995/"

MAPortfolioALL <- decilePortfolioMatrixLevel[filterRange]
startDate<-as.Date(index(MAPortfolioALL)[1], format="%Y-%m-%d")
endDate<-as.Date(index(last(MAPortfolioALL)), format="%Y-%m-%d")

currency("USD")

stock(primary_id = c("MAPortfolio"),currency = "USD")

#name
maStrategy <-"MAStrategy"

#Date, one day before prices
strategyDate <- min(index(MAPortfolioALL)) - 1

#*******************************
#Used to reset for testing
#rm.strat(maStrategy)
#rm(mktdata)
#if (!exists('.blotter')) .blotter <- new.env()
#if (!exists('.strategy')) .strategy <- new.env()
#*******************************

NumSh<-1
NumDeciles <- ncol(MAPortfolioALL)
maxDecile<-1

MAPortfolio <- MAPortfolioALL[,maxDecile]
colnames(MAPortfolio) <- "MAPortfolio.Close"

maLag <- 5
maLagDown <- 5


#init portfolio and account
initPortf(name = maStrategy
          , symbols = list("MAPortfolio") #as defined in Financial instrument
          , initDate = strategyDate
)

initAcct(name = maStrategy
         ,portfolios = maStrategy
         ,initDate = strategyDate
         ,initEq = 1 
)

#order book, and strategy
initOrders(portfolio = maStrategy
           , initDate = strategyDate
)

#position limits
addPosLimit(maStrategy, symbol = "MAPortfolio", strategyDate, maxpos = NumSh, longlevels = NumSh, minpos = 0)

strategy( maStrategy, store = TRUE)

#add indicator
add.indicator(strategy = maStrategy
              , name = "EMA"
              , arguments = list(x = quote(mktdata$MAPortfolio.Close), n = quote(maLag))
              , label = "MAPortfolioma"
)

add.indicator(strategy = maStrategy
              , name = "EMA"
              , arguments = list(x = quote(mktdata$MAPortfolio.Close), n = quote(maLagDown))
              , label = "MAPortfoliomaDown"
)

#MAPortfolio is greater than N-Lag Moving Average
add.signal(strategy = maStrategy
           , name = "sigComparison"
           , arguments = list(columns = c("MAPortfolio.Close","MAPortfolioma")       
                              , relationship = "gt"
           )
           , label = "MAPortfolio.gt.ma"
)


#MAPortfolio is less-than-equal-to N-Lag Moving Average
add.signal(strategy = maStrategy
           , name = "sigComparison"
           , arguments = list(columns = c("MAPortfolio.Close","MAPortfoliomaDown")
                              , relationship = "lte"
           )
           , label = "MAPortfolio.lte.ma"
)




#Entry and exit rules

#buy MAPortfolio when above N-Lag Moving Average
add.rule(strategy = maStrategy
         , name = "ruleSignal"
         , arguments = list(sigcol = "MAPortfolio.gt.ma"
                            , sigval = TRUE
                            , orderqty = NumSh
                            , ordertype = "market"
                            , orderside = NULL
                            , osFUN = "osMaxPos"
                            , symbol = "MAPortfolio"
         )
         , type = "enter"
)

#sell MAPortfolio when above N-Lag Moving Average
add.rule(strategy = maStrategy
         , name = "ruleSignal"
         , arguments = list(sigcol = "MAPortfolio.lte.ma"
                            , sigval = TRUE
                            , orderqty = "all"
                            , ordertype = "market"
                            , orderside = NULL
                            , osFUN = "osMaxPos"
                            , symbol = "MAPortfolio"
         )
         , type = "exit"
)

applyStrategy(strategy = maStrategy
              , portfolios = maStrategy
)

updatePortf(maStrategy)
updateAcct(maStrategy)
updateEndEq(maStrategy)

BH MAPortfolio
Annualized Return 8.66 10.18
Annualized Standard Deviation 7.62 4.19
Semi-Deviation 5.61 3.04
Annualized Sharpe Ratio (Rf=0%) 1.14 2.43
Adjusted Sharpe ratio (Risk free = 0) -7.62 -18.36
Average Drawdown 0.84 0.49
Average Recovery 9.33 7.05
VaR 95 NA -0.25
VaR 99 -16.93 -2.99
ETL 95 -25.02 -0.25
ETL 99 -16.93 -2.99
Worst Loss -6.64 -4.17
Skewness 1.88 -0.84
Excess Kurtosis 152.83 37.56
## [1] "t-test p-value:  0.52"
## [1] "F-test p-value:  0"

One key observation is that the performance during the 2008 financial crisis was relatively spectacular, with a drawdown of only approximately 10%, vs. 40% for the BH index. During the start of the OOS performance, there was some underperformance, up until the dot-com crash in late 1999 and early 2000, where drawdowns for the MA portfolio were minimal, allowing returns to be maintained.

Mean returns are not statistically significantly different, however variance is statistically significantly lower, resulting in the superior Sharpe Ratio of 2.43, vs. 1.14 for the BH Index. Unfortunately, the trading strategy does not exhibit positive skew, whereas, surprisingly, the BH Index exhibits positive skew. However, excess kurtosis, though high for both portfolios, is significantly higher for the BH index. It appears as though the trading strategy is significantly reducing the impact of negative market events, allowing the preservation of accumulated returns, with the drawback of potentially missing out on any upside. Given that the objective is to enhance risk-adjusted returns, the biased parameter combination appears to successfully meet this criteria. One additional factor is the transaction costs, which were not considered in this study. These would need to be factored in to the backtests to verify that superior risk-adjusted returns are maintained.

© 2016 Erol Biceroglu