Value-at-risk (VaR) models are useful only if they predict risk with reasonable accuracy. This is why the application of these models should always be accompanied by validation. Model validation is the general process of checking whether a model is adequate. This can be done with a set of tools, including backtesting, stress testing, and independent review and oversight.
Backtesting is a formal statistical framework that consists of verifying that actual losses are in line with projected losses. This involves systematically comparing the history of VaR forecasts with their associated portfolio returns.
Importance of Backtesting
Backtesting is essential for VaR users and risk managers, who need to check that their VaR forecasts are well calibrated. If not, the models should be reexamined for faulty assumptions, wrong parameters, or inaccurate modeling. Then the models should be improved based on the ideas provided by the backtesting process.
Backtesting is also central to the Basel Committee’s ground-breaking decision to allow internal VaR models for capital requirements. The backtesting framework should be designed to maximize the probability of catching banks that deliberately understate their risk, but at the same time, the framework should also ensure that banks whose VaR is exceeded simply because of bad luck should not be penalized.
Setup For Backtesting
When the model is perfectly calibrated, the number of observations falling outside VaR should be in line with the confidence level. The number of exceedances is also known as the number of exceptions. With too many exceptions, the model underestimates risk. This is a major problem because too little capital may be allocated to risk-taking units; penalties also may be imposed by the regulator. Too few exceptions are also a problem because they lead to an excessive, or inefficient, allocation of capital across units.
An Example
An example of model calibration is described in this figure, which displays the fit between actual and forecast daily VaR numbers for Bankers Trust. The diagram shows the absolute value of the daily profit and loss (P&L) against the 99 percent VaR, defined here as the daily price volatility. Observations that lie above the diagonal line indicate days when the absolute value of the P&L exceeded the VaR.
Assuming symmetry in the P&L distribution, about 2 percent of the daily observations (both positive and negative) should lie above the diagonal, or about 5 data points in a year. Here we observe four exceptions. Thus the model seems to be well calibrated. We could have observed, however, a greater number of deviations simply owing to bad luck. The question is: At what point do we reject the model?
Data Issues in Backtesting
VaR measures assume that the current portfolio is “frozen” over the horizon. In practice, the trading portfolio evolves dynamically during the day. Thus the actual portfolio is “contaminated” by changes in its composition. The actual return corresponds to the actual P&L, taking into account intraday trades and other profit items such as fees, commissions, spreads, and net interest income.
This contamination will be minimized if the horizon is relatively short, which explains why backtesting usually is conducted on daily returns.
Sometimes an approximation is obtained by using a cleaned return, which is the actual return minus all non-mark-to-market items, such as fees, commissions, and net interest income. Under the latest update to the market-risk amendment, supervisors will have the choice to use either hypothetical or cleaned returns.
For verification to be meaningful, the risk manager should track both the actual portfolio return Rt and the hypothetical return Rt∗ that most closely matches the VaR forecast. The hypothetical return Rt∗ represents a frozen portfolio, obtained from fixed positions applied to the actual returns on all securities, measured from close to close.
Ideally, both actual and hypothetical returns should be used for backtesting because both sets of numbers yield informative comparisons. If, for instance, the model passes backtesting with hypothetical but not actual returns, then the problem lies with intraday trading. In contrast, if the model does not pass backtesting with hypothetical returns, then the modeling methodology should be reexamined.
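As a minimal sketch of this bookkeeping (the array names and sample values below are hypothetical, not taken from the text), the following Python snippet counts exceptions for actual and hypothetical returns against the same series of VaR forecasts:

import numpy as np

def count_exceptions(returns, var_forecasts):
    # Count days on which the loss exceeded the VaR forecast.
    # returns: daily P&L (negative = loss); var_forecasts: VaR reported as a positive number.
    returns = np.asarray(returns, dtype=float)
    var_forecasts = np.asarray(var_forecasts, dtype=float)
    return int(np.sum(returns < -var_forecasts))

# Hypothetical inputs: actual (traded) P&L and hypothetical (frozen-portfolio) P&L
actual_returns = np.array([-1.2, 0.4, -2.5, 0.1, -0.3])
hypothetical_returns = np.array([-1.0, 0.5, -2.1, 0.2, -0.4])
var_forecasts = np.array([2.0, 2.0, 2.0, 2.0, 2.0])

print("exceptions (actual):      ", count_exceptions(actual_returns, var_forecasts))
print("exceptions (hypothetical):", count_exceptions(hypothetical_returns, var_forecasts))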
Model Backtesting With Exceptions
Model backtesting involves systematically comparing historical VaR measures with the actual returns. Since VaR is reported only at a specified confidence level, actual losses should exceed the VaR figure some of the time. For example, with a 95 percent confidence level VaR, we expect the actual loss to exceed the VaR 5 percent of the time.
But most of the time we will not observe exactly 5 percent exceptions. A greater percentage could occur because of bad luck, perhaps 8 percent. At some point, if the frequency of deviations becomes too large, say, 20 percent, the user must conclude that the problem lies with the model, not bad luck, and undertake corrective action.
The issue is how to decide whether the deviations were because of bad luck or because of model flaws. This can be formulated as an accept or reject decision and is a classic statistical decision problem. This decision of accept or reject must be made at some confidence level. The choice of this level for the test, however, is not related to the quantitative level p selected for VaR.
Model Backtesting with Exceptions – Example
A 97 percent VaR model can be tested at 95 percent confidence.
Interpretation –
The exceptions can be expected to occur 3 percent of the time.
Through the backtesting process, we want to be 95 percent confident that any difference between the observed number of exceptions and the expected number (3% of n) is due to sampling variation only; or, simply stated, we want to be 95 percent confident before we reject the model as incorrect.
A 95 percent VaR model can be tested at 97 percent confidence.
Interpretation –
The exceptions can be expected to occur 5 percent of the time.
Through the backtesting process, we want to be 97 percent confident that any difference between the observed number of exceptions and the expected number (5% of n) is due to sampling variation only; or, simply stated, we want to be 97 percent confident before we reject the model as incorrect.
Model Verification Based on Failure Rates – Example
The simplest method to verify the accuracy of the model is to record the failure rate, which gives the proportion of times VaR is exceeded in a given sample.
Suppose a bank provides a VaR figure at the 1 percent left-tail level (p = 1 – c = 0.01) for a total of T days. The number of exceptions, say N, is the number of days on which the actual loss exceeds the VaR level. Hence the failure rate is N/T. Ideally, the failure rate should give an unbiased measure of p, that is, it should converge to p as the sample size increases (N/T → p as T → ∞).
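A minimal numerical illustration of the failure rate (all values below are hypothetical):

T = 1000           # days in the sample (hypothetical)
N = 14             # days on which the loss exceeded the 99 percent VaR (hypothetical)
p = 1 - 0.99       # tail probability implied by the VaR confidence level

failure_rate = N / T
print(f"failure rate = {failure_rate:.2%} versus expected p = {p:.2%}")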
We want to know whether the model is correct or not at a given confidence level, by assessing if N is too small or too large. So our null hypothesis is
H0 : Model is Correct
This test makes no assumption about the return distribution. The distribution could be normal, or skewed, or with heavy tails, or time-varying. We simply count the number of exceptions. As a result, this approach is fully nonparametric.
The setup for this test is the classic testing framework for a sequence of successes and failures, also called Bernoulli trials. Under the null hypothesis that the model is correctly calibrated, the number of exceptions x follows a binomial probability distribution:

f(x) = C(T, x) × p^x × (1 – p)^(T – x),  where C(T, x) = T! / [x!(T – x)!] is the number of ways of choosing x exception days out of T days.
Jorion performs the test with 1 year of data (i.e. 250 trading days), and so T = 250.
Since the VaR is at the 99 percent confidence level, we know that p = 1 – 0.99 = 0.01.
So, for example, the probability of observing exactly 3 exceptions is f(3) = C(250, 3) × 0.01^3 × 0.99^247 ≈ 21.49 percent, as shown in the first table below.
CASE 1 – WHEN MODEL IS CORRECT (p = 0.01)

Number of Exceptions, x   Probability, f(x)   Cumulative Probability, F(x)
 0                          8.1059%              8.1059%
 1                         20.4693%             28.5752%
 2                         25.7417%             54.3169%
 3                         21.4948%             75.8117%
 4                         13.4071%             89.2188%
 5                          6.6629%             95.8817%
 6                          2.7482%             98.6299%
 7                          0.9676%             99.5975%
 8                          0.2969%             99.8943%
 9                          0.0806%             99.9750%
10                          0.0196%             99.9946%
11                          0.0043%             99.9989%
12                          0.0009%             99.9998%
13                          0.0002%            100.0000%
14                          0.0000%            100.0000%
15                          0.0000%            100.0000%
CASE 2 – WHEN MODEL IS INCORRECT (p = 0.03)

Number of Exceptions, x   Probability, f(x)   Cumulative Probability, F(x)
 0                          0.0493%              0.0493%
 1                          0.3813%              0.4306%
 2                          1.4681%              1.8986%
 3                          3.7534%              5.6520%
 4                          7.1682%             12.8202%
 5                         10.9074%             23.7276%
 6                         13.7749%             37.5025%
 7                         14.8501%             52.3526%
 8                         13.9507%             66.3032%
 9                         11.6016%             77.9048%
10                          8.6474%             86.5521%
11                          5.8351%             92.3873%
12                          3.5943%             95.9816%
13                          2.0352%             98.0168%
14                          1.0655%             99.0823%
15                          0.5185%             99.6008%
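The probabilities in the two tables above follow directly from the binomial formula; a minimal sketch that reproduces them with scipy.stats.binom (using T = 250, as in the example above):

from scipy.stats import binom

T = 250  # one year of trading days

for p in (0.01, 0.03):            # correct model vs. incorrect model
    print(f"\np = {p}")
    for x in range(16):
        f = binom.pmf(x, T, p)    # probability of exactly x exceptions
        F = binom.cdf(x, T, p)    # probability of x or fewer exceptions
        print(f"{x:2d}  {f:9.4%}  {F:9.4%}")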
Model Verification Based on Failure Rates
The expected value of x is E[x] = p×T, and the variance is V(x) = p×(1 – p)×T. When T is large, we can use the central limit theorem and approximate the binomial distribution by the normal distribution:

z = (x – p×T) / √[p×(1 – p)×T]

which is approximately distributed as a standard normal, N(0, 1).
This provides a convenient shortcut. If the decision rule is defined at the two-tailed 95 percent test confidence level, then the cutoff value of |z| is 1.96.
Example – JP Morgan’s Exceptions
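The JP Morgan figures themselves are not reproduced in these notes; the sketch below simply applies the z-test just described to illustrative numbers (the exception count and sample size are hypothetical):

import math

def z_statistic(exceptions, T, p):
    # Normal approximation to the binomial test of the failure rate.
    expected = p * T
    std_dev = math.sqrt(p * (1.0 - p) * T)
    return (exceptions - expected) / std_dev

# Hypothetical example: 9 exceptions of a 99 percent VaR over 250 trading days
z = z_statistic(exceptions=9, T=250, p=0.01)
print(f"z = {z:.2f}, reject at the 95% level: {abs(z) > 1.96}")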
Model Verification Based on Failure Rates – Errors
When designing a verification test, the user faces a trade-off between the two types of errors. The table below summarizes the two states of the world, correct versus incorrect model, and the decision:

Decision            Model is correct     Model is incorrect
Accept the model    Correct decision     Type 2 error
Reject the model    Type 1 error         Correct decision
For backtesting purposes, users of VaR models need to balance type 1 errors against type 2 errors. Ideally, one would want to set a low type 1 error rate and then have a test that creates a very low type 2 error rate, in which case the test is said to be powerful. It should be noted that the choice of the confidence level for the decision rule is not related to the quantitative level p selected for VaR. This confidence level refers to the decision rule to reject the model.
Kupiec (1995) develops approximate 95 percent confidence regions for such a test, defined by the tail points of the log-likelihood ratio:

LR_UC = -2 ln[(1 – p)^(T–N) × p^N] + 2 ln[(1 – N/T)^(T–N) × (N/T)^N]
which is asymptotically (i.e., when T is large) distributed chi-square with one degree of freedom under the null hypothesis that p is the true probability. Thus we would reject the null hypothesis if LR > 3.841.
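A sketch of this computation under the definitions above (N exceptions out of T days, VaR tail probability p; the function name is ours):

import math

def kupiec_lr_uc(N, T, p):
    # Kupiec unconditional-coverage likelihood-ratio statistic.
    phat = N / T
    log_null = (T - N) * math.log(1.0 - p) + N * math.log(p)
    # Under the alternative, terms with a zero count contribute nothing (0 * log 0 = 0).
    log_alt = 0.0
    if N < T:
        log_alt += (T - N) * math.log(1.0 - phat)
    if N > 0:
        log_alt += N * math.log(phat)
    return -2.0 * log_null + 2.0 * log_alt

# Hypothetical example: 8 exceptions of a 99 percent VaR over 250 trading days
lr = kupiec_lr_uc(N=8, T=250, p=0.01)
print(f"LR_UC = {lr:.3f}, reject at 95%: {lr > 3.841}")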
Kupiec’s approximate 95 percent confidence regions are reported in this table
For instance, with 2 years of data (T = 510) and a VaR confidence level of 99 percent (p = 0.01), we would expect to observe N = p×T = 0.01 × 510 ≈ 5 exceptions. But the VaR user will not be able to reject the null hypothesis as long as N is within the [1 < N < 11] confidence interval. Values of N greater than or equal to 11 indicate that the VaR is too low or that the model understates the probability of large losses. Values of N less than or equal to 1 indicate that the VaR model is overly conservative.
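The interval quoted above can be recovered numerically by scanning N and keeping the values for which LR_UC stays below 3.841, reusing the kupiec_lr_uc sketch from the previous snippet:

T, p = 510, 0.01
accepted = [N for N in range(T + 1) if kupiec_lr_uc(N, T, p) < 3.841]
print(f"nonrejection region: {accepted[0]} <= N <= {accepted[-1]}")
# With these inputs the accepted values run from 2 to 10, i.e. 1 < N < 11.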
The previous table also shows that this interval, expressed as a proportion N/T, shrinks as the sample size increases. Select, for instance, the p = 0.05 row.
With more data, we should be able to reject the model more easily if it is false.
The table, however, points to a disturbing fact. For small values of the VaR parameter p, it becomes increasingly difficult to confirm deviations. For instance, the nonrejection region under p = 0.01 and T = 252 is [N < 7]. Therefore, there is no way to tell if N is abnormally small or whether the model systematically overestimates risk. Intuitively, detection of systematic biases becomes increasingly difficult for low values of p because the exceptions in these cases are very rare events. This explains why some banks prefer to choose a VaR confidence level such as c = 95 percent (p = 0.05), in order to be able to observe a sufficient number of deviations to validate the model.
Conditional Coverage
So far, the framework focuses on unconditional coverage because it ignores conditioning, or time variation in the data. The observed exceptions, however, could cluster or “bunch” closely in time, which should also invalidate the model. With a 95 percent VaR confidence level, we would expect to have about 13 exceptions every year. In theory, these occurrences should be evenly spread over time. If, instead, we observed that 10 of these exceptions occurred over the last 2 weeks, this should raise a red flag. The market, for instance, could experience increased volatility that is not captured by VaR. Or traders could have moved into unusual positions or risk “holes”.
Whatever the explanation, a verification system should be designed to measure proper conditional coverage, that is, conditional on current conditions. Management then can take the appropriate action.
Such a test has been developed by Christoffersen (1998), who extends the LR_UC statistic to specify that the deviations must be serially independent. The serial independence of deviations is tested with a separate statistic, LR_IND, and the combined test statistic for conditional coverage then is

LR_CC = LR_UC + LR_IND
Each component is asymptotically distributed as a chi-square variable with 1 degree of freedom [i.e., χ^2(1)]. The sum is distributed as a chi-square variable with 2 degrees of freedom [i.e., χ^2(2)]. Thus we would reject at the 95 percent test confidence level if LR_CC > 5.991. We would reject independence alone if LR_IND > 3.841.
If exceptions do seem to cluster abnormally, the risk manager may want to explore models that allow for time variation in risk.
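A sketch of the independence component, following Christoffersen's first-order Markov-chain construction (the function name is ours, and the code assumes the 0/1 hit series contains both at least one exception and at least one non-exception before its last observation):

import math

def christoffersen_lr_ind(hits):
    # hits: sequence of 0/1 indicators, one per day (1 = VaR exceeded).
    # Count transitions trans[i][j]: state i on day t-1 followed by state j on day t.
    trans = [[0, 0], [0, 0]]
    for prev, curr in zip(hits[:-1], hits[1:]):
        trans[prev][curr] += 1

    pi0 = trans[0][1] / (trans[0][0] + trans[0][1])          # P(exception | no exception yesterday)
    pi1 = trans[1][1] / (trans[1][0] + trans[1][1])          # P(exception | exception yesterday)
    pi = (trans[0][1] + trans[1][1]) / sum(map(sum, trans))  # unconditional exception probability

    def loglik(p01, p11):
        ll = 0.0
        for count, prob in [(trans[0][0], 1 - p01), (trans[0][1], p01),
                            (trans[1][0], 1 - p11), (trans[1][1], p11)]:
            if count > 0:
                ll += count * math.log(prob)
        return ll

    # LR_IND compares independence (same exception probability after a hit or a miss)
    # against the first-order Markov alternative.
    return 2.0 * (loglik(pi0, pi1) - loglik(pi, pi))

# Hypothetical daily hit series; combined with the Kupiec sketch above,
# LR_CC = LR_UC + LR_IND is compared with 5.991.
hits = [0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0]
print(f"LR_IND = {christoffersen_lr_ind(hits):.3f}")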
Basel Committee Rules For Backtesting
The Basel (1996a) rules for backtesting the internal models approach are derived directly from the failure rate test. To design such a test, one has to choose first the type 1 error rate, which is the probability of rejecting the model when it is correct. When this happens, the bank simply suffers bad luck and should not be penalized unduly. Hence one should pick a test with a low type 1 error rate, say, 5 percent (depending on its cost). The heart of the conflict is that, inevitably, the supervisor will also commit type 2 errors for a bank that willfully cheats on its VaR reporting.
The current verification procedure consists of recording daily exceptions of the 99 percent VaR over the last year. One would expect, on average, 1 percent of 250, or 2.5 instances of exceptions over the last year.
The Basel Committee has decided that up to four exceptions are acceptable, which defines a “green light” zone for the bank. If the number of exceptions is five or more, the bank falls into a “yellow” or “red” zone and incurs a progressive penalty whereby the multiplicative factor k is increased from 3 to 4, as described in Table 3-3. An incursion into the “red” zone generates an automatic penalty.
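A minimal sketch of the traffic-light classification (the red-zone boundary of 10 exceptions is the standard Basel value, which the text above does not state explicitly):

def basel_zone(exceptions):
    # Classify one year of 99 percent VaR backtesting results.
    if exceptions <= 4:
        return "green"    # no penalty
    if exceptions <= 9:
        return "yellow"   # penalty at the supervisor's discretion, k raised toward 4
    return "red"          # automatic penalty, k = 4

for n in (3, 6, 12):
    print(n, basel_zone(n))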
Within the “yellow” zone, the penalty is up to the supervisor, depending on the reason for the exception. The Basel Committee uses the following categories of causes for exceptions, and the subsequent application of penalties:
Basic integrity of the model – Exceptions occurred because of incorrect data or errors in the model programming. The penalty “should” apply.
Model accuracy needs to be improved – Exceptions occurred because the model does not describe risks with enough precision. The penalty “should” apply.
Intraday trading – The exceptions occurred because positions changed during intra-day trading. The penalty should be “considered”, but not necessarily applied.
Bad luck – The exceptions occurred because markets were unusually volatile or correlations changed unexpectedly. Such exceptions should be expected to occur at least some of the time. No penalty guidance is provided.