Learning Objectives
Explain best practices for the collection of operational loss data and reporting of operational loss incidents, including regulatory expectations.
Explain operational risk-assessment processes and tools, including risk control self-assessments (RCSAs), likelihood assessment scales, and heatmaps.
Describe the differences among key risk indicators (KRIs), key performance indicators (KPIs), and key control indicators (KCIs).
Describe and distinguish between the different quantitative approaches and models used to analyze operational risk.
Estimate operational risk exposures based on the fault tree model given probability assumptions.
Describe approaches used to determine the level of operational risk capital for economic capital purposes, including their application and limitations.
Describe and explain the steps to ensure a strong level of operational resilience, and to test the operational resilience of important business services.
Operational risk refers to potential loss from internal/external factors. Understanding past losses is crucial for risk management. Many firms use incident data collection as the foundation of their ORM framework. ING’s ORM framework, created in 2000, consisted of four concentric circles: loss database, RCSAs, KRIs, and lessons learned. This model simplified non-financial risk management into four key activities.
Regulators focus on incident data collection and quality, especially after the Basel SA reform in Dec 2017. Loss data has a vital role in ORM beyond regulatory capital requirements. Incident data helps identify control weaknesses, prevent further failures, and improve business performance. Internal data informs scenario identification and capital reserves. External data offers insights into peer firms’ risk exposure. Incident data is also important for regulatory Pillar 2 capital as it enhances operational risk management and reduces regulatory capital add-ons. High-quality data collection and analysis are crucial for this.
Data Quality And Availability
A. Regulatory Requirements Regarding Operational Risk Data Collection
The BCBS has eight criteria for data quality and collection processes under the SA. Key requirements include
a 10-year data history,
a minimum collection threshold of €20,000,
mapping to Basel event-type categories,
reporting of occurrence and recovery dates, and
independent review of data accuracy.
Other requirements pertain to governance and exclusion of certain types of loss events.
B. Incident Data Collection Process
The Basel Committee recommends that “a bank must have documented procedures and processes for the identification, collection and treatment of internal loss data”.
Guidelines for operational loss data reporting are crucial to ensure data quality. Operational risk events differ from market and credit risk events as they can be more difficult to fully characterize and quantify, and their impacts are open to interpretation. For example, a bug in a banking app could lead to reputational damage, management attention, and IT resource use. Identifying and quantifying such impacts is less straightforward than recording credit or market losses. Clear and careful reporting processes are necessary, even for less complex incidents.
Firms use a core set of data fields to report operational incidents and should keep the number of fields to a minimum to avoid overload. This table presents an example of a set of fields used for operational incident data collection. Incident records must follow the taxonomy of risks, causes, impacts, and controls. Drop-down menus are essential for standardization, while free-text can be confusing and should be avoided in key fields for loss data analysis. Large banks are using NLP or machine-learning technology to extract useful information from existing databases that use large amounts of free-text.
Unique Incident ID
Place of occurrence (business unit/division)
Event type (level 1, level 2)
Event title & description (as standardized as possible)
Cause type (level 1, level 2)
Controls that failed
Dates of occurrence/discovery/reporting/settlement
Expected direct financial loss (may evolve until closure)
Impact type: loss/gain/near miss
Indirect effects (per type): often on a scale
Recovery (insurance & other recoveries)
Net loss (gross loss minus recovery)
Action plans (when appropriate): measures, owner, time schedule
Link with other incidents (if any) – for grouped losses and reporting
Other comments if necessary
C. Comprehensive Data
According to BCBS, a bank’s internal loss data must be comprehensive and capture all material activities and exposures from all appropriate subsystems and geographic locations.
The Basel Committee sets the minimum threshold for loss data collection at €20,000, but many firms apply lower reporting thresholds, some even reporting every loss down to zero. Zero-threshold reporting is, however, fading away in large institutions because of the effort it requires. Reporting thresholds are commonly set at 1,000 or 5,000 ($, £, or €); other banks and insurance companies set them at 10,000, and more rarely at 20,000. The choice of reporting threshold should neither undermine the credibility of the process nor impair management information, and the definition of what counts as a “loss” determines the size and materiality of recorded incidents.
Regulators require reporting only of operational incidents resulting in financial losses. However, organizations also collect information on other types of operational events, including unintentional gains, incidents without direct financial impacts, and near misses. From a management perspective, it is good practice to also record non-financial impacts, such as reputational damage, customer detriment, disruption of service, or management time and attention. The term “non-financial” impact is misleading because the indirect consequences of many operational risk events have real financial implications, such as regulatory scrutiny, customers’ dissatisfaction, and costly remediation plans. Underestimating these costs can lead to underestimation of operational risk and poor operational performance.
Direct losses are those that result directly from the operational risk event itself, such as remediation costs, compensation to affected parties, regulatory fines, and wrongful transaction losses. Indirect losses, on the other hand, result from the further consequences of the event, such as customer attrition, low employee morale, increased compliance costs, and higher insurance premiums. While direct losses are typically well captured, indirect losses are often assessed using an impact rating and are only monetized by more mature ORM frameworks to inform proper decision-making.
Grouped losses are distinct operational risk events that share a common cause. Examples include the same wrong advice given to a group of clients or an IT failure affecting different departments. Grouping these losses is a regulatory requirement, a sound practice for modeling operational loss data, and necessary for producing accurate management information.
D. Dates of Incidents and Settlement Lags
The Basel guidelines require banks to report gross loss amounts and reference dates for each operational loss event, as well as any recoveries, and information about the drivers and causes of the event.
Each operational incident has four important dates –
Date of occurrence – when the event first happened
Date of discovery – when it is first identified
Date of reporting – when it enters the reporting database
Date of accounting – when the financial impact enters the general ledger
In extreme cases, operational risk events can take years to be discovered, especially in cases of small, internal frauds or data leakages due to hackers or manipulations of suspense accounts. The time to discovery is the main factor explaining the years-long gap between occurrence and settlement date, which also worsens the impact.
Regulators do not require a specific date type for internal reporting purposes, but the gap between occurrence and discovery provides information on managing operational risks. The difference between discovery and reporting shows how diligently incidents are reported to the risk function. Organizations have a maximum allowed time for reporting incidents in their policy. Material incidents should be reported within a few working days, while minor incidents can be included in periodic summary reporting.
Operational incidents can have effects that take years to settle. For example, the losses greater than $100 million incurred by 31 global systemically important banks (G-SIBs) before the Global Financial Crisis commonly took three to five years to settle from the reporting date. This time gap between the occurrence and settlement of operational risk events raises questions about the relevance of older losses for future modeling and management lessons.
E. Boundary Event Reporting
Boundary events are events that occur in a different risk class than their cause, such as a credit loss caused by incorrect collateral recording or a market loss due to an error in booking a position. These events are recorded where they materialize, even if they are caused by another risk class. Credit loss models for regulatory capital are based on historical losses, so past losses caused by operational risk are covered by regulatory capital in the credit risk class. Boundary events are not included in market risk models and need to be recorded as operational losses.
To classify events properly, the Basel Committee provides useful guidance on boundary events as follows.
Operational loss events related to credit risk and that are accounted for in credit risk RWAs (Risk Weighted Assets) should be excluded from the loss dataset. Operational loss events that relate to credit risk but are not accounted for in credit risk RWAs should be included.
Operational risk losses related to market risk are treated as operational risk for the purposes of calculating minimum regulatory capital under this framework and will therefore be subject to the standardized approach for operational risk.
F. Data Quality Requirements
BCBS states – Banks must have processes to independently review the comprehensiveness and accuracy of loss data.
The quality of data collection is crucial for assessing the extent of losses. Banks need to ensure that all significant losses are recorded, which can be achieved by reconciling the operational loss data with other sources such as the general ledger (GL). The GL captures all cash inflows and outflows, including unintentional gains and losses, making it a useful tool to benchmark the comprehensiveness of the loss database. However, the GL cannot replace a dedicated operational event database as it only records direct financial losses and not indirect effects. Additionally, if a loss event results in reduced revenue instead of a net loss, it may not be captured in the GL.
IT logs are a common data source to collect operational risk incidents and compare with existing incident data. IT issues rated as Priority 1 or 2 can be considered proxies for operational risk incidents. Other data sources such as provisions for pending lawsuits, customer complaints, compensations, and negative press reviews can also serve as proxies or help assess the comprehensiveness of operational incident data. Departments may maintain their own logs of incidents that can also contribute to the collection of operational incident data. Comparing the operational risk database with other existing sources used as proxies is a good way to identify any underreporting.
Despite detailed regulatory requirements, operational risk incidents are often underreported by financial firms, especially smaller ones. This can be due to a lack of guidance, resistance to escalating errors, or cumbersome reporting processes. Firms incentivize reporting through measures ranging from soft encouragement to stringent oversight, with self-reporting required and the threat of disciplinary action for unreported events. Risk metrics are often included in managers’ performance evaluations and assessed by internal audit departments to ensure robust data collection processes. Regulators encourage these practices.
G. Data Features of Operational Risk
• Operational risk data are quite different from data for financial risks –
They are specific to each firm and mostly unrelated to the behavior of financial markets.
Their frequency and impact distributions can have very wide tails (extreme events).
They are scarce.
They can remain hidden for a long time.
Their interpretation can be complex depending on the causes and downstream consequences.
The unique nature of operational risk data compared to credit and market risks has profound consequences for the measurement and modeling of operational risk.
Risk And Control Self Assessment (RCSA)
A. RCSA Process
RCSA involves assessing the likelihood and impact of operational risk for a business or division, including inherent risks and residual risks after evaluating controls. RCSAs are usually conducted annually, with some institutions moving to quarterly assessments due to increased volatility in some operational risks. Some organizations establish a dedicated risk assessment unit (RAU) within a business line, division, or process to conduct RCSAs.
RCSAs involve self-assessment workshops or questionnaires using risk software applications to evaluate inherent risks, key controls, and their effectiveness. Preventive controls decrease risk likelihood, while corrective controls decrease risk impact. RCSAs result in an understanding of inherent risks, control effectiveness, and residual risks.
RCSAs are largely judgement-based and qualitative, but mature firms require evidence of control testing before decreasing inherent risks to acceptable levels. Back-testing risk assessments against past incidents is common practice, revealing a tendency to underestimate likelihood and overestimate impact. While the cost-benefit ratio is questioned, RCSAs are widely used and frequently required by European regulators as a central method of operational risk assessment for firms.
Variations of RCSA include risk and control assessment (RCA) and the residual risk self-assessment (RRSA). RCA requires documentation of control testing, examples of control effectiveness, or comparison of loss experience. RRSA does not assess inherent risks, focusing instead on the risk level after considering controls in place.
B. Impact and Likelihood Ratings and Assessments
RCSA is a qualitative and accessible exercise, but subjectivity, biases, and limited data can hamper the reliability of output. Standardized descriptions of risks and precise definitions of ratings are necessary to ensure comparability across different assessment units and departments. The purpose of RCSA is not to produce a precise measure of risk exposures but to raise awareness of risks and controls, evaluate residual exposures, and prioritize risk-management action plans. Without comparable assessments, there is no proper prioritization of risk management actions.
C. Severity Assessment: Impact Scales
The impact scales used in RCSAs commonly include four types: financial, regulatory, customer, and reputation. Another impact scale that has gained importance is the impact on continuity of service. The scales typically range from “insignificant” or “low” to “catastrophic” or “extreme”. To ensure adaptability across departments of various sizes within an organization, impacts are often expressed as percentages of revenues or of the customer base rather than as monetary amounts: a $100,000 financial impact can be significant for a medium-sized department but minor for a larger business unit, whereas percentage-based definitions are scalable to all sizes and allow better comparability between business units. The following table presents an example of such impact scales.
While the definitions of impacts, which may include regulatory, compliance, or reputational damage, are typically qualitative, additional metrics such as the percentage of customers affected, number of negative articles, and adverse publicity can help to provide further clarity on the severity of the impact.
Rating: Extreme
Financial: >20% of operating income
Service delivery: Critical disruption of service, resulting in major impacts to internal or external stakeholders
Customer and reputation: Significant reputational impact, possibly long-lasting, affecting the organization’s reputation and trust toward several groups of key stakeholders
Regulatory: Significant compliance breach, resulting in large fines and increased regulatory scrutiny

Rating: Major
Financial: >5–20% of operating income
Service delivery: Significant service disruption affecting key stakeholders, requiring the crisis management plan to be activated
Regulatory: Compliance breach resulting in regulatory fines, leading to lasting remediation programs with reputation damage

Rating: Moderate
Financial: >0.5–5% of operating income
Service delivery: Noticeable service disruption with minimal consequences for stakeholders, service recovery on or under the RTO (recovery time objective)
Customer and reputation: Minimal reputation impact affecting only a limited number of external stakeholders. Temporary impact if mitigated promptly
Regulatory: Some breaches or delays in regulatory compliance, needing immediate remediation but without a lasting impact

Rating: Low
Financial: <0.5% of operating income
Service delivery: No service disruption affecting external stakeholders
Customer and reputation: No external stakeholder impact
Regulatory: Minor administrative compliance breach, not affecting the organization’s reputation from a regulatory perspective
Example of impact scale definitions and ratings
D. Likelihood Assessment Scales
Likelihood scales are expressed as percentages or frequency of occurrence. When discussing a 1-in-10-year event, risk managers mean a 10% chance of occurring within the next year, not over a 10-year horizon. RCSA exercises typically have a time horizon of one year, sometimes shorter, which is important for rapidly evolving risks such as cyberattacks, technological changes, and regulatory sanctions. Historical data is used to evaluate the likelihood of operational risks, but the past may not be a good predictor of the future in unstable environments. This table presents an example of likelihood scales commonly used in the financial industry when qualifying operational risks.
Likely: once per year or more frequently (probability > 50% over a one-year horizon)
Possible: 1 in 1–5 years (probability 20–50%)
Unlikely: 1 in 5–20 years (probability 5–20%)
Remote: less than 1 in 20 years (probability < 5%)
Example of likelihood scale
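As a worked illustration of the scale above (using the reading given earlier, where a 1-in-\(N\)-year event is interpreted as an annual probability): \(P(\text{event in one year}) \approx 1/N\), so a 1-in-10-year event carries roughly a 10% one-year probability. Under a Poisson assumption with intensity \(\lambda = 1/N\) per year, the probability of at least one occurrence in a year is \(1 - e^{-\lambda}\); for \(N = 10\) this gives \(1 - e^{-0.1} \approx 9.5\%\), consistent with the simpler \(1/N\) reading.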
E. Combining Likelihood and Impact: The Heatmap
The RCSA matrix or heatmap combines the dimensions of risk quantification: likelihood and impact. The various combinations of impact and likelihood are assigned colors to represent the intensity of the risk, typically red-amber-green or red-amber-yellow-green. Heatmaps color-code risk levels based on the organization’s risk appetite to specific combinations of likelihood and impact, with the following meanings –
Green (or equivalent) – The current risk exposure level is within the risk appetite, and no actions are required other than routine monitoring to ensure that activities and controls are carried out as planned.
Yellow – The current risk exposure level is within appetite, but it is approaching excess levels. Active monitoring is underway and further mitigation may be required.
Amber – The current risk exposure exceeds acceptable levels, and it must be the focus of an action plan to reduce residual risk exposure in likelihood or impact, or else management must escalate the risk and accept it.
Red (or equivalent) – There needs to be an immediate risk mitigation action plan because the current risk exposure is significantly higher than the risk appetite.
The colors in the matrix and the risk appetite they express are determined by the definitions of impact and likelihood on the axes. However, these colors and qualitative ratings from RCSA are non-linear and non-continuous measurements of risks. Multiplying probability and impact ratings to reduce risks to a single numerical quantity is a common mistake that can lead to inaccurate and misleading results. For instance, a frequent, low-impact risk (likelihood 4, impact 1) is not equivalent to a rare but extreme-impact event (likelihood 1, impact 4), even though both products equal 4.
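A minimal sketch, assuming an illustrative four-by-four scale and colour assignments (these are not prescribed values), of how a heatmap can be implemented as an explicit lookup rather than a product of ratings:

```python
# Illustrative only: a heatmap is a lookup of (likelihood, impact) cells set by
# risk appetite, not a product of the two ratings.

LIKELIHOOD = ["Remote", "Unlikely", "Possible", "Likely"]   # ratings 1..4
IMPACT = ["Low", "Moderate", "Major", "Extreme"]            # ratings 1..4

# Rows = likelihood (1..4), columns = impact (1..4); colour choices are an
# example of one possible risk appetite, not prescribed values.
HEATMAP = [
    ["Green",  "Green",  "Amber", "Red"],   # Remote
    ["Green",  "Yellow", "Amber", "Red"],   # Unlikely
    ["Yellow", "Amber",  "Red",   "Red"],   # Possible
    ["Yellow", "Amber",  "Red",   "Red"],   # Likely
]

def rating(likelihood: int, impact: int) -> str:
    """Return the heatmap colour for 1-based likelihood and impact ratings."""
    return HEATMAP[likelihood - 1][impact - 1]

# Both combinations below have a "score" of 4 if ratings were multiplied,
# yet they map to different colours in the lookup.
print(rating(4, 1))   # frequent, low-impact risk  -> "Yellow" here
print(rating(1, 4))   # rare, extreme-impact event -> "Red" here
```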
KRIs, KPIs and KCIs
Key Risk Indicators (KRIs)
KRIs are monitoring metrics that indicate changes in the level of a risk in terms of potential impact or likelihood. Operational risk KRIs have been a focus of attention due to the business world’s emphasis on predictability and control. Decision-makers in both operations and senior management seek a set of metrics to anticipate operational risk materialization or intensity.
In operational risk, a KRI is a metric that informs the level of exposure to a risk at a specific time. Preventive KRIs indicate changes in the intensity or presence of a risk cause, signaling an increase in likelihood (KRI of likelihood) or impact (KRI of impact) if the risk materializes.
Examples of KRIs used in measuring likelihood include the following –
Increase in number of transactions per staff member (risk of errors)
Drop in customer satisfaction scores (risk of client attrition)
Increase in the level of sales required for sales staff to achieve a performance objective (risk of fraud)
Examples of KRIs measuring impact include the following –
Increase in the level of responsibility or proprietary company knowledge held by a key employee (higher impact of discontinuity/loss of knowledge in case of departure/sickness)
Increase in sensitivity of data held on given server (higher impact in case of data leakage/loss)
Increase in value generated by top-10 clients (higher impact in case of client attrition)
Key Performance Indicators (KPIs)
KPIs evaluate the effectiveness of a business entity’s operations by measuring its performance or progress towards meeting its objectives. KPIs and KRIs have significant overlap.
Examples of KPIs include:
Maximum downtime/uptime of IT systems
Error rates on retail transactions
Client satisfaction score/Net promoter score/Level of client complaints
Percentage of services performed within service level agreements (SLAs) levels
Key Control Indicators (KCIs)
KCIs measure the effectiveness of controls, either in design or performance. Examples of KCIs include the following –
Missed due diligence items
Lack of segregation of duties
Overlooked errors in four-eye checks (reviews by two competent individuals)
Incomplete identification files
KPIs, KRIs, and KCIs often overlap, sharing elements of performance, risk, and control indicators. Poor performance can become a source of risk, and control failures are clear preventive KRIs. In fact, control function failures are jointly a KPI, KRI, and KCI. For example, overdue confirmations of financial transactions can indicate poor back-office performance, increased risk of legal disputes, processing errors, or rogue trading, and signal control failures in transaction processing.
KRI thresholds and governance show an organization’s risk appetite and metrics used to monitor its objectives and risks. Intervention thresholds indicate management’s strictness in controlling and mitigating risks. Limits on key-person exposure, vacancies, system capacity, and backups reflect the organization’s risk tolerance. A strong KRI program indicates a mature risk management function, providing useful information on the organization’s risk management and control level.
Data analytics and machine learning advancements enhance preventive KRIs selection and design. Analyzing data trends, patterns, and outliers can differentiate true risk signals from normal business volatility, particularly in operational risk areas with sufficient data points. Examples include credit card fraud, systems operations, and financial market trading. Text analytics, such as NLP, can extract keywords from risk event descriptions to classify and analyze them.
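As an illustration of the text-analytics point above, a minimal sketch (assuming scikit-learn is available; the sample incident descriptions are invented) that surfaces distinguishing keywords from free-text event descriptions with TF-IDF:

```python
# Illustrative only: surface distinguishing keywords from free-text incident
# descriptions so they can be mapped to a risk taxonomy or grouped for analysis.
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "Payment sent twice to a supplier due to manual input error",
    "Phishing e-mail led to compromised employee credentials",
    "System outage delayed client settlement instructions",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(descriptions)     # sparse document-term matrix
terms = vectorizer.get_feature_names_out()

# The top-weighted terms of each incident act as candidate classification keywords.
for i, description in enumerate(descriptions):
    weights = tfidf[i].toarray().ravel()
    top_terms = sorted(zip(terms, weights), key=lambda t: t[1], reverse=True)[:3]
    print(description, "->", [term for term, _ in top_terms])
```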
Quantitative Risk Assessment: Factor Models
Quantitative approaches for assessing operational risk analyze the drivers of likelihood and impact of potential incidents. The techniques covered in this section belong to the causal analysis family of operational risk assessment and have become increasingly important for measuring operational risk and capital since the 2010s. These techniques include fault tree analysis (FTA), factor analysis of information risk (FAIR), Swiss Cheese models, and bowtie tools, which are all factor models that focus on control layers and effectiveness. Although the results of these models depend on the identified factors and their probabilities, breaking down risk estimates into their components enhances the robustness and transparency of the process.
Causal analysis is used to quantify operational risks that have the potential to significantly impact a firm, even though they are rare events. This approach focuses on forward-looking determinants of impact and likelihood, rather than past losses. Quantitative risk assessment, particularly for scenarios, is challenging and requires a structured and data-driven approach to avoid exaggerations and distortions. Regulators recognize the difficulty of assessing rare risks and call for a repeatable process that reduces subjectivity and biases. Firms are encouraged to use empirical evidence, provide clear rationale for their assumptions, and maintain high-quality documentation.
Fault Tree Analysis (FTA)
Fault tree analysis is a deductive failure analysis technique developed in the 1960s primarily used in high-risk industries such as nuclear energy, pharmaceuticals, and aeronautics. It decomposes failure scenarios into various external and internal conditions, focusing on safety failures and control breakdowns. This method is gradually being adopted in the financial industry to assess scenarios that could lead to significant negative impacts such as cyber attacks, system disruptions, or compliance breaches. FTA utilizes a series of “AND” or “OR” conditions that must occur for a disaster to happen.
EXAMPLE OF FTA: ACCIDENTAL DATA LEAKAGE BY AN INSIDER
1. There is a targeted cyberattack on the firm via a phishing e-mail. (P1)
2. The firewalls fail. (P2)
3. The phishing e-mail is opened by an employee. (P3)
4. The detective controls of suspicious network activity fail per unit of time. (P4)
5. There is an exit of the leaked information per unit of time. (P5)
The joint probability of independent events is equal to the product of their individual probabilities. The box above presents an example of FTA for accidental data leakage at a financial firm, with five conditions that all need to be present for the scenario to materialize. Probabilities for each condition (P1 to P5) need to be estimated. P1 is assumed to be close to 100% because financial firms are highly attractive targets for cyber attackers, while P3 depends on employee awareness and vigilance (typically 5% to 50% of employees click on malicious links). The control failure probabilities (P2, P4, P5) depend on the sophistication of the firm’s cyber defenses. Assuming the conditions are independent, the probability of the risk is the product of the individual probabilities: P1 × P2 × P3 × P4 × P5. The probability of failure per employee is crucial. For example, if P1 = 1, P2 = P4 = P5 = 0.02, and P3 = 0.15, the joint probability of the risk is 1 × 0.02 × 0.15 × 0.02 × 0.02 = 0.0000012 per employee. This appears quite small, but for a bank with 100,000 employees it becomes 0.0000012 × 100,000 = 0.12, or 12%. This model assumes an equal failure rate for all employees but can be made more complex.
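A minimal sketch of the calculation above, using the same assumed values for P1 through P5 and the 100,000-employee exposure from the example:

```python
# Fault tree sketch: all five conditions must hold (an AND gate), and they are
# assumed independent, so their probabilities multiply.
p1 = 1.00    # targeted phishing attack occurs (assumed near-certain)
p2 = 0.02    # firewalls fail
p3 = 0.15    # an employee opens the phishing e-mail
p4 = 0.02    # detective controls of suspicious network activity fail
p5 = 0.02    # leaked information exits the network

per_employee = p1 * p2 * p3 * p4 * p5
print(per_employee)                    # 1.2e-06 per employee

# Scale by the exposure unit (number of employees), as in the example.
employees = 100_000
print(per_employee * employees)        # 0.12, i.e. roughly a 12% chance

# Strictly, the probability of at least one occurrence is 1 - (1 - p)^n,
# about 11.3% here; the simple product above is the upper-bound
# approximation used in the example.
print(1 - (1 - per_employee) ** employees)
```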
When assessing scenarios, the probability per unit of exposure must be multiplied by the number of units exposed. For example, a 1/1000 error rate in manual financial transactions can lead to thousands of errors when millions of payments are processed annually. Neglecting exposure can result in underestimating the likelihood of a scenario.
In reality, failures are often not completely independent, as controls are usually designed together. Therefore, conditional probabilities (or Bayesian probabilities) are necessary to obtain a more realistic sense of the probability of occurrence. Probabilities are updated as new information or events arrive. For instance, if there was an IT service disruption, the probability of a client-processing error would increase. In the case of this figure, it would be the probability of the detection mechanism failing, given the failure of the firewalls.
Bayesian models in operational risk refer to measurement methods where likelihood assessments are updated by new information or new events. Conditional probabilities are typically used to update expert opinions based on past data in risk and scenario assessment, or to calculate conditional probabilities in scenarios such as described in this figure.
As discussed in Part 1, the Bayesian approach is the process of updating prior probability beliefs in light of new information to form new estimates.
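A worked illustration with assumed numbers: if the unconditional failure probability of the detective control is \(P(P_4) = 0.02\) but its failure probability given a firewall failure is assessed at \(P(P_4 \mid P_2) = 0.10\) (because the two controls share infrastructure or maintenance), then the joint term \(P(P_2)\,P(P_4 \mid P_2) = 0.02 \times 0.10 = 0.002\) is five times larger than the independent estimate \(0.02 \times 0.02 = 0.0004\), and the scenario probability computed in the fault tree rises accordingly.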
Factor Analysis Of Information Risk (FAIR)
The FAIR model is a factor model for quantifying operational risk used in finance. It decomposes risk into factors, following these steps –
identify risk factors and their interrelationships,
measure each factor, and
combine the factors computationally.
This generates a distribution of losses for a given scenario.
The FAIR methodology requires a scenario to include the following –
an asset at risk (the asset of the scenario),
a threat community (who or what the threat is),
a threat type (the nature of the threat), and
an effect (the losses resulting from the materialization of the risk).
Business experts then estimate the frequency and magnitude of the loss, which are expressed as distributions, not single points. These distributions are used as inputs for Monte Carlo simulations to generate the distribution of simulated losses for the scenario.
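A minimal sketch of the last two FAIR steps, assuming illustrative distribution choices (a triangular frequency range and a lognormal magnitude) that stand in for expert-calibrated inputs:

```python
# Illustrative FAIR-style simulation: expert estimates of loss event frequency
# and loss magnitude are expressed as distributions and combined by Monte Carlo.
import numpy as np

rng = np.random.default_rng(42)
n_sims = 100_000

# Loss event frequency per year, expressed here as a triangular range
# (min, most likely, max) -- an assumption standing in for expert calibration.
frequency = rng.triangular(0, 2, 6, size=n_sims).round().astype(int)

# Loss magnitude per event: lognormal with assumed parameters.
def annual_loss(n_events: int) -> float:
    return rng.lognormal(mean=11.0, sigma=1.2, size=n_events).sum()

losses = np.array([annual_loss(n) for n in frequency])

print("mean annual loss for the scenario:", losses.mean())
print("95th percentile:", np.percentile(losses, 95))
```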
Swiss Cheese Model – Control Layering
The “Swiss Cheese” model, also known as the “cumulative act model”, was created by James Reason, a psychologist who specializes in studying human errors in high-risk industries like healthcare. Reason explains that defensive layers are like slices of Swiss cheese with holes that continually shift. Holes in a single “slice” usually don’t lead to a bad outcome. Only when holes in multiple layers line up, can accidents happen. This is known as the “trajectory of accident opportunity”, bringing hazards into contact with victims.
Effective control layering is needed to prevent the alignment of “holes”, meaning that control failure rates must be independent, and controls should compensate for each other’s weaknesses to create a safety net instead of a chain reaction. Risk managers should prioritize assessing the independence of controls in a system, as much as they prioritize assessing the reliability of each individual control.
Root-Cause Analysis And Bow-Tie Tool
Performing consistent and systematic root-cause analysis of significant operational risk events and near misses is a best practice recommended by regulators. The first line should investigate incidents or near misses that lead or could lead to operational impacts above the materiality threshold, with support or challenge from the second line. By comparing the results of previous investigations, risk managers may notice patterns that reveal common weaknesses in the organization, enabling action plans that address the causes of multiple operational risk events at once.
The bow-tie, commonly used in industries like oil and gas, is now popular in finance as well. Like the “5-why” analysis, it identifies multiple levels of causes for operational failures. The diagram features the risk or event at the center, with causes on the left and impacts on the right, forming a bow-tie shape. Preventive barriers sit in front of each cause on the left, while detective and corrective barriers sit on the impact side to the right.
Bowtie analysis is valuable for assessing risk likelihood and impact, as well as for root-cause analysis of incidents and near misses. It can help quantify plausible impacts and expected frequencies, identify indirect and root causes of risks and incidents, and identify key risk indicators and monitoring metrics.
Root-cause analysis provides insights into the causes of risks and potential control failures, helping assess the probability of events, similar to an FTA. Bowtie analysis goes further by exploring controls in place, incident impact, and direct/indirect impacts, providing a structured way to assess risks and mitigating factors. It can be used as a “5-why” analysis, investigating the reasons for a risk. It can also be applied in the opposite direction as a “5-then-what,” investigating the consequences. It can be used to assess the speed of incident detection and the organization’s response in mitigating its impacts. This is especially relevant in case of volatile and unpredictable events such as cyberattacks, pandemics, and extreme weather.
Operational Risk Modeling And Capital
This section explores the main methodologies utilized by Tier-1 banks and insurance companies for modeling operational risk to calculate their economic capital. These models belong to the stochastic family and use statistical methods. They fit statistical distributions to the frequency and severity of past losses and simulate the distribution of losses for the organization. The economic capital is then determined from the 99.9th percentile of the estimated loss distribution.
The most commonly used technique in operational risk modeling is the loss distribution approach (LDA), which is often supplemented by extreme value theory or scenario data for tail risk estimates.
Loss Distribution Approach (LDA)
The LDA decomposes a loss data distribution into frequency and severity components, doubling the data points available for modeling. This helps modeling activities, which require many observations, and was especially valuable when loss data was scarce in the early days of operational risk modeling.
The LDA method estimates frequency and severity distributions separately and then combines them into an aggregated loss distribution, as shown in this figure. Monte Carlo is the most popular method used to combine the distributions, which involves generating an aggregated distribution by drawing random samples of severity and frequency, typically a million or more. Other equation-based methods, such as Fast Fourier transform and Panjer recursions, require more mathematical skills and coding but less computing time.
Frequency Modeling
The frequency distribution is discrete. Operational risk events are counted per year and modeled with a Poisson distribution, which has a single parameter representing both the mean and variance. According to a 2009 study, 90% of firms using the AMA modeled frequency with a Poisson distribution, while negative binomial distributions were used in the remaining 10%.
Poisson distribution has the following attributes –
It gives the probability of an event occurring in a fixed time interval.
It has only one parameter, \(\lambda\), which is equal to both the mean and variance of the distribution (\(\lambda\) is sometimes called the intensity rate, intensity factor, or hazard rate)
The probability mass function is given by
P(X = k) = \(\frac{\lambda^k e^{-\lambda}}{k!}\)
Observations are assumed to be i.i.d. (independent and identically distributed).
Severity Modeling
Severity distributions in operational risk are continuous, asymmetric, and heavy-tailed to account for many small events and a few very large losses. The lognormal distribution is the most common method to model severity, but heavy-tailed distributions such as Weibull and GPD have become more popular over time.
The lognormal distribution has the following attributes, which have already been discussed in FRM Part 1 –
It is a continuous probability distribution of a random variable whose (natural) logarithm is normally distributed, i.e., if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution with parameters \(\mu\) and \(\sigma^2\).
A random variable that is log-normally distributed takes only positive real values. The probability density function is given by
\(f(x; \mu, \sigma) = \frac{1}{x \sigma \sqrt{2 \pi}} e^{-\frac{(\ln x - \mu)^2}{2 \sigma^2}}, \quad x > 0\)
The mode is located at \(e^{\mu - \sigma^2}\)
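A minimal sketch of the Monte Carlo aggregation described earlier, combining an assumed Poisson frequency with an assumed lognormal severity (in practice both distributions are fitted to the calculation dataset for each unit of measure):

```python
# Illustrative LDA aggregation: draw an annual event count from the frequency
# distribution, draw that many severities, sum them, and repeat over many years.
import numpy as np

rng = np.random.default_rng(0)
n_years = 100_000            # simulated years (a million or more in practice)

lam = 25.0                   # Poisson intensity: average events per year (assumed)
mu, sigma = 9.0, 2.0         # lognormal severity parameters (assumed)

counts = rng.poisson(lam, size=n_years)
annual_losses = np.array([rng.lognormal(mu, sigma, size=n).sum() for n in counts])

# Capital-style read-outs from the aggregated loss distribution.
print("expected annual loss:", annual_losses.mean())
print("99.9th percentile   :", np.quantile(annual_losses, 0.999))
```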
Limitations and Constraints of LDA
LDA for operational risk modeling has assumptions and limitations.
The assumption that the severity and frequency of operational losses are independent is a strong condition; in practice, high-frequency events tend to produce small losses while rare events produce the largest ones, suggesting a link between the two.
In addition, operational risk data is divided into risk classes or units of measure, and the losses within each class are expected to be independent and identically distributed (i.i.d.). This means that there should be no implicit correlation within a class, and all losses should follow the same distribution generated by the same mechanism. However, this assumption is challenging to fulfill due to the diverse nature of operational risk events.
To test i.i.d., losses are grouped into clusters with as much homogeneity as possible and tested for the absence of autocorrelation. However, the loss distributions for these individual risk classes must be eventually aggregated using explicit correlations, either assumed or calculated.
Interrelationships and Correlations of Operational Risk Events
LDA is used on each cluster of losses using a UoM (unit of measure) which is then aggregated to get the total loss distribution. If only LDA is used for capital modelling, the 99.9th percentile of this distribution is the stand-alone capital for operational risk.
Clustering operational risk events into homogenous UoMs is difficult due to their heterogeneous nature. Examples of possible UoMs for different event types and business lines include –
Individual business lines for external fraud events
All business lines for damage to physical assets events
Business entity for internal fraud events (under the same senior management and supervision)
Individual business lines for processing error events.
Good UoMs have homogenous operational risk events driven by similar factors with similar dynamics and distributions. However, there is a trade-off between homogeneity and data availability. Finer segmentation leads to more homogenous data and granular models, but less data available can increase uncertainty and complexity in the aggregation process.
The model dependency framework greatly affects capital results and is closely monitored by validation teams and regulators. UoMs should align with the business structure and the model’s ultimate use. Copulas, which can model advanced dependency structures, are used to aggregate UoMs, including tail dependence and dependence between extreme values or tails of distributions.
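A minimal sketch of copula aggregation across two units of measure, assuming a Gaussian copula with an illustrative correlation and lognormal marginals (real frameworks calibrate the dependency structure and often prefer copulas with tail dependence):

```python
# Illustrative Gaussian-copula aggregation of two UoM annual-loss distributions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200_000
rho = 0.3                                    # assumed correlation between UoMs

# Correlated uniforms generated through a Gaussian copula.
cov = [[1.0, rho], [rho, 1.0]]
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
u = stats.norm.cdf(z)

# Map the uniforms to each UoM's (assumed) marginal annual-loss distribution.
loss_uom1 = stats.lognorm(s=1.5, scale=np.exp(12.0)).ppf(u[:, 0])
loss_uom2 = stats.lognorm(s=2.0, scale=np.exp(10.5)).ppf(u[:, 1])

total = loss_uom1 + loss_uom2
print("99.9th percentile of total losses:", np.quantile(total, 0.999))
```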
LDA models are insufficient for modeling the extreme skewness of operational risk losses. To model the tail of the distribution, higher value losses can be added using extreme value theory, scenario data, or both.
Extreme Value Theory (EVT)
There are two approaches to Extreme Value Theory (EVT) illustrated graphically in this figure –
The Block Maxima (Fisher-Tippett) method analyzes the behavior of equally spaced maxima in time. This involves looking at the maximum operational loss per period of time and per UoM and examining the distribution of these events.
The Peak-over Threshold (POT) method focuses on observations that exceed a high threshold “u” that is deemed “sufficiently large”. A theorem shows that for high enough threshold u, the excess distribution function can be approximated by the Generalized Pareto Distribution (GPD). As the threshold increases, the excess distribution converges to the GPD.
Two approaches to EVT
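A minimal sketch of the peaks-over-threshold approach, assuming synthetic loss data and an illustrative threshold choice:

```python
# Illustrative peaks-over-threshold: fit a GPD to losses above a high threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
losses = rng.lognormal(mean=10.0, sigma=2.0, size=50_000)   # synthetic loss data

u = np.quantile(losses, 0.95)          # "sufficiently large" threshold (assumed)
excesses = losses[losses > u] - u      # exceedances over the threshold

# Fit the Generalized Pareto Distribution to the excesses (location fixed at 0).
shape, loc, scale = stats.genpareto.fit(excesses, floc=0)
print("threshold:", u, "GPD shape:", shape, "scale:", scale)

# Tail quantile implied by the fitted tail, e.g. the 99.9% loss level.
p_exceed = (losses > u).mean()
q999 = u + stats.genpareto.ppf(1 - 0.001 / p_exceed, shape, loc=0, scale=scale)
print("99.9% quantile estimate:", q999)
```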
Limitations and Constraints of EVT
The validity of EVT is subject to specific regularity conditions. To apply EVT to operational losses from various institutions, business lines, and risk types, it would require a single mechanism to be responsible for all the observed losses and future losses exceeding current levels. This condition is unlikely to hold, and thus, the applicability of EVT in operational risk is uncertain. Other approaches, such as scenarios, causal modeling, fixed multipliers, and macro determinants of operational risk, are explored in other sections.
EVT requires ample data to generate reliable quantiles and accurate predictions of future risk levels. Insufficient data can lead to inaccurate estimates and unreliable predictions.
Internal And External Loss Data
Internal loss data is critical to LDA in operational risk modeling, providing valuable insights into risk and control failures. It offers abundant data points to the modeler, enabling accurate estimates of moderate losses in the high-frequency range. Furthermore, internal data is often more relevant and detailed than external data, facilitating data cleaning, audit, and model review activities. Additionally, comprehensive internal incident databases support causal models by providing information on risk exposure and failure causes, which feed scenario analysis and assessment.
There is often a discrepancy between the internal incident database maintained by the risk management function and the calculation dataset used by modellers. Institutions must establish guidelines for when to include incidents in the calculation dataset, particularly for special cases like near misses, accidental gains, and rapidly recovered losses. Different institutions have varying approaches to this, which can affect the reliability of model results.
To reflect the full range of possible events and outcomes that might affect an institution, internal data needs to be supplemented with external data from peers or related industries. External loss data is typically sourced from databases that collect and classify operational incidents, either publicly available or for members only. Public data provides full information on the institution involved in each event, allowing for context comparison and relevance judgment. However, public data is scarcer than members-only data, which may decrease the model’s stability.
Industry associations and membership organizations collect internal loss data from members and share it anonymously. ORX and ORIC International are the market leaders in this field. Financial organizations use various criteria to choose external loss data sources that complement internal data and come from comparable peers. These criteria include geographical distribution, sector or business lines concentration, length of data series, data quality, membership size, and reporting threshold. Membership databases are abundant but anonymous and provide minimal information about each event.
Combining internal and external loss data for model calibration involves crucial methodological decisions that can impact the outcomes. Operational risk managers should be aware of the following choices and be able to communicate effectively with modelers –
Scaling adjusts losses to fit different dimensions such as the size of the institution or inflation. For example, a loss recorded by a large international bank can be proportionally scaled down to fit a smaller national bank (see the sketch after this list).
Cut-off mix determines the threshold for including external data in the model, usually when internal data become scarce, and more data points are required to estimate a distribution.
Filtering involves selecting criteria for including or excluding losses in a model. Regulators require institutions to establish clear rules to prevent manipulation of results and avoid cherry-picking data.
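A minimal sketch of the scaling idea in the first bullet, assuming a power-law adjustment for firm size (the functional form and exponent are illustrative, not prescribed):

```python
# Illustrative size scaling of an external loss to the reporting institution.
def scale_loss(external_loss: float,
               source_size: float,
               target_size: float,
               alpha: float = 0.25) -> float:
    """Scale a peer's loss by relative firm size raised to a calibrated exponent.

    An exponent below 1 reflects the observation that operational losses tend to
    grow less than proportionally with firm size; the value used here is an
    assumption, not a calibrated parameter.
    """
    return external_loss * (target_size / source_size) ** alpha

# A USD 50m loss at a bank with USD 40bn revenue, rescaled to a USD 4bn bank.
print(scale_loss(50e6, source_size=40e9, target_size=4e9))   # about USD 28m
```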
Capital Modeling For Operational Risk
The new Standardized Approach for operational risk capital calculation, effective January 1, 2023, eliminates the formal requirement for financial institutions to model their operational risk capital. However, large financial institutions with over 20 years of experience in modeling operational risk capital are unlikely to abandon this knowledge. Instead, they are transferring their modeling efforts and personnel from Pillar 1 to Pillar 2 and focusing on ICAAP, stress-testing calculations, and operational resilience.
Under AMA and IMA regulations, financial institutions can assess their own capital needs with approval from their country’s regulatory authority. Banks use internal models to determine the capital necessary to cover possible operational losses at a 99.9% confidence level over one year. AMA adoption was primarily by European institutions and required validation by independent parties and meeting around 30 qualitative and quantitative criteria for operational risk management.
The advanced approaches aim to reflect the financial institution’s risk profile accurately. Four types of input are required to build a qualifying model –
internal loss data (ILD),
external loss data (ELD),
scenario data (SD), and
business environment and internal control factors (BEICF).
ILD and ELD represent the input from past observations, while SD and BEICF provide prospective information on potential risks and control systems.
CASE STUDY: MODEL COMPONENT OF A TIER-1 EUROPEAN BANK
The model explicitly includes the four inputs of ILD, ELD, SD, and BEICF. The body and the tail of the loss distribution have been calibrated using two different statistical distributions. Modelers preferred simplicity and transparency over mathematical complexity, which increased model stability. The model also traces each contribution to the final capital estimate: loss events and scenarios have identifiable impacts on the resulting capital number. This full traceability made the model acceptable and actionable by the business.
The ICAAP is a self-evaluation of a regulated firm’s level of capital in consideration of their specific risk profile, covering all financial and non-financial risks. It is used to assess capital needs, given the business model and medium-term plan, risk exposures, and controls in place. ICAAP is present in several countries, but the US focuses more on CCAR stress testing for large banks. To estimate required capital for operational risk, organizations use a simple multi-criteria method or more sophisticated techniques for larger banks and insurance companies.
The ICAAP process results in a report sent to the regulator, detailing the selection and quantification of scenarios. Firms explain each scenario or capital calculation and rationale for excluding scenarios. The ICAAP can inform a financial institution’s economic capital, which covers possible losses to maintain external credit rating or guarantee survival. Economic capital and ICAAP share the same purpose.
OPERATIONAL ICAAP: OUTLINE
The outline reproduced below is the table of contents of an ICAAP report as recommended by a regulator for small- to medium-sized firms:
1. Executive Summary
2. Business Model, Strategy, and Forecasts
3. ICAAP Process and Risk Governance
4. Risk Appetite, Tolerances, and Monitoring
5. Scenarios: Identification, Mitigation, Assessment, and Stressed Assessment – (Pillar 2a)
6. Capital assessment, planning, and environmental stress testing – (Pillar 2b)
7. Challenge and adoption of the ICAAP (within the firm)
8. Wind Down Planning
9. Conclusion and next steps
Quantification Of Operational Resilience
Operational risk disasters can devastate revenue, reputation, and long-term earnings, stemming from system failures, cyberattacks, physical damage, or compliance breaches. Regulations require firms to assess their resilience to such disasters, even if they have not occurred before. Scenario analysis provides crucial input for tail distribution modeling and helps determine necessary preventative measures. Effective operational risk management improves operational resilience, which regulators consider an essential outcome. BIS notes that risk identification, assessment, mitigation, and monitoring work together to minimize disruptions and improve resilience.
Resilience frameworks differ from traditional business continuity management (BCM) in that BCM focuses on process continuity and recovery within a specific timeframe (the Recovery Time Objective, RTO), while resilience focuses on maintaining and recovering “important business services” within designated “impact tolerances”. Operational resilience strives to maintain critical business services in all but the most extreme circumstances, although what counts as “most extreme” is not explicitly defined.
Business continuity management (BCM) focuses on individual business processes, while resilience centers around important business services (IBS), with similar risk management and mitigation methods. The definition of IBS varies depending on the institution, its business, and customers. Prioritizing risk management resources and contingency planning is critical since not all IBS may be considered essential. The mapping of dependencies between IBS, identifying critical paths, and deploying resources for quick recovery is a complex part of resilience assessment. Large institutions require multi-disciplinary teams from departments such as ORM, BCM, Human Resources, Third-Party Management, IT, Operations, and Infrastructure to support the resilience team.
Steps for firms to ensure operational resilience –
Identify important business services by mapping out the entire service architecture, including dependencies.
Set impact tolerances for important business services based on required performance or availability levels.
Create a detailed end-to-end map of all important business services and identify resources required for delivery.
Design severe but plausible scenarios to test vulnerabilities in the delivery of important business services.
Examine lessons learned from stress tests and take actions to improve operational resilience where impact tolerance is exceeded.
Ensure internal and external communication plans are in place for incidents.
Conduct regular self-assessments to track progress over time and identify potential vulnerabilities and have the self-assessment document signed off by the Board annually.
BCM teams rely on impact tolerances established through a BIA, which assesses the consequences of disruptions to business processes and resources needed for recovery. This includes identifying operational and financial impacts such as lost sales, delayed income, increased costs, fines, penalties, customer loss, and delays in new business plans. The BIA report prioritizes restoration of business processes with the greatest impacts.
Resilience requires taking into account various types of single points of failure (SPOF) as key risk indicators (KRIs). It is important to minimize and eliminate SPOFs through backup and redundancy measures. However, if they cannot be removed, they should be factored into tolerance thresholds and stress tests. The table below outlines examples of dependency indicators that can serve as KRIs for assessing resilience.
KRIs for Resilience
• Key employee dependency
• Number of key employees without back-ups, without documented processes, with unique knowledge and skills, for example:
Coders
Pricing modelers
Cybersecurity specialists
Technical engineers (including physical assets)
• Key supplier dependency/low substitutability/high switching costs, such as: