SIL 4 despite unsafe hardware
Author: Mehmet Özer, SYSGO AG
Contribution – Embedded Software Engineering Congress 2017
Railway safety standards (CENELEC – EN50128, EN50129, EN50126, etc.) have introduced uniform requirements for the development of safety-related electronic systems, encompassing both software and hardware, and have replaced local standards of individual countries. While standardization leads to a unified understanding of safety and quality, which is definitely positive for safety, it also forces companies to implement a more costly development and certification process for safety systems.
This article focuses on the use case of a railway signaling system. Although only the relevant CENELEC standards apply, this article describes some aspects from the perspective of IEC 61508. The reason for this is not that the CENELEC standards do not adequately address these aspects, but rather that the description in IEC 61508 is clearer.
The two safety standards EN 50128 (software for railway control and monitoring systems) and EN 50129 (safety-related electronic systems for signaling) are a specialization of IEC 61508 for functional safety and define generic (software) applications and generic (hardware) products that can obtain independent certification for railway applications. When building a complex safety system, such COTS (commercial off-the-shelf) products can be reused, including their existing certification artifacts. This approach allows safety-related electronics to be assembled from pre-certified software and hardware modules.
A pre-certified generic (software) application must comply with the rules of EN 50128. Due to the nature of software, this safety standard only considers systematic errors. To reduce the systematic error rate to an acceptable level, the standard defines techniques and measures for the specification, development, verification and validation, as well as for the operation and maintenance of safety-related software. As a rule of thumb, the effort required for software certification increases proportionally with the number of lines of code.
Unlike software, hardware can experience both systematic and random failures. Systematic hardware failures are addressed by adhering to rules during the development process. Architectural measures (e.g., diversity) can also mitigate the impact of systematic failures. Random hardware failures are addressed by calculating the probability of failure using statistical data and historical usage data. This works quite well for discrete hardware components, but the more complex the hardware, the more difficult it becomes to calculate component failures. A practical alternative is to detect the failure externally (through diagnostics) and transition the hardware or system to a safe state within a defined timeframe. This alternative allows developers to use even complex hardware components, such as multi-core processors, provided appropriate diagnostic methods are employed.
Safety Integrity Levels (SIL)
The primary function of a safety-related electronic system is to perform a safety function that must achieve or maintain a safe state of a monitored device in order to mitigate the consequences of hazardous events. The ability to perform this safety function is described by safety integrity, which is a measure of the probability that a safety-related system will perform its specified safety functions under all given conditions within a defined timeframe. The highest safety integrity level (SIL) is defined by the safety requirements for the system.
The goal of a safety-related system is to reduce the risk of a given situation, considering its probability and specific consequences, to a tolerable level. Determining what constitutes a tolerable risk for a specific application requires considering several factors, such as legal requirements, guidelines, and industry standards (e.g., IEC 61508, EN 50128, EN 50129, etc.). The conformity of a safety-related system with its assigned SIL level must, in principle, be mathematically verified for hardware.
In the case of SIL 3, one dangerous failure every 1142 years (10⁻⁷ per hour) is considered acceptable (Table 1, see PDF). Since electronic components do not allow for such long-term verification, architectural concepts such as hardware fault tolerance, device diagnostics, inspection and proof tests must be applied to reduce the risk of electronic failure.
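The figure of roughly 1142 years follows directly from the hourly rate. The short sketch below reproduces the arithmetic; the function name is illustrative, and it assumes a constant failure rate and a year of 8760 hours.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 h, leap years ignored (assumption)


def mean_years_between_failures(rate_per_hour: float) -> float:
    """Mean interval between failures in years for a constant hourly rate."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return 1.0 / (rate_per_hour * HOURS_PER_YEAR)


# Tolerable rate quoted in the text for SIL 3: 1e-7 dangerous failures per hour
print(round(mean_years_between_failures(1e-7)))  # 1142
```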
Systematic errors
The IEC 61508 standard distinguishes between hardware safety integrity and systematic safety integrity when it comes to the safety integrity of electronic systems. Hardware safety integrity refers to random hardware failures, while systematic safety integrity relates to systematic failures. The term Common Cause Failure (CCF) describes random and systematic events that lead to the simultaneous failure of multiple devices (in a multi-channel system).
Errors are managed with different strategies depending on whether they are random or systematic. Random errors can be identified through internal device diagnostics, external diagnostics, inspection, and proof testing. While random hardware failures are primarily caused by aging and external influences, systematic failures are a direct consequence of system complexity. They are often introduced during the specification and design or implementation phases, but can also be caused by errors during manufacturing, integration, or by operator or maintenance errors. Systematic failures are considered predictable and can be controlled through the strict application of correct processes throughout the lifecycle of the electronic component and the implemented software.
The probability of systematic failures can also be reduced to a sufficiently low level through multi-channel design (fault tolerance through redundancy). This fault tolerance can be achieved, for example, through diversity, by using different technologies or products. Furthermore, different hardware architectures from various vendors can be used. In the case of software, the same algorithm can, for instance, be implemented in different programming languages and/or run in different runtime environments. Increased reliability through diversity is based on the assumption that different devices have different causes and modes of failure.
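As a minimal illustration of the idea, the sketch below runs two differently formulated implementations of the same check and compares their results, falling back to a safe state on disagreement. All names, the overspeed check, and the output strings are hypothetical; real diversity would involve different languages, toolchains, or hardware rather than two Python functions.

```python
def overspeed_channel_a(speed_kmh: int, limit_kmh: int) -> bool:
    """Channel A: direct comparison of speed against the limit."""
    return speed_kmh > limit_kmh


def overspeed_channel_b(speed_kmh: int, limit_kmh: int) -> bool:
    """Channel B: same requirement, formulated via the sign of the margin."""
    return (limit_kmh - speed_kmh) < 0


def compare_channels(result_a: bool, result_b: bool) -> str:
    """Two-channel comparison: any disagreement leads to the safe state."""
    if result_a != result_b:
        return "SAFE_STATE"
    return "BRAKE" if result_a else "PROCEED"


def evaluate(speed_kmh: int, limit_kmh: int) -> str:
    """Run both channels on the same inputs and compare their outputs."""
    return compare_channels(
        overspeed_channel_a(speed_kmh, limit_kmh),
        overspeed_channel_b(speed_kmh, limit_kmh),
    )


print(evaluate(120, 100))  # BRAKE
print(evaluate(80, 100))   # PROCEED
```

The comparison step is deliberately kept separate from the channels, so a fault injected into one channel can be exercised without changing the other.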
Random hardware failures
Random hardware failures occur at random times and result from physical deterioration of the hardware. This deterioration can be caused by manufacturing tolerances, abnormal process conditions (overvoltage and temperature), electrostatic discharge, device wear, etc., leading to hardware components failing during operation. While the failures occur with a predictable probability, they happen at unpredictable (i.e., random) times. Depending on the impact of a failure on the hardware, it is referred to as either a "soft failure" or a "hard failure." A soft failure is temporary and has no lasting consequences, while a hard failure permanently damages the hardware. As hardware becomes more complex, the probability of both soft and hard failures also increases due to environmental and operating conditions. This is exacerbated by the trend toward shrinking hardware geometry and increasing the density of transistors on silicon. Memory components, in particular, are susceptible to crosstalk and electromagnetic fields, which can lead to soft or hard failures.
Higher SIL levels – but how?
According to IEC 61508 Part 2, Table 2 (see PDF) applies to complex devices.
Safe Failure Fraction (SFF)
To describe these quantities mathematically, we also need the parameter λ, which is a measure of the failure rate (failures per unit of time). IEC 61508 also classifies random hardware failures according to their impact on the safety system, specifically as safe failures (λS) and dangerous failures (λD). A safe failure leaves the device able to guarantee safety in accordance with the safety concept. A dangerous failure, on the other hand, prevents the device from performing its safety function.
Safe and dangerous failures are further divided into the categories "detected" and "undetected." Thus, the failure rate is divided into four groups:
- λSD = safe detected failure rate (detected and harmless errors)
- λSU = safe undetected failure rate (undetected and harmless errors)
- λDD = dangerous detected failure rate
- λDU = dangerous undetected failure rate
The Safe Failure Fraction (SFF) is the ratio of the rate of failures that are safe, or can be treated as safe, to the total failure rate.
SFF = (λSU + λSD + λDD) / (λSU + λSD + λDD + λDU)
Detected (safe and dangerous) failures are uncovered through diagnostic tests. In the case of a safe failure (detected or undetected), the safety function can be maintained. A dangerous failure detected via diagnostics can be treated as a safe failure if effective measures are taken to bring the system to a safe state. Undetected dangerous failures lead to a loss of the safety function and must be kept to a minimum. Ideally, the dangerous undetected failure rate λDU is 0, which yields an SFF of 1 (100%). According to Table 2 (see PDF), such hardware could theoretically achieve SIL 3.
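The SFF formula can be evaluated directly once the four failure rates are known. The sketch below is a plain transcription of that formula; the function name and the example rates (given in FIT, failures per 10⁹ hours) are invented for illustration and do not describe any real component.

```python
def safe_failure_fraction(l_sd: float, l_su: float,
                          l_dd: float, l_du: float) -> float:
    """SFF = (safe detected + safe undetected + dangerous detected) / total."""
    total = l_sd + l_su + l_dd + l_du
    if total <= 0:
        raise ValueError("the total failure rate must be positive")
    return (l_sd + l_su + l_dd) / total


# Hypothetical rates in FIT: 400 SD, 100 SU, 450 DD, 50 DU
print(safe_failure_fraction(400, 100, 450, 50))  # 0.95

# With no dangerous undetected failures the SFF reaches 1 (100%)
print(safe_failure_fraction(400, 100, 500, 0))   # 1.0
```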
Hardware Fault Tolerance (HFT)
Hardware fault tolerance (HFT) is the ability of a system to maintain its required function even in the event of hardware failures. An HFT of 1 means, for example, that there are two (redundant) devices performing the safety function, and the failure of one of them does not affect the safety function. A three-channel system in which a single remaining channel can continue the safety function has an HFT of 2. The HFT can be easily calculated if the architecture is expressed as M out of N (MooN) (Table 2). In this case, the HFT is calculated as N − M. In other words, a 2oo4 architecture has an HFT of 2. This means that such a system can tolerate 2 failures and still function.
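For an architecture written as MooN, the HFT is N minus M. A minimal sketch, with an illustrative function name:

```python
def hardware_fault_tolerance(m: int, n: int) -> int:
    """HFT of an M-out-of-N (MooN) architecture: N minus M."""
    if not 1 <= m <= n:
        raise ValueError("MooN requires 1 <= M <= N")
    return n - m


print(hardware_fault_tolerance(1, 2))  # 1oo2 -> 1
print(hardware_fault_tolerance(1, 3))  # 1oo3 -> 2
print(hardware_fault_tolerance(2, 4))  # 2oo4 -> 2
```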
Table 2 (see PDF) shows the worst-case and best-case SILs that can be claimed depending on the hardware fault tolerance and the safe failure fraction. With a low SFF (<60%), it would not be permissible to use a single-channel system without any hardware fault tolerance to support a safety function. However, provided the specified criteria for a higher SFF can be met, it would be possible to claim up to SIL 3 for a single-channel subsystem. As a rule of thumb, the achievable SIL for a component with a specific SFF increases with the HFT of the system design. For example, if a device has an SFF between 90% and 99%, it can achieve SIL 2 without hardware fault tolerance and SIL 3 with an HFT of 1.
If the SFF is below 90% (but > 60%), an HFT of 2 also allows a SIL 3 design with these components.
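The relationship between SFF, HFT, and the maximum claimable SIL can be expressed as a small lookup. The sketch below assumes the matrix corresponds to the architectural-constraint table for complex (type B) elements in IEC 61508-2. The values agree with every example given in the text but should be verified against the standard before use. The function and table names are illustrative.

```python
from typing import Optional

# Rows: lower bound of the SFF band (inclusive); columns: HFT 0, 1, 2.
# None means a safety function may not be claimed at all.
# Assumed values, to be checked against IEC 61508-2.
_MAX_SIL_BY_SFF_BAND = [
    (0.99, (3, 4, 4)),
    (0.90, (2, 3, 4)),
    (0.60, (1, 2, 3)),
    (0.00, (None, 1, 2)),
]


def max_claimable_sil(sff: float, hft: int) -> Optional[int]:
    """Highest SIL claimable for a complex element, or None if not allowed."""
    if not 0.0 <= sff <= 1.0:
        raise ValueError("sff must be between 0 and 1")
    if hft not in (0, 1, 2):
        raise ValueError("this table only covers an HFT of 0, 1 or 2")
    for lower_bound, sils in _MAX_SIL_BY_SFF_BAND:
        if sff >= lower_bound:
            return sils[hft]
    return None  # unreachable: the last band starts at 0.0


print(max_claimable_sil(0.95, 0))  # 2
print(max_claimable_sil(0.95, 1))  # 3
print(max_claimable_sil(0.75, 2))  # 3
print(max_claimable_sil(0.50, 0))  # None
```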
Diagnostic coverage
Diagnostic coverage (DC) describes the proportion of dangerous failures that are detected by diagnostic tests. The mathematical description of this quantity also shows why the dangerous undetected failures (λDU) must be minimized: only as λDU approaches 0 can a diagnostic coverage of 1 (100%) be achieved.
DC = λDD / (λDD + λDU)
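The DC formula translates into a one-line calculation. In the sketch below the function name and the example rates (in FIT) are invented for illustration.

```python
def diagnostic_coverage(l_dd: float, l_du: float) -> float:
    """DC = dangerous detected rate / total dangerous rate."""
    dangerous = l_dd + l_du
    if dangerous <= 0:
        raise ValueError("the total dangerous failure rate must be positive")
    return l_dd / dangerous


# Hypothetical rates in FIT: 450 dangerous detected, 50 dangerous undetected
print(diagnostic_coverage(450, 50))  # 0.9

# Without dangerous undetected failures the coverage is 1 (100%)
print(diagnostic_coverage(450, 0))   # 1.0
```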
Manufacturer-specific diagnostics for certain components are generally designed to detect all critical faults in those components, and therefore do not cover the entire system (e.g., a CPU temperature sensor). From a system perspective, additional diagnostics must be implemented to reduce the likelihood of critical system failures.
Part 2 (Annex A) of IEC 61508 recommends techniques and measures for diagnostic tests and specifies the maximum diagnostic coverage that can be achieved with each measure (see Table 3, PDF). These tests can be performed continuously or periodically.
Table 3 shows an example of the tests suggested for RAM (variable memory). Table 4 (see PDF): Annex A lists all components for which various diagnostic tests are proposed. The required diagnostics depend on several factors, such as the assigned safety integrity level, the architecture, and the expected demand rate (how often the safety function is requested per unit of time). There are certainly cases where specific diagnostic tests should not be performed because they would adversely affect the system's operation. Detailed RAM tests, for example, are very time-consuming and may disrupt the application's runtime behavior.
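To give a feel for what a RAM diagnostic does, the sketch below runs a simple checkerboard-style pattern test on a simulated memory. It is an illustration only. On a real target such a test works on physical addresses, is usually written in C or assembly, and must save and restore memory contents. The names, the patterns, and the stuck-bit fault model are assumptions chosen for the example.

```python
def checkerboard_test(memory: bytearray) -> bool:
    """Destructive pattern test: write alternating patterns, read them back.

    Each cell is tested with 0x55 and with 0xAA, so every bit is checked
    for holding both 0 and 1. Returns True if all read-backs match.
    """
    for even_pattern, odd_pattern in ((0x55, 0xAA), (0xAA, 0x55)):
        # Write phase: neighbouring cells receive inverse patterns
        for addr in range(len(memory)):
            memory[addr] = even_pattern if addr % 2 == 0 else odd_pattern
        # Read phase: compare every cell with the pattern written to it
        for addr in range(len(memory)):
            expected = even_pattern if addr % 2 == 0 else odd_pattern
            if memory[addr] != expected:
                return False
    return True


class StuckBitMemory(bytearray):
    """Simulated RAM in which bit 0 of cell 3 is stuck at 1 (fault model)."""

    def __setitem__(self, index, value):
        if index == 3:
            value |= 0x01
        super().__setitem__(index, value)


print(checkerboard_test(bytearray(16)))       # True
print(checkerboard_test(StuckBitMemory(16)))  # False
```

Even this simple test touches every cell four times per run, which illustrates why more thorough RAM tests can interfere with the timing of the application.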
Safety-related software
When using a CPU board in a safety system, software becomes an essential part of the safety functionality. During early device initialization, the bootloader is a crucial software component. After booting, an operating system (OS) can be deployed to simplify hardware use. A software application running on the operating system can then perform the (safety) functions, including diagnostic tests.
Software is subject only to systematic errors, which are generally caused by mistakes in the design phase. EN 50128 and IEC 61508 Part 3 describe methods and concepts to reduce the probability of software errors to an acceptable level. These methods roughly comprise the following phases:
- Requirements specification
- Software design and development process
- Verification and validation procedures
Detailed documentation and evidence must be prepared to demonstrate that an appropriate level of compliance with the prescribed rules has been applied. The effort required for validation tests is highly dependent on the amount of source code (source lines of code – SLOC) that needs to be tested. Lean source code significantly reduces the certification effort. This also applies to all software components, such as bootloaders, operating systems, application code, and built-in diagnostic functions. Therefore, using the right technology and validation approach has a considerable impact on the overall certification costs. Without exception, all industrial safety standards recommend separating safety-related and non-safety-related software, so that only the safety-related parts need to be certified.
Separation of applications through a separation kernel
While using separate hardware components for safety-critical and other applications ensures safe separation, it also leads to higher hardware costs. In contrast, an operating system based on a separation kernel enables the separation of safety-critical and non-critical application code on the same hardware platform by partitioning the hardware's physical and temporal resources. The separation of physical resources is referred to as spatial separation or resource partitioning, while the separation of available execution time is known as temporal separation or time partitioning. This separation principle can be compared to that of a hypervisor, but the key difference is that the separation provided by the separation kernel is free of interference. This means that an error in one partition cannot propagate to other partitions.
In the nomenclature of a separation kernel, the isolated application areas are referred to as partitions. Separating applications into partitions ensures that they cannot interfere with each other, allowing each application to operate at its assigned Safety Integrity Level (SIL). This enables a single hardware platform to handle applications with mixed criticality levels. For example, a communication stack (TCP/IP, web server, OPC UA, etc.) can be hosted in a SIL 0 partition, while a safety-critical application runs in a SIL 4 partition. In such a case, each partition's contents only need to be certified for their respective SIL level.
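Time partitioning is often described as a static schedule that assigns fixed time windows within a repeating frame to each partition. The sketch below checks such a schedule for overlaps and sums the time budget per partition. It is a conceptual model only and does not represent the configuration format or API of any particular separation kernel; all names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TimeWindow:
    partition: str    # name of the partition that owns the window
    start_ms: int     # offset from the start of the repeating frame
    duration_ms: int  # length of the window


def partition_budgets(windows: List[TimeWindow],
                      frame_ms: int) -> Dict[str, int]:
    """Validate a static schedule and return the time budget per partition."""
    budgets: Dict[str, int] = {}
    cursor = 0  # end of the previous window
    for window in sorted(windows, key=lambda w: w.start_ms):
        if window.duration_ms <= 0:
            raise ValueError("window duration must be positive")
        if window.start_ms < cursor:
            raise ValueError("time windows overlap")
        end = window.start_ms + window.duration_ms
        if end > frame_ms:
            raise ValueError("window exceeds the frame")
        budgets[window.partition] = (
            budgets.get(window.partition, 0) + window.duration_ms
        )
        cursor = end
    return budgets


schedule = [
    TimeWindow("sil4_application", start_ms=0, duration_ms=40),
    TimeWindow("sil0_communication", start_ms=40, duration_ms=50),
]
print(partition_budgets(schedule, frame_ms=100))
# {'sil4_application': 40, 'sil0_communication': 50}
```

Because the windows are fixed in advance, the communication partition in this model cannot consume time reserved for the safety-critical partition, whatever it does inside its own window.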
Conclusion
The effort (cost and time) required to certify a Safety Related Electronic System increases with the complexity of the hardware and software components. Determining the failure rates for complex hardware (such as CPU boards) by examining individual components is virtually impossible. This is especially true given the short lifecycles of hardware, which means there is hardly any reliable usage data available to support a "proven in use" argument. As a way out of this dilemma, IEC 61508 allows for fault-tolerant design and diagnostic testing, enabling the system to be monitored so that it can be brought to a safe state in the event of a failure.
The effort required for software certification depends on the number of source lines of code (SLOC). The scope for reducing SLOC is limited when specific functionality must be implemented. As we have seen, partitioning the software into different SIL classes can remedy this, as the respective application components only need to be certified for their respective SIL.
At the component level, the CENELEC standards for railway and IEC 61508 allow the pre-certification of generic software and hardware components. This means that pre-certified COTS software and hardware components can be reused in different projects without recertification.
List of sources
[1.2] IEC 61508-2, Edition 2.0 2010-04: Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 2: Requirements for electrical/electronic/programmable electronic safety-related systems
[1.3] IEC 61508-3, Edition 2.0 2010-04: Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 3: Software requirements
[1.7] IEC 61508-7, Edition 2.0 2010-04: Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 7: Overview of techniques and measures
[2] BS EN 50128:2011: Railway applications — Communication, signaling and processing systems — Software for railway control and protection systems
[3] DIN EN 50129:2003: Railway applications. Telecommunications, signalling and data processing systems. Safety-related electronic systems for signalling
[4] SYSGO White Paper: Safety certification for unsafe COTS platforms
