Approaches to Reliability Testing & Setting of Reliability Test Objectives
If the project context demands different types of reliability testing, the approach to reliability testing is governed by the following three factors:
1) Identified risks, in particular those relating to safety-critical systems
2) Applicable standards
3) Available resources
When planning an approach to reliability tests, it is worth bearing in mind that some tests are defined with one aspect of reliability in focus but may also be applicable to other reliability aspects. If we decide, for example, to evaluate the recoverability of a system, we may first need to cause that system to fail. The very act of defining these tests (i.e., getting the system to fail) may give us insights into the fault tolerance of our system.
At the planning stage, we need to set out the reliability objectives to be achieved and state how their achievement will be measured. This involves not only setting the end objective, but also considering how we expect reliability to gradually improve over time.
A commonly used time-based measure for reliability is the mean time between failures (MTBF), which is made up of the following two components:
1) The mean time to failure (MTTF), representing the actual time elapsed (in hours) between observed failures
2) The mean time to repair (MTTR), representing the number of hours needed by a developer to fix the problem
According to the software testing expert Ilene Burnstein, we should be precise in our measurements, and CPU execution time is often a more appropriate measure than simple elapsed "wall clock" time. This enables planned downtimes and other disturbances to be taken into account and removes the possibility of calculating overly pessimistic values of reliability.
Ilene Burnstein describes a measure for reliability (R), which is based on MTBF and takes a value between 0 (totally unreliable) and 1 (completely reliable). R is calculated as MTBF divided by (1 + MTBF). Clearly, the larger the value of MTBF (i.e., the further apart failures occur), the closer R approaches (but, significantly, never reaches) 1.
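The calculation described above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample figures are assumptions made for the example, and MTBF is taken as the sum of MTTF and MTTR, as defined in the list above.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """MTBF as the sum of its two components, both in hours."""
    if mttf_hours < 0 or mttr_hours < 0:
        raise ValueError("times must be non-negative")
    return mttf_hours + mttr_hours


def reliability(mtbf_hours: float) -> float:
    """R = MTBF / (1 + MTBF): 0 when MTBF is 0, approaching 1 as MTBF grows."""
    if mtbf_hours < 0:
        raise ValueError("MTBF must be non-negative")
    return mtbf_hours / (1.0 + mtbf_hours)


if __name__ == "__main__":
    # Illustrative figures only: 98 hours to failure, 2 hours to repair.
    example_mtbf = mtbf(mttf_hours=98.0, mttr_hours=2.0)
    print(example_mtbf, reliability(example_mtbf))
```

Because the divisor is always one larger than MTBF, the result stays strictly below 1 however large MTBF becomes.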
If recoverability tests are included in our approach to reliability testing, it may be appropriate to define software testing objectives as follows:
1) Failover:
# Test objectives are to create failure modes that require failover measures to be taken (possibly also associated with a time constraint within which this must happen).
2) Backup:
# Test objectives are to verify that different types of backup (e.g., full, incremental, image) can be completed, possibly within a given time period.
# Objectives may also relate to service levels for guaranteed data backup (e.g., master data no more than 4 days old, non-critical transaction data no more than 48 hours old, critical transaction data no older than 10 minutes).
3) Restore:
# Test objectives are to verify that a specified level of functionality (e.g., emergency, partial, full) can be achieved, possibly within a given time period.
# An objective may also be to measure the time taken to recognize whether any data losses or corruptions have occurred after a failure and restore the lost or corrupted data (possibly differentiated by the types of data backed up, as mentioned earlier).
It is not uncommon for one or more of the objectives to be carried over into production and monitored as Service Level Agreements.
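A backup service level such as the one given as an example under "Backup" can be checked automatically, both in test and later in production. The sketch below is a minimal illustration in Python: the category names, the `stale_backups` function and the way backup timestamps are supplied are all assumptions made for this example. Only the age limits come from the example in the text.

```python
from datetime import datetime, timedelta

# Maximum allowed backup age per data category (limits from the example above;
# the category names themselves are hypothetical).
MAX_BACKUP_AGE = {
    "master": timedelta(days=4),
    "non_critical_transaction": timedelta(hours=48),
    "critical_transaction": timedelta(minutes=10),
}


def stale_backups(last_backup: dict, now: datetime) -> list:
    """Return the categories whose latest backup is missing or too old."""
    return sorted(
        category
        for category, limit in MAX_BACKUP_AGE.items()
        if category not in last_backup or now - last_backup[category] > limit
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0)
    observed = {
        "master": now - timedelta(days=3),
        "non_critical_transaction": now - timedelta(hours=50),
        "critical_transaction": now - timedelta(minutes=5),
    }
    print(stale_backups(observed, now))
```

A category with no recorded backup is treated as a violation, which is the cautious choice for a guaranteed service level.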
Measurement of Reliability Levels:
We need many test repetitions to measure reliability levels. Tests to measure reliability levels are mostly conducted during the system test or (operational) acceptance test levels. This is primarily because these test levels present more opportunity for executing the test cycle repetitions necessary to measure reliability levels accurately. The repetitious nature of these reliability tests also makes them good candidates for conducting dynamic analysis in parallel, especially regarding memory leaks.
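As a rough illustration of why many repetitions are needed, the mean time to failure can be estimated from the failures observed across repeated test cycles. The Python sketch below assumes that each failure is logged as cumulative execution hours since the start of testing. The function name and the sample figures are hypothetical.

```python
def mean_time_to_failure(failure_times_hours: list) -> float:
    """Average gap, in hours, between consecutive observed failures."""
    if len(failure_times_hours) < 2:
        raise ValueError("at least two failures are needed to measure a gap")
    ordered = sorted(failure_times_hours)
    # Pair each failure time with the next one and take the difference.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    # Illustrative log: failures after 10, 30 and 70 execution hours.
    print(mean_time_to_failure([10.0, 30.0, 70.0]))
```

With only a handful of failures the average swings widely from one run to the next, so the estimate only settles down after many test cycles.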
Tests aimed at measuring reliability levels can also be conducted in a highly controlled manner with a large number of test cases. If this approach is taken, it may be necessary to plan for a number of days for their execution and possibly the exclusive use of a software testing environment with a stable software configuration over that time frame.
It may be efficient to schedule tests of fault tolerance (robustness) at the same time as failover tests or even certain security tests since the required test inputs (e.g., exception conditions raised by the operating system) may be common.
The operational acceptance test (OAT) level is typically where procedural tests for backup and restoration are conducted. These tests are best scheduled together with the staff that will be responsible for actually performing the specified procedures in production.
Finally, the scheduling of any reliability tests (but in particular, failover tests) for a system of systems can present a technical and managerial challenge that should not be underestimated, especially if one or more components are outside of our direct control.