Can I Increase SIS Test Intervals to Seven Years?


I recently received a question from an operating company engineer.  He had been asked to confirm that his safety instrumented systems were suitable for an increase in test interval from the current five years up to seven years.  Even though he believed the calculations would show that the increased test interval was acceptable, he was hesitant to endorse such a drastic two-year shift based on nothing more than gut-feel discomfort.  I was able to give him a more technical basis for why his gut was telling him the increase did not feel right, even though the “perfect math” of SIL verification calculations might have been able to justify it.


As you know, refineries are always trying to extend the interval between turnarounds in order to minimize expense.  In doing so, they are also increasing the time interval between SIS tests, if tests are only possible during the shutdown that a turnaround provides.  While it might seem that determining whether or not these extended intervals are acceptable is a simple matter of re-running the SIL verification calculations with a different test interval, the reality is a bit more complex.
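To see why re-running the numbers looks deceptively easy, note that under the common simplified equations, the average probability of failure on demand for a single untested channel scales linearly with the test interval. A minimal sketch in Python (the failure rate is an illustrative assumption, not a database value):

```python
# Simplified average probability of failure on demand for a single
# (1oo1) channel under the constant-failure-rate assumption:
#   PFDavg ~= lambda_DU * TI / 2
# lambda_DU = dangerous undetected failure rate (per hour)
# TI        = proof test interval (hours)

HOURS_PER_YEAR = 8760.0

def pfd_avg(lambda_du: float, ti_years: float) -> float:
    """Simplified 1oo1 PFDavg; valid when lambda_du * TI << 1."""
    return lambda_du * ti_years * HOURS_PER_YEAR / 2.0

lam = 2.0e-6  # illustrative dangerous undetected failure rate, per hour

pfd_5 = pfd_avg(lam, 5.0)   # current 5-year interval
pfd_7 = pfd_avg(lam, 7.0)   # proposed 7-year interval

print(f"PFDavg @ 5 yr: {pfd_5:.3e}  (RRF ~ {1/pfd_5:.0f})")
print(f"PFDavg @ 7 yr: {pfd_7:.3e}  (RRF ~ {1/pfd_7:.0f})")
# The 7-year figure is exactly 7/5 of the 5-year figure -- the math is
# "perfect" only if lambda_du really is constant out to year seven.
```

The calculation dutifully reports a larger PFDavg at seven years, but the linear scaling is only as trustworthy as the constant-rate assumption behind it, which the rest of this post examines.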

SIL verification calculations depend on failure rates for SIS equipment items.  The data that we use for those failure rates is often collected from actual operating SIS equipment in the field and compiled into databases such as OREDA and NPRD.  A database simply shows a single (constant) failure rate for a device, implying that this single number is an attribute of that specific type of device, but again, the truth is much more complex.

When we collect and use data for failure rate calculations we are making two fundamental assumptions that might not be obvious to all persons who perform SIL verification calculations.  These assumptions are:

  1. Constant Failure Rate
  2. Well designed and well maintained equipment

The first assumption is that the failure rate of an instrument is constant over its entire lifetime.  Stated another way, this assumption implies that the probability of a device failing in year one is exactly the same as the probability of it failing in year 2, 5, 10, or 20.  While the constant failure rate assumption is fairly valid for electronic equipment during its useful life (i.e., after burn-in but before wear-out failures start to occur, usually about 10 years after fabrication of the equipment), it is less valid for equipment with wearable moving parts, such as a valve.  As we collect data, we generally do not record when the failure occurred relative to the installation of the equipment item, so databases will generally produce failure rates that are representative of equipment items across all of the ages typically in service.  As such, if most operating companies are performing tests (and also maintenance) at a 5-year turnaround interval, then the databases that we use for SIL verification calculations reflect SIS instruments that are in service for up to five years between tests.

If a user goes outside the typical turnaround intervals, increasing intervals to 6 or 7 years, then SIL verifications based on data from SIS instruments with shorter test intervals do not accurately represent the (increased) failure rates that might be expected as the between-testing and between-maintenance intervals are increased – including instruments that are 6 and 7 years beyond their last test/maintenance.  As such, engineering judgment would indicate that using typical failure rate data is too aggressive a stance, but common data based on the increased test intervals is not yet available.
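The non-conservatism can be illustrated numerically. Suppose the true behavior of a valve follows a wear-out (increasing-hazard) Weibull curve, but the database, built from fleets tested every five years, reports a constant rate matching the observed 5-year failure probability. A sketch with illustrative, assumed parameters:

```python
import math

# Compare cumulative failure probability out to 7 years under:
#  (a) a constant failure rate fitted to 5 years of field data, and
#  (b) an increasing (wear-out) Weibull hazard calibrated to give the
#      SAME cumulative failure probability at 5 years.
# All numbers are illustrative assumptions, not published data.

lam = 0.02   # constant rate, failures/year (illustrative)
beta = 2.5   # Weibull shape > 1 => wear-out behavior (illustrative)

t_cal = 5.0
# Calibrate the Weibull scale eta so F_weib(5) == F_const(5):
#   F_const(t) = 1 - exp(-lam*t);  F_weib(t) = 1 - exp(-(t/eta)**beta)
eta = t_cal / (lam * t_cal) ** (1.0 / beta)

def f_const(t: float) -> float:
    return 1.0 - math.exp(-lam * t)

def f_weib(t: float) -> float:
    return 1.0 - math.exp(-((t / eta) ** beta))

for t in (5.0, 6.0, 7.0):
    print(f"t={t:.0f} yr  constant: {f_const(t):.4f}  wear-out: {f_weib(t):.4f}")
# At 5 years the two agree by construction; by year 7 the wear-out model
# predicts noticeably more failures than the constant-rate extrapolation,
# which is the non-conservative gap the text describes.
```

The gap between the two curves beyond the calibration point is exactly the error that re-running a SIL verification with a longer test interval, but unchanged database rates, would silently absorb.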

The assumption of a well designed and well maintained system comes into play in a very similar way.  If a routine maintenance task, such as greasing a bearing, replacing packing, or replacing seals, is performed at every turnaround, and a failure of those components can cause a failure of the SIS, then the failure rate data is critically dependent on those maintenance actions occurring at the five-year interval.  If the maintenance activity does not occur until 6 or 7 years after the start of a run, one can infer that the failure rates (especially as the devices reach the 6th and 7th year) will increase.  Since the bulk of industry, from which the typical failure rate data is derived, performs its maintenance activities on the shorter 4-5 year interval, it again can be inferred that if the test interval is increased to 6 or 7 years but data from 4-5 year maintenance intervals is used for failure rates, the PFD calculations used to verify the achieved SIL will be in error – and in an aggressively non-conservative way.

Unfortunately, making a large increase in test intervals, especially in comparison to what industry peers are doing, may result in non-conservative SIL verification calculations whose data does not accurately represent the operating regime that the plant is in after the test intervals have been increased.  In order to prudently increase test intervals, the between-testing interval needs to be increased more slowly – perhaps one half year at a time, to give time for the actual failure rate data that is being collected by the plant and by industry as a whole to catch up with the new operating profiles of the plant.

Explanation of Mitigation Effectiveness



Since the release of the ISA 84.00.07 technical report, I’ve received a lot of enquiries regarding mitigation effectiveness.  Many of the questions were framed in terms of “what number do I use for mitigation effectiveness?”, which shows that we, the technical report working group, did not do a good job of defining the concept.  Mitigation effectiveness is not a singular number that will allow FGS analysis to collapse down into a one-dimensional probabilistic problem.  On the contrary, the term mitigation effectiveness is used as a placeholder for a large series of events, probabilities, and consequence magnitudes.  Mitigation effectiveness is actually an entire event tree of its own that describes what happens at the plant after a loss of containment event is detected and an FGS activates.

The following text is what I am proposing adding to the ISA 84.00.07 technical report to address this situation…


Annex E – Understanding the Mitigation Effectiveness Concept

Mitigation effectiveness is a complex concept that is used as a shorthand to encapsulate a wide range of factors that define the amount of risk reduction that an FGS function can provide.  The FGS effectiveness model, shown in Figure E.1, represents mitigation effectiveness as a single branch in an event tree.  This representation can lead to the interpretation that mitigation effectiveness is a single probabilistic value that when obtained will allow the analysis of FGS to collapse into a simple probabilistic calculation.  This interpretation is not correct.  While there is a value in demonstrating mitigation effectiveness as a single value in the FGS effectiveness model in order to illustrate general risk concepts, in reality mitigation effectiveness cannot be collapsed into a single value.  Instead, if modeled quantitatively, mitigation effectiveness is a large collection of event tree branches that describe the range of mitigation actions that are possible upon detection of a fire or gas event and the probability of success of each of these actions coupled with the amount of risk reduction that is provided under each scenario.


Figure E.1              FGS Effectiveness Model

In order to better illustrate the concept of mitigation effectiveness, consider an example FGS function.  The example process is a natural gas compressor station consisting of an enclosed compressor building containing a single compressor.  The compressor station is equipped with optical fire detectors that will, upon detection of a fire, activate a chemical fire suppressant system that is designed to extinguish the fire.  The compressor station is controlled and maintained by two staff members who are primarily located in a control room that is located adjacent to the compressor building.  The layout of the facility is shown in Figure E.2.


Figure E.2              Example Compressor Station Layout

For the case of a small incipient seal fire, the FGS effectiveness model would be populated with the frequency of the seal fire as the value for the loss of containment.  The detector coverage would be quantified using the scenario coverage for fires, as calculated in the detector coverage assessment, and the FGS safety availability would be calculated based on the average probability of failure on demand of the function, including sensors, logic solver, and dry chemical system.  For purposes of this example, assume that the achieved coverage is 80% and the achieved safety availability is 90% (as shown in Figure E.1).  This leaves only mitigation effectiveness undefined.
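With those example figures, the fraction of seal-fire events in which the FGS both detects the fire and successfully activates is a straightforward product; what remains undefined is what that activation actually accomplishes:

```python
coverage = 0.80      # scenario coverage from the detector assessment
availability = 0.90  # FGS safety availability (1 - PFDavg)

# Fraction of loss-of-containment events where the FGS both detects
# the fire and successfully activates the suppression system:
p_fgs_acts = coverage * availability
print(f"FGS detects and acts in {p_fgs_acts:.0%} of events")
# -> 72% of events. What happens in those events -- full suppression,
# partial suppression, injury on re-entry -- is the (still undefined)
# mitigation effectiveness, which is a tree of outcomes, not one number.
```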

The effectiveness of activation of the FGS is not a simple probability that the dry chemical system puts the fire out.  Instead it is a complex combination of mechanical and human interactions.  The factors that determine the amount of mitigation achieved include the following:

  1. What is the probability that the dry chemical system will extinguish the fire?  This probability is a function of the size of the fire that occurs and other contributing factors.  Failure of the dry chemical system to extinguish the fire could be caused by:
    1. A fire size in excess of the design basis
    2. Excessive HVAC action removing the chemical agent too rapidly
    3. Doors left open preventing the chemical agent from properly accumulating
    4. Other factors
  2. If the automatic fire extinguishment system fails, will an operations staff member manually extinguish the fire with handheld equipment?
  3. If the automatic fire extinguishment system fails and operations staff manually attempts to control the fire, will they be injured during the process?
  4. If the automatic fire extinguishment system does operate effectively, will operations staff still be injured as a result of entering the room prior to ventilation of extinguishing chemicals and combustion byproducts?

This complex series of events, which is represented by a single mitigation effectiveness value in the FGS effectiveness model, could be represented by the following event tree to more accurately portray the depth and complexity of the mitigation effectiveness concept.


Figure E.3              Event Tree Representing Mitigation Effectiveness

When selecting performance targets, the way that mitigation effectiveness is addressed depends on whether fully quantitative or semi-quantitative methods are employed.  In the fully quantitative approach, risk is calculated using an event tree to quantify all of the potential outcomes of a loss of containment accident.  In order to address mitigation effectiveness, all of the factors upon which the potential consequences rely are explicitly included in the event tree.  This requires that the type of information defining mitigation effectiveness shown in Figure E.3 be included in the overall FGS effectiveness model shown in Figure E.1.  Note that the event tree shown in Figure E.3 is only a simplified example.
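As a hedged illustration of what such a quantification might look like, the following sketch enumerates branches loosely mirroring the four factors listed above. Every probability and severity weight here is an assumption invented for illustration, not a value from the technical report:

```python
# Toy quantification of the mitigation-effectiveness event tree for the
# seal-fire example. All branch probabilities and severity weights are
# illustrative assumptions only.

p_auto   = 0.90  # dry chemical system extinguishes the fire
p_entry  = 0.05  # injury entering before ventilating agent/byproducts
p_try    = 0.80  # staff attempt manual extinguishment if auto fails
p_manual = 0.50  # manual attempt succeeds
p_hurt   = 0.20  # injury while fighting the fire manually

# (branch probability, severity: 0 = no loss, 1 = unmitigated consequence)
branches = [
    (p_auto * (1 - p_entry),                         0.0),  # auto OK, safe entry
    (p_auto * p_entry,                               0.3),  # auto OK, entry injury
    ((1 - p_auto) * p_try * p_manual * (1 - p_hurt), 0.2),  # manual OK, unhurt
    ((1 - p_auto) * p_try * p_manual * p_hurt,       0.5),  # manual OK, injured
    ((1 - p_auto) * p_try * (1 - p_manual),          1.0),  # manual attempt fails
    ((1 - p_auto) * (1 - p_try),                     1.0),  # no attempt made
]

total_p = sum(p for p, _ in branches)
assert abs(total_p - 1.0) < 1e-9  # the tree must be exhaustive

expected_severity = sum(p * s for p, s in branches)
mitigation_effectiveness = 1.0 - expected_severity
print(f"Effective mitigation across all branches: {mitigation_effectiveness:.1%}")
```

Even this toy tree shows why a single "mitigation effectiveness" number is really a probability-weighted summary of many distinct outcomes, each with its own consequence magnitude.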

In semi-quantitative approaches, the mitigation effectiveness is considered during the calibration of the charts, tables, and numerical criteria that make up the overall procedure.  No additional explicit consideration of mitigation effectiveness is performed.

FGS Functions that can be Treated as SIF


As I am currently preparing material for a revision of the ISA TR84.00.07 technical report, I will try to share with my readers some of the issues that users of the technical report have requested be included in an update to the TR.

When I last attended an IEC 60079 committee meeting, the members of the committee expressed some concern about the technical report, specifically related to situations where FGS functions should be treated directly as SIF, without any consideration of detector coverage and mitigation effectiveness.  Upon a short amount of consideration, I was able to develop a large number of situations where the analysis of an FGS function collapses down into the simple assessment of a preventive SIF.  In order to highlight that some FGS functions do not require selection of coverage targets or consideration of mitigation effectiveness during their design lifecycle (making them simple preventive SIF that only require design per IEC 61511), I prepared an example to illustrate this concept.  My expectation is that the example below will be included in Annex C of the next version of ISA TR84.00.07 to illustrate situations where the methods in the technical report are not required, and direct analysis per IEC 61511 is the appropriate approach.

The example is as follows:

Some applications of a FGS should be treated identically to a safety instrumented function, in accordance with IEC 61511.  This type of application occurs when the detector coverage and mitigation effectiveness of the FGS function are 100%.  If the only risk attribute of the FGS effectiveness that is not 100% is the safety availability, then the FGS function shall be treated as a preventive SIF.

Consider the example of a valve shelter house.  Process facilities that are exposed to extreme environmental conditions, such as arctic oil and gas production, require the use of shelters to protect process equipment.  One such application is the use of a valve shelter house to protect critical valves from low temperatures and other environmental stressors.  A hazard posed by the use of such shelter houses is that they prevent the dissipation of fugitive emissions from valve packing, potentially allowing dangerous levels of toxic compounds such as hydrogen sulfide to accumulate in the shelter.  If operations personnel enter the shelter house while high concentrations of toxins are present, they may be harmed.

In order to protect against this hazard, some operating companies employ an FGS function that will lock the shelter door and activate a visible alarm at the door upon detection of a high concentration of toxin inside the shelter.  In this application, all components of the FGS effectiveness other than the safety availability are 100%, and as such, the FGS function should be treated as a preventive SIF, which does not require consideration of detector effectiveness or complex risk analysis methods that consider mitigation effectiveness.  The detector coverage in this example is 100%: an H2S detector located inside the shelter house will have 100% coverage, as any leak from the valve will accumulate in the shelter house, allowing detection from virtually any installed location.  The mitigation effectiveness in this case is also 100%.  The means by which personnel would be harmed is opening the shelter door and subsequently inhaling hydrogen sulfide.  If the FGS function performs, it will prevent personnel from opening the door, completely preventing any consequence from occurring upon successful activation of the FGS function.

Given that the only FGS Effectiveness attribute that is not 100% is safety availability, selection of performance targets can be simplified to common approaches for SIL selection in accordance with IEC 61511, as described in IEC 61511 part 3.

Accuracy of SIL calculations


During the recent ISA 84 committee meetings related to ISA TR84.00.02, which discusses SIL verification calculations, I was made aware of an effort to quantify the amount of error in SIL calculations.  The objective of this effort was to determine how much of a margin of error should be placed on the acceptance of a SIL verification calculation.  For instance, if a SIL 2 function is desired and the calculation shows a risk reduction factor of 102 was achieved, is that good enough?  The theory being proposed is that you should establish a limit on what RRF value is acceptable based on the amount of error that is present.  So, for instance, if you determine that your SIL verification calculation has an error of +/- 5, then a calculated RRF of 102 is really an RRF of between 97 and 107.  Since 97 does not achieve the SIL 2 target, you should modify the design until the full range, including the worst-case error, is within the SIL band.
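The proposed acceptance rule could be sketched as follows, using the hypothetical +/- 5 error figure from the example above and the SIL 2 band of RRF 100 to 1000:

```python
def sil2_ok_with_error(rrf: float, error: float) -> bool:
    """Proposed rule: the whole error band must sit inside the
    SIL 2 range (RRF 100 to 1000) before the design is accepted."""
    return (rrf - error) >= 100 and (rrf + error) < 1000

print(sil2_ok_with_error(102, 5))  # worst case is 97 -> rejected
print(sil2_ok_with_error(110, 5))  # worst case is 105 -> accepted
```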

Sounds reasonable right?

Not to me.

We’ve already larded up the SIL verification process with so many safety factors that adding another one here is going to cross from very conservative over to comical.  Additionally, this approach violates the spirit and philosophy of how we have performed SIL verification calculations since the advent of IEC 61508.  Engineers, in general, are taught to perform rigorous calculations to obtain precise numbers.  While this works well for things that can be known precisely, such as temperatures, pressures, and flow rates, it is not realistic for risk.  As a risk analyst, you must take a different and more humble approach.  When performing a risk calculation, you give up on the concept of knowing something precisely and instead set boundaries with a degree of confidence.  As a risk analyst, you don’t say “I KNOW that the frequency of an accident is precisely 1.53E-3 per year”; instead you say “I am CONFIDENT that the frequency is less than 1.53E-3.”  A subtle, but very important distinction.  In one case you are claiming precision that risk analysis can never really have; in the other you are setting a boundary that you are confident will not be violated.

SIL verification calculations, since their inception, have used this approach of setting a confidence boundary.  In IEC 61508 (and the current version of IEC 61511) there are several references to a 70% single-sided confidence limit when determining failure rates.  When using this approach, you are essentially saying that, for an instrument, you are confident (to the degree of 70%) that the failure rate is below a certain number.  Again, this is different from claiming to know exactly what the failure rate is.  It is this 70% confidence limit that is now, and always has been, the margin of safety employed to ensure that SIS designs are conservative and account for uncertainty in the numbers.  Adding more uncertainty analysis is unnecessary and counter-productive.
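For readers who want to see the mechanics, the 70% single-sided upper bound can be computed from field data with the standard chi-square formula. This is a stdlib-only sketch; the observed failure count and device-hours below are illustrative assumptions:

```python
import math

# 70% single-sided upper confidence bound on a failure rate from field
# data: lam_UL = chi2_quantile(0.70, 2r + 2) / (2T), where r failures
# were observed over T cumulative device-hours. For even degrees of
# freedom 2k, the chi-square CDF reduces to a finite Poisson sum.

def chi2_cdf_even(x: float, dof: int) -> float:
    """Chi-square CDF for even dof = 2k:
    P = 1 - exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!"""
    k = dof // 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= (x / 2.0) / i
        total += term
    return 1.0 - math.exp(-x / 2.0) * total

def chi2_quantile_even(p: float, dof: int) -> float:
    """Invert the CDF by bisection (adequate for this illustration)."""
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def lambda_upper(failures: int, hours: float, conf: float = 0.70) -> float:
    return chi2_quantile_even(conf, 2 * failures + 2) / (2.0 * hours)

# Example: 3 failures observed over 2,000,000 cumulative device-hours
lam_point = 3 / 2.0e6           # naive point estimate
lam_70 = lambda_upper(3, 2.0e6) # 70% upper confidence bound
print(f"point estimate: {lam_point:.2e}/hr, 70% bound: {lam_70:.2e}/hr")
```

The bound sits above the point estimate, which is precisely the built-in conservatism the post describes: you design against a rate you are confident is not exceeded, rather than a rate you pretend to know exactly.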

Are separate taps making your plant LESS safe?


Instrumentation and control engineers have been taught, no, conditioned through repetition, to design separate taps for each instrument that is associated with a safety instrumented function. The core idea is to minimize the probability of a common cause failure that causes multiple, otherwise independent portions of a safety instrumented function to fail at the same time from a single stressor. It makes sense. So much so that most engineers, myself included, haven’t given it a second thought. But is it really safer?

Today (10 Nov 2013) I am attending the ADIPEC conference in Abu Dhabi and have just come from a session related to managing risks in sour gas operations where I presented a paper on use of scenario coverage mapping in the design of H2S detection systems in sulfur recovery units. Another speaker in the session was Alfred Kruijer, a Principal Technology Engineer with Shell. His paper, entitled, “Leak Path Reduction in High-Sour Plant Design”, caused me to rethink the idea of separate taps.

Summarizing Alfred’s recommendations, which are given from the point of view of a mechanical engineer who is trying to prevent leaks: plants are safer when there are fewer “joints” in the pressure-containing equipment. He presented statistics he had gathered indicating that >93% of leaks were not the result of erosion, corrosion, or other mechanisms that degrade the pressure-containing material, but of the failure of joints in pressure-containing equipment. In order to reduce leak frequency, the number of joints needs to be reduced. How do you reduce the number of joints? Well, one of the ways is to reduce the number of instrument taps that you have by combining them… Advice that is diametrically opposed to what we instrumentation and control engineers have been conditioned to believe.

So who’s right? That’s a good question that needs some further exploration. While I don’t have the answer at the moment, I do know the approach to use to solve the problem. You simply calculate the expected value of loss for both cases and apply the design with the lower expected value of loss. The expected value of loss is the consequence – put in numerical terms – multiplied by the frequency. So, you need to calculate the consequence and frequency of a leak of the instrument tap and compare that against the consequence and frequency of an incident that would occur as the result of a common cause tap plugging failure. As I said, I don’t have the numbers prepared, but my gut tells me that the leak rate of a separate tap is going to be higher than the common cause failure rate (let alone the resulting accident) of plugged taps in most relatively clean services – making our common design practice completely wrong.
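The comparison described above reduces to two multiplications. The sketch below uses placeholder numbers invented purely to show the structure of the calculation; the post is explicit that the real figures are still to be gathered:

```python
# Sketch of the expected-value-of-loss comparison for separate vs.
# combined instrument taps. Every number here is an illustrative
# placeholder, not field data.

def expected_loss(frequency_per_yr: float, consequence_usd: float) -> float:
    """Expected value of loss = frequency x consequence (in $/year)."""
    return frequency_per_yr * consequence_usd

# Design A: separate taps -> more joints -> more leak opportunities
leak_freq_extra_taps = 1.0e-3  # extra leak frequency from added joints, /yr
leak_consequence     = 5.0e6   # cost of a process-fluid leak event, $

# Design B: combined tap -> common-cause plugging can defeat the SIF
cc_plug_accident_freq = 1.0e-4  # accident freq from common-cause plugging, /yr
accident_consequence  = 2.0e7   # cost of the unmitigated accident, $

loss_separate = expected_loss(leak_freq_extra_taps, leak_consequence)
loss_combined = expected_loss(cc_plug_accident_freq, accident_consequence)
print(f"separate taps: ${loss_separate:,.0f}/yr, combined tap: ${loss_combined:,.0f}/yr")
# With these placeholder figures the combined tap comes out ahead, but
# only real joint-leak and tap-plugging data can settle the question.
```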

I promise to do some more digging into this issue with numbers. Stay tuned…

Wireless for SIS?


I recently received a request from a colleague asking for more information regarding the position of the ISA 84 committee on the use of wireless in SIS applications.  He had heard that the “rules” preventing the use of wireless in SIS had been relaxed.  This is my reply…

There has been no change in the stance of the ISA 84 committee at this time (05 Nov 2013).  Of course, the ISA committee on wireless is only putting together “Technical Reports” and is not developing material that provides normative requirements within the scope of IEC 61508, which generally defines this field of study.  Contrary to most discussion, the core standard (IEC 61508) has never disallowed wireless.  The notion that wireless is forbidden was never true; it was only generally assumed, based on most people’s gut reaction to the use of (at this point in time) unproven wireless systems in critical safety applications.  Most of the safety communication protocols that are used by equipment vendors are “medium agnostic,” meaning that it really doesn’t matter what the signal travels on, because all of the safety is in the sending mechanism and the receiving mechanism.  The sending and receiving equipment are devices with elaborate and comprehensive diagnostics.  Failures of wireless systems are virtually 100% detectable on a millisecond time frame; as such, safety is not an issue at all.
There are two real reasons why people don’t currently use wireless for safety (much).
  1. No vendor (that I am aware of) has engineered and certified a wireless solution.  General purpose wireless solutions are not designed in accordance with IEC 61508-2 and -3, and as such are not allowed in safety applications.  A safety “certified” set of equipment is not available, to my knowledge.  You can’t just put a Cisco wireless router from Best Buy in the middle of a safety loop; it would need to come from the vendor as part of a complete solution.
  2. Nuisance trips.  While failures are very detectable, they generally need to result in a vote to trip.  At this point in time, wireless failures are so frequent that the impact of nuisance shutdowns precludes the use of wireless systems in most SIS applications.
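The nuisance-trip concern can be put in rough numbers. If every detected link failure forces a trip vote, the spurious trip contribution is roughly the link failure rate times the number of links. The failure rates below are illustrative guesses, not vendor data:

```python
# Back-of-envelope nuisance-trip estimate for wireless SIS links that
# vote to trip on loss of communication. Failure rates are illustrative.

HOURS_PER_YEAR = 8760.0

def spurious_trips_per_year(link_failure_rate_per_hr: float, n_links: int) -> float:
    """1oo1 per link: every detected link failure forces a trip vote."""
    return link_failure_rate_per_hr * n_links * HOURS_PER_YEAR

wired    = spurious_trips_per_year(1.0e-7, 10)  # ten hardwired signals
wireless = spurious_trips_per_year(1.0e-5, 10)  # ten wireless links

print(f"wired:    {wired:.3f} spurious trips/yr")
print(f"wireless: {wireless:.2f} spurious trips/yr")
# With these assumed rates, the wireless plant sees nearly one nuisance
# shutdown per year from the links alone -- usually unacceptable.
```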

OSHA Review Panel Ruling on Application of API556


Many SIS practitioners are concerned about how government and regulators interpret the implementation of codes and standards that are relevant to their industries. Concrete guidance is rarely given beyond what is strictly written in the regulations (e.g., 29 CFR 1910.119) and occasional letters of interpretation. OSHA citations offer some insight, but are not reliable information, as they are merely allegations, not convictions. More substantive information in terms of rulings and judgments is harder to come by, and when it is available, it is important to know. Due to some very large fines that have recently been levied by OSHA, operating companies are beginning to fight back instead of settling to minimize their legal costs.

One such case has recently closed. The OSHA review commission has very recently released a ruling, OSHRC Docket No. 10-0637 – Secretary of Labor, Complainant, v. BP Products North America, Inc., and BP-Husky Refining, LLC, Respondents. This particular ruling should be of great interest to the SIS practitioner community because a series of the citations were related to failure to comply with the API 556 standard for safety instrumented functions on fired heaters. Overall, the original citation proposed fines of almost $3 million. After the ruling was handed down, most of the citations were vacated and the final penalty was $35,000. See the PDF copy of the ruling by clicking on the link on this page.

BP Decision and Order – 10-0637

With respect to SIS, the interesting portion of this ruling relates to the allegation that the Respondent did not comply with RAGAGEP because API 556, which was considered RAGAGEP, contained a list of recommended shutdown functions, not all of which were included in the design of the heater. In the ruling, the discussion of these items begins on page 33, in the section titled Items 28, 29, and 30, where the item numbers correspond to the citation numbers. The citation alleged that RAGAGEP had not been followed because not all of the shutdowns shown in Table 1 – Alarms and Shutdown Initiators were present in the design of the equipment that was audited. The respondent’s counsel and expert witness made the case that all of the shutdowns on the list need not be installed. They argued that the standard provides a list of functions that should be considered, but if risk analysis demonstrates that these functions are not required, they need not be installed.

The arguments and the judge’s discussion of the arguments follow on pages 33 through 37. Ultimately, the judge ruled that the OSHA interpretation – that ALL of the safeguards in API 556 must be implemented – was NOT correct, and that the respondent’s interpretation – that the items on the list must be considered and implemented (or not) based on the results of risk analysis – was correct. Based on this assessment, all of the related OSHA citations were vacated and no fines were levied for these items.

This ruling has important ramifications. First off, it clarifies that should means should, not shall. When an industry sector standard recommends that a list of shutdowns should be installed, that means they should be considered through the risk analysis process and implemented if that risk analysis agrees that they are necessary. But if the risk assessment process shows that they are not necessary, they do not need to be installed. This thought process should also be carried over to other standards that make similar recommendations, such as the API 616-619 series, which makes recommendations for SIS on rotating equipment. And unlike speculation based on words in standards, we now have an actual ruling from a judge to assist in the decision making process.


