Explanation of Mitigation Effectiveness


Since the release of the ISA 84.00.07 technical report, I’ve received a lot of enquiries regarding mitigation effectiveness.  A lot of the questions were framed in terms of “what number do I use for mitigation effectiveness?”, which shows that we, the technical report working group, did not do a good job of defining the concept.  Mitigation effectiveness is not a singular number that will allow FGS analysis to collapse down into a one-dimensional probabilistic problem.  On the contrary, the term mitigation effectiveness is used as a placeholder for a large series of events, probabilities, and consequence magnitudes.  Mitigation effectiveness is actually an entire event tree of its own that describes what happens at the plant after a loss of containment event is detected and an FGS activates.

The following text is what I am proposing to add to the ISA 84.00.07 technical report to address this situation…

Annex E – Understanding the Mitigation Effectiveness Concept

Mitigation effectiveness is a complex concept that is used as a shorthand to encapsulate a wide range of factors that define the amount of risk reduction that an FGS function can provide.  The FGS effectiveness model, shown in Figure E.1, represents mitigation effectiveness as a single branch in an event tree.  This representation can lead to the interpretation that mitigation effectiveness is a single probabilistic value that, when obtained, will allow the analysis of FGS to collapse into a simple probabilistic calculation.  This interpretation is not correct.  While there is value in presenting mitigation effectiveness as a single value in the FGS effectiveness model in order to illustrate general risk concepts, in reality mitigation effectiveness cannot be collapsed into a single value.  Instead, if modeled quantitatively, mitigation effectiveness is a large collection of event tree branches that describe the range of mitigation actions that are possible upon detection of a fire or gas event, the probability of success of each of these actions, and the amount of risk reduction that is provided under each scenario.

Figure E.1 – FGS Effectiveness Model

In order to better illustrate the concept of mitigation effectiveness, consider an example FGS function.  The example process is a natural gas compressor station consisting of an enclosed compressor building containing a single compressor.  The compressor station is equipped with optical fire detectors that will, upon detection of a fire, activate a chemical fire suppressant system that is designed to extinguish the fire.  The compressor station is controlled and maintained by two staff members who are primarily located in a control room adjacent to the compressor building.  The layout of the facility is shown in Figure E.2.

Figure E.2 – Example Compressor Station Layout

For the case of a small incipient seal fire, the FGS effectiveness model would be populated with the frequency of the seal fire as the value for the loss of containment.  The detector coverage would be quantified using the scenario coverage for fires, as calculated in the detector coverage assessment, and the FGS safety availability would be calculated based on the average probability of failure on demand of the function, including sensors, logic solver, and dry chemical system.  For purposes of this example, assume that the achieved coverage is 80% and the achieved safety availability is 90% (as shown in Figure E.1).  This leaves only mitigation effectiveness undefined.
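
To make the branch arithmetic concrete, the following minimal sketch multiplies through the example values.  Note that the seal fire frequency is a hypothetical placeholder; only the 80% coverage and 90% safety availability figures come from the example.

```python
# Minimal sketch of the branch math in the FGS effectiveness model
# (Figure E.1). The loss of containment frequency is a hypothetical
# placeholder; the coverage and availability come from the example.

loc_frequency = 1.0e-3   # hypothetical incipient seal fire frequency, per year
coverage = 0.80          # scenario (detector) coverage from the example
availability = 0.90      # FGS safety availability from the example

# Frequency at which the fire is detected AND the FGS successfully activates.
# Mitigation effectiveness then apportions this frequency across the range
# of mitigated outcomes (see Figure E.3).
f_fgs_activates = loc_frequency * coverage * availability

# Frequency at which the event proceeds unmitigated, because the fire was
# outside detector coverage or the FGS failed on demand.
f_unmitigated = loc_frequency * (1 - coverage * availability)

print(f"FGS activates:    {f_fgs_activates:.2e} /yr")  # 7.20e-04 /yr
print(f"Unmitigated fire: {f_unmitigated:.2e} /yr")    # 2.80e-04 /yr
```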

The effectiveness of activation of the FGS is not a simple probability that the dry chemical system puts the fire out.  Instead, it is a complex combination of mechanical and human interactions.  The amount of mitigation that is achieved will depend on factors including the following:

  1. What is the probability that the dry chemical system will extinguish the fire?  This probability is a function of the size of the fire that occurs and other contributing factors.  Failure of the dry chemical system to extinguish the fire could be caused by:
    1. A fire size in excess of the design basis
    2. Excessive HVAC action removing the chemical agent too rapidly
    3. Doors left open preventing the chemical agent from properly accumulating
    4. Other factors
  2. If the automatic fire extinguishment system fails, will an operations staff member manually extinguish the fire with handheld equipment?
  3. If the automatic fire extinguishment system fails and operations staff manually attempts to control the fire, will they be injured during the process?
  4. If the automatic fire extinguishment system does effectively operate, will operations staff still be injured as the result of entering the room prior to ventilation of extinguishing chemicals and combustion by-products?

This complex series of events, which is represented by a single mitigation effectiveness value in the FGS effectiveness model, could be represented by the following event tree to more accurately portray the depth and complexity of the mitigation effectiveness concept.

Figure E.3 – Event Tree Representing Mitigation Effectiveness

When selecting performance targets, the way that mitigation effectiveness is addressed depends on whether fully quantitative or semi-quantitative methods are employed.  In the fully quantitative approach, calculation of risk is done using an event tree to quantify all of the potential outcomes of a loss of containment accident.  In order to address mitigation effectiveness, all of the factors upon which the potential consequences rely are explicitly included in the event tree.  This requires that the type of information defining mitigation effectiveness, as shown in Figure E.3, be included in the overall FGS effectiveness model shown in Figure E.1.  Note that the event tree shown in Figure E.3 is only a simplified example.
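
To illustrate what that inclusion looks like in practice, the following minimal sketch enumerates and multiplies through mitigation effectiveness branches of the kind described above.  Every branch probability is a hypothetical placeholder chosen only to show the structure of the calculation.

```python
# Minimal sketch of quantifying mitigation effectiveness branches for the
# compressor station example. All probabilities below are hypothetical
# placeholders; the structure of the tree, not the numbers, is the point.

f_fgs_activates = 7.2e-4  # /yr, from the coverage/availability example above

p_auto = 0.90     # hypothetical: dry chemical system extinguishes the fire
p_entry = 0.05    # hypothetical: staff injured entering before ventilation
p_attempt = 0.70  # hypothetical: staff attempt manual extinguishment
p_control = 0.50  # hypothetical: handheld equipment controls the fire
p_hurt = 0.10     # hypothetical: staff injured during the manual attempt

# Each tuple is (outcome label, conditional probability of that full path
# through the event tree, given that the FGS has activated).
paths = [
    ("auto extinguished, no injury",     p_auto * (1 - p_entry)),
    ("auto extinguished, entry injury",  p_auto * p_entry),
    ("auto failed, no manual attempt",   (1 - p_auto) * (1 - p_attempt)),
    ("manual control, no injury",        (1 - p_auto) * p_attempt * p_control * (1 - p_hurt)),
    ("manual control, injury",           (1 - p_auto) * p_attempt * p_control * p_hurt),
    ("manual attempt failed, no injury", (1 - p_auto) * p_attempt * (1 - p_control) * (1 - p_hurt)),
    ("manual attempt failed, injury",    (1 - p_auto) * p_attempt * (1 - p_control) * p_hurt),
]

assert abs(sum(p for _, p in paths) - 1.0) < 1e-9  # branches must sum to one

for label, p in paths:
    print(f"{label:34s} {f_fgs_activates * p:.2e} /yr")
```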

In semi-quantitative approaches, the mitigation effectiveness is considered during the calibration of the charts, tables, and numerical criteria that make up the overall procedure.  No additional explicit consideration of mitigation effectiveness is performed.

FGS Functions that can be Treated as SIF


As I am currently preparing material for a revision of the ISA TR84.00.07 technical report, I will try to share with my readers some of the issues that users of the technical report have requested be included in an update to the TR.

When I last attended an IEC 60079 committee meeting, the members of the committee expressed some concern about the technical report, specifically related to the situation where FGS functions should be treated directly as SIF, without any consideration of detector coverage and mitigation effectiveness.  Upon a short amount of consideration, I was able to develop a large number of situations where the analysis of an FGS function collapses down into the simple assessment of a preventive SIF.  In order to highlight that some FGS functions do not require selection of coverage targets or consideration of mitigation effectiveness during their design lifecycle (making them simple preventive SIF that only require design as per IEC 61511), I prepared an example to illustrate this concept.  My expectation is that the below example will be included in Annex C of the next version of ISA TR84.00.07 to illustrate situations where the methods in the technical report are not required, and direct analysis as per IEC 61511 is the appropriate approach.

The example is as follows:

Some applications of an FGS should be treated identically to a safety instrumented function, in accordance with IEC 61511.  This type of application occurs when the detector coverage and mitigation effectiveness of the FGS function are 100%.  If the only risk attribute of the FGS effectiveness that is not 100% is the safety availability, then the FGS function shall be treated as a preventive SIF.

Consider the example of a valve shelter house.  Process facilities that are exposed to extreme environmental conditions, such as arctic oil and gas production, require the use of shelters to protect process equipment.  One such application is the use of a valve shelter house to protect critical valves from low temperatures and other environmental stressors.  A hazard posed by the use of such shelter houses is that they prevent the dissipation of fugitive emissions from valve packing, potentially allowing dangerous levels of toxic compounds such as hydrogen sulfide to accumulate in the shelter.  If operations personnel enter the shelter house while high concentrations of toxins are present, they may be harmed.

In order to protect against this hazard, some operating companies employ an FGS function that will lock the shelter door and activate a visible alarm at the door upon detection of a high concentration of toxin inside the shelter.  In this application, all components of the FGS effectiveness other than the safety availability are 100%, and as such, the FGS function should be treated as a preventive SIF, which does not require consideration of detector coverage or complex risk analysis methods that consider mitigation effectiveness.  The detector coverage in this example is 100%: an H2S detector located inside the shelter house will have 100% coverage, as any leak from the valve will accumulate in the shelter house, allowing detection in virtually any installed location.  The mitigation effectiveness in this case is also 100%.  The means by which personnel would be harmed is opening the shelter door and subsequently inhaling hydrogen sulfide.  If the FGS function performs, it will prevent personnel from opening the door, completely preventing any consequence from occurring upon successful activation of the FGS function.

Given that the only FGS effectiveness attribute that is not 100% is safety availability, selection of performance targets can be simplified to common approaches for SIL selection, as described in IEC 61511 part 3.
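
A quick sketch shows why the collapse works, assuming the simple product form of the FGS effectiveness model; the PFDavg value below is hypothetical.

```python
# Minimal sketch: with 100% detector coverage and 100% mitigation
# effectiveness, the product-form FGS effectiveness model reduces to the
# safety availability alone, i.e. an ordinary preventive SIF problem.

def fgs_rrf(coverage, availability, mitigation_effectiveness):
    """Risk reduction factor from the simplified product model."""
    effectiveness = coverage * availability * mitigation_effectiveness
    return 1.0 / (1.0 - effectiveness)

# Valve shelter house example: coverage = 1.0 and mitigation = 1.0, so a
# hypothetical PFDavg of 0.01 (availability = 0.99) gives exactly
# 1 / PFDavg, the familiar preventive SIF relationship.
print(f"{fgs_rrf(1.0, 0.99, 1.0):.1f}")  # 100.0
```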

Accuracy of SIL calculations


During the recent ISA 84 committee meetings related to ISA TR84.00.02, which discusses SIL verification calculations, I was made aware of an effort to attempt to quantify the amount of error in SIL calculations.  The objective of this effort was to determine how much of a margin of error should be placed in the acceptance of a SIL verification calculation.  For instance, if a SIL 2 function is desired and the calculation shows a risk reduction factor of 102 was achieved, is that good enough?  The theory being proposed is that you should establish a limit on what RRF value is acceptable based on the amount of error that is present.  So, for instance, if you determine that your SIL verification calculation has an error of +/- 5, then a calculated RRF of 102 is really an RRF of between 97 and 107; since 97 does not achieve the SIL 2 target, you should modify the design until the full range, including the worst-case error, is within the SIL band.
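
In code form, the proposed acceptance check looks something like the sketch below.  The SIL 2 band is the standard RRF range of 100 to 1,000; the +/- 5 error is the value from the example.

```python
# Sketch of the proposed error-band acceptance check: a calculated RRF
# passes only if the entire error band, worst case included, stays
# inside the target SIL band.

rrf_calc = 102                # calculated risk reduction factor
error = 5                     # assumed calculation error, +/-
sil2_lo, sil2_hi = 100, 1000  # SIL 2 band expressed as RRF

passes = (rrf_calc - error) >= sil2_lo and (rrf_calc + error) <= sil2_hi
print(passes)  # False: 102 - 5 = 97 falls below 100, so redesign required
```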

Sounds reasonable right?

Not to me.

We’ve already larded up the SIL verification process with so many safety factors that adding another one here is going to cross from very conservative over to comical.  Additionally, this approach violates the spirit and philosophy of how we have performed SIL verification calculations since the advent of IEC 61508.  Engineers, in general, are taught to perform rigorous calculations to obtain precise numbers.  While this works well for things that can be known precisely, such as temperatures, pressures, and flow rates, it is not realistic for risk.  As a risk analyst, you must have a different and more humble approach.  When performing a risk calculation, you give up on the concept of knowing something precisely and instead set boundaries with a degree of confidence.  As a risk analyst, you don’t say “I KNOW that the frequency of an accident is precisely 1.53E-3 per year”; instead, you say “I am CONFIDENT that the frequency is less than 1.53E-3 per year”.  A subtle, but very important, distinction.  In one case you are claiming precision that risk analysis can never really have; in the other you are setting a boundary that you are confident will not be violated.

SIL verification calculations, since their inception, have used this approach of setting a confidence boundary.  In IEC 61508 (and the current version of IEC 61511) there are several references to a 70% single-sided confidence limit when determining failure rates.  When using this approach, you are essentially saying that, for an instrument, you are confident (to the degree of 70%) that the failure rate is below a certain number.  Again, this is different from claiming that you know exactly what the failure rate is.  It is this 70% confidence limit that is now, and always has been, the “margin of safety” employed to ensure that SIS designs are conservative and account for uncertainty in the numbers.  Adding more uncertainty analysis is unnecessary and counter-productive.
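
For the curious, here is a minimal sketch of what that 70% single-sided limit looks like in practice, using the standard chi-squared bound for a time-terminated reliability observation.  The failure count and device-hours below are hypothetical.

```python
# Minimal sketch of a 70% single-sided upper confidence limit on a failure
# rate, using the standard chi-squared bound for time-terminated data:
#   lambda_UCL = chi2.ppf(C, 2*(r + 1)) / (2 * T)
# The failure count and cumulative hours are hypothetical placeholders.

from scipy.stats import chi2

r = 2        # hypothetical: dangerous failures observed in the field data
T = 5.0e6    # hypothetical: cumulative operating hours across the fleet
C = 0.70     # single-sided confidence level cited in IEC 61508

lam_point = r / T                             # "best estimate" failure rate
lam_ucl = chi2.ppf(C, 2 * (r + 1)) / (2 * T)  # 70% upper confidence limit

print(f"point estimate: {lam_point:.2e} /hr")  # 4.00e-07 /hr
print(f"70% UCL:        {lam_ucl:.2e} /hr")    # ~7.2e-07 /hr, i.e. conservative
```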

Are separate taps making your plant LESS safe?


Instrumentation and control engineers have been taught, no, conditioned through repetition, to design separate taps for each instrument that is associated with a safety instrumented function. The core idea is to minimize the probability of a common cause failure, in which a single stressor causes multiple, otherwise independent portions of a safety instrumented function to fail at the same time. It makes sense. So much so that most engineers, myself included, haven’t given it a second thought. But is it really safer?

Today (10 Nov 2013) I am attending the ADIPEC conference in Abu Dhabi and have just come from a session related to managing risks in sour gas operations, where I presented a paper on the use of scenario coverage mapping in the design of H2S detection systems in sulfur recovery units. Another speaker in the session was Alfred Kruijer, a Principal Technology Engineer with Shell. His paper, entitled “Leak Path Reduction in High-Sour Plant Design”, caused me to rethink the idea of separate taps.

Summarizing Alfred’s recommendations, which are given from the point of view of a mechanical engineer who is trying to prevent leaks: plants are safer when there are fewer “joints” in the pressure-containing equipment. He presented statistics he had gathered indicating that >93% of leaks were not the result of erosion, corrosion, or other mechanisms that degrade the pressure-containing material, but of the failure of joints in pressure-containing equipment. In order to reduce leak frequency, the number of joints needs to be reduced. How do you reduce the number of joints? Well, one of the ways is to reduce the number of instrument taps that you have by combining them… advice that is diametrically opposed to what we instrumentation and control engineers have been conditioned to believe.

So who’s right? That’s a good question that needs some further exploration. While I don’t have the answer at the moment, I do know the approach to use to solve the problem: calculate the expected value of loss for both cases and apply the design with the lower expected value of loss. The expected value of loss is the consequence – put in numerical terms – multiplied by the frequency. So, you need to calculate the consequence and frequency of a leak of the instrument tap and compare that against the consequence and frequency of an incident that would occur as the result of a common cause tap plugging failure. As I said, I don’t have the numbers prepared, but my gut tells me that the leak rate of a separate tap is going to be higher than the common cause failure rate of plugged taps (let alone the rate of the resulting accident) in most relatively clean services – making our common design practice completely wrong.
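
To show the shape of that comparison, here is a minimal sketch; every number in it is a placeholder I invented for illustration, not data.

```python
# Minimal sketch of the expected-value-of-loss comparison. All numbers are
# invented placeholders; the structure of the comparison is the point.

# Design A: separate taps -> more joints, so a higher leak frequency, but
# no common cause plugging that defeats redundant instruments at once.
leak_freq = 2.0e-3        # hypothetical: incremental leaks/yr from extra joints
leak_cost = 5.0e5         # hypothetical: consequence of a tap leak, $

# Design B: shared tap -> fewer joints, but a plugged tap can defeat all
# instruments at once and allow an unmitigated process accident.
ccf_accident_freq = 1.0e-5  # hypothetical: accidents/yr from common cause plugging
accident_cost = 5.0e7       # hypothetical: consequence of that accident, $

ev_separate = leak_freq * leak_cost            # expected loss, $/yr
ev_shared = ccf_accident_freq * accident_cost  # expected loss, $/yr

print(f"separate taps: ${ev_separate:,.0f}/yr")  # $1,000/yr
print(f"shared tap:    ${ev_shared:,.0f}/yr")    # $500/yr
```

With these made-up numbers the shared tap comes out ahead, which is the direction my gut is pointing, but the real answer has to come from real leak and plugging frequency data.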

I promise to do some more digging into this issue with numbers. Stay tuned…

Wireless for SIS?


I recently received a request from a colleague asking for more information regarding the position of the ISA 84 committee on the use of wireless in SIS applications.  He had heard that the “rules” preventing the use of wireless in SIS had been relaxed.  This is my reply…

There are no changes in the stance of the ISA 84 committee at this time (05 Nov 2013).  Of course, the ISA committee on wireless is only putting together “Technical Reports”, and is not developing material that provides normative requirements in the same scope as IEC 61508, which generally defines this field of study.  Contrary to most discussion, the core standard (IEC 61508) has never disallowed wireless.  The notion that wireless is forbidden was never true; it was only generally assumed based on most people’s gut reaction to the use of (at this point in time) unproven wireless systems in critical safety applications.  Most of the safety communication protocols that are used by equipment vendors are “medium agnostic”, meaning that it really doesn’t matter what the signal travels on because all of the safety is in the sending mechanism and the receiving mechanism.  The sending and receiving equipment contain elaborate and comprehensive diagnostics.  Failures of wireless systems are virtually 100% detectable in a millisecond time frame; as such, safety is not an issue at all.
 
There are two real reasons why people don’t currently use wireless for safety (much).
  1. No vendor (that I am aware of) has engineered and certified a wireless solution.  General purpose wireless solutions are not designed in accordance with IEC 61508 parts 2 and 3, and as such are not allowed in safety applications.  A safety “certified” set of equipment is not available to my knowledge.  You can’t just put a Cisco wireless router from Best Buy in the middle of a safety loop; the equipment would need to come from the vendor as part of a complete solution.
  2. Nuisance trips.  While failures are very detectable, they generally need to result in a vote to trip.  At this point in time, wireless failures are so frequent that the impact of nuisance shutdowns precludes the use of wireless systems in most SIS applications.

OSHA Review Panel Ruling on Application of API 556


Many SIS practitioners are concerned about how government and regulators interpret the implementation of codes and standards that are relevant to their industries. Concrete guidance is rarely given beyond what is strictly written in the regulations (e.g., 29 CFR 1910.119) and occasional letters of interpretation. OSHA citations offer some insight, but are not reliable information, as they are merely allegations, not convictions. More substantive information in terms of rulings and judgments is harder to come by, and when it is available, the information is always important to know. Due to some very large fines that have recently been levied by OSHA, operating companies are beginning to fight back instead of settling to minimize their legal costs.

One such case has recently closed. The OSHA review commission has very recently released a ruling in OSHRC Docket No. 10-0637 – Secretary of Labor, Complainant, v. BP Products North America, Inc., and BP-Husky Refining, LLC, Respondents. This particular ruling should be of great interest to the SIS practitioner community because a series of the citations were related to failure to comply with the API 556 standard for safety instrumented functions on fired heaters. Overall, the original citation proposed fines of almost $3 million. After the ruling was handed down, most of the citations were vacated and the final penalty was $35,000. See the PDF copy of the ruling by clicking on the link on this page.

BP Decision and Order – 10-0637

With respect to SIS, the interesting portion of this ruling relates to the allegation that the respondent did not comply with RAGAGEP because API 556, which was considered RAGAGEP, contained a list of recommended shutdown functions, not all of which were included in the design of the heater. In the ruling, the discussion of these items begins on page 33, in the section titled Items 28, 29, and 30, where the item numbers correspond to the citation numbers. The citation alleged that RAGAGEP had not been followed because not all of the shutdowns shown in Table 1 – Alarms and Shutdown Initiators were present in the design of the equipment that was audited. The respondent’s counsel and expert witness made the case that not all of the shutdowns on the list need be installed. They argued that the standard provides a list of functions that should be considered, but if risk analysis demonstrates that these functions are not required, they need not be installed.

The arguments and the judge’s discussion of the arguments follow on pages 33 through 37. Ultimately, the judge ruled that the OSHA interpretation that ALL of the safeguards in API 556 must be implemented was NOT correct, and that the respondent’s interpretation, that the items on the list must be considered and implemented (or not) based on the results of risk analysis, was correct. Based on this assessment, these OSHA citations were vacated and no fines were levied for these items.

This ruling has important ramifications. First off, it clarifies that should means should, not shall. When an industry sector standard recommends that a list of shutdowns should be installed, that means that they should be considered through the risk analysis process and implemented if that risk analysis process agrees that they are necessary. But if that risk assessment process shows that they are not necessary, they do not need to be installed. This thought process should also carry over to other standards that make similar recommendations, such as the API 616 through 619 series, which make recommendations for SIS on rotating equipment. And unlike speculation based on words in standards, we now have an actual ruling from a judge to assist in the decision-making process.

Credit for Operator Response to SIS Failure?


Even though I have been doing SIS design for about 20 years now, I never seem to run out of new and interesting problems and questions to ponder.  Of course, this is probably a function of the fact that I am a consultant who is continually exposed to different processes and usually only gets involved when the problems are complex…

On a recent project I was faced with a dilemma of when to allow credit for testing and repair of failed SIS components.  The question, as posed, is deceptively simple.  There are slots in your standard PFD equations for test interval and repair time, so it would seem obvious that you always take credit for them.  In reality, I have determined that this is not always the case.  What you really need to consider is when and why the failure evidenced itself, and what action is being taken in response to the failure to return the process to a safe state.  While I can’t share the details of the specific project I am working on, I will provide you with an analogy where the right and wrong things to do are much more obvious.

There are numerous human interactions with the SIS that are considered when performing SIL verification calculations.  But there is a limit to the beneficial effect that can be credited to human involvement.  The requirements for performing SIL verification calculations are presented in IEC 61511 in clause 11.9 – SIF Probability of Failure.  This clause begins with sub-clause 11.9.1, which states, “The probability of failure on demand of each safety instrumented function shall be equal to, or less than, the target failure measure as specified in the safety requirements specifications.  This shall be verified by calculations.”  This is the clause that essentially says that a calculation must be done, and that calculation must achieve the specified target.  The next sub-clause lists the things that must be considered when performing the calculations, stating, “The calculated probability of failure of each safety instrumented function due to hardware failures shall take into account”, and then proceeds to give a list of attributes of the hardware design that should be considered.  What is important to note here is that the clause stresses that the probability of failure on demand is a function of random hardware failures and does not take into consideration human actions, other than human actions that cause the SIF to fail.

When performing SIL verification calculations, it is customary and proper to consider testing and maintenance activity and the beneficial effect it has on the availability of the shutdown function.  When a test is performed before a demand and that test evidences a failure which is then repaired, the probability of the SIF being operational when the actual demand comes is higher.  What is not customary is to consider the manual response to a failed SIF to be part of the SIF with respect to calculations.  What we calculate with SIL is the probability that the hardware system will operate.  What is not included is the probability that the operator will manually get the process to a safe state even if the SIF fails.
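
To see where the test interval and repair time legitimately enter, consider the simplified 1oo1 PFDavg approximation; the failure rate and repair time below are hypothetical.

```python
# Minimal sketch of the simplified 1oo1 PFDavg approximation, showing where
# test interval and repair time legitimately enter a SIL verification
# calculation. The failure rate and repair time are hypothetical.

lam_du = 2.0e-6         # hypothetical dangerous undetected failure rate, /hr
test_interval = 8760.0  # annual proof test, hours
mttr = 24.0             # hypothetical mean time to restore, hours

# Testing and repair improve PFDavg because they shorten the time a
# dangerous undetected failure can lie dormant. This is hardware-only:
# no term credits an operator stepping in after the SIF has already
# failed on a real demand.
pfd_avg = lam_du * test_interval / 2 + lam_du * mttr

print(f"PFDavg = {pfd_avg:.2e}")  # 8.81e-03, within the SIL 2 band
```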

Let me give an example that I hope will help to illustrate when human intervention can and cannot be considered when calculating the SIL.  Consider an oil separator on an oil production platform that separates oil from gas.  Let’s say that there is a high level switch in the separator that will close an inlet shutoff valve to the separator to prevent overfilling the separator.  Let’s also assume that the inlet shutoff valve has a limit switch to determine if the valve closed upon command.  If this high level shutoff is tested on an annual basis, and any failures are repaired, then this testing and repair activity are considered when performing SIL verification calculations.  The reason is that these actions increase the probability that the SIF will work when commanded.  If, on the other hand, there is a high level situation in the vessel which commands the valve to go closed, but it doesn’t, then the SIF has failed.  At this point in time, it is reasonable to expect that the operator will see the limit switch alarm on the shutoff valve and take a manual action to close a separator control valve or even call out to the field to close a manual shutoff valve.  While these actions will in fact reduce risk, and can be reasonably expected to occur – they are NOT to be considered as part of the SIF.  In this example the SIF (specifically, the hardware that comprises the SIF) has failed!  Just because there is an operator action that can have the same effect as the SIF does not mean that the SIF did not fail.

While the above statements may seem obvious for the particular situation I have presented, not all SIF are so simple and obvious.  The general rule that I would present at this time is that credit for testing and repair should only be taken if the testing which evidenced the failure resulted in a repair of the SIF, such that the SIF hardware will return the process to the safe state when required.  Manual actions that happen to have the same effect as the SIF should not be considered when calculating the achieved SIL of a SIF.
