Credit for Operator Response to SIS Failure?

Even though I have been doing SIS design for about 20 years now, I never seem to run out of new and interesting problems and questions to ponder.  Of course, this is probably a function of the fact that I am a consultant who is continually exposed to different processes and usually only gets involved when the problems are complex…

On a recent project I was faced with a dilemma of when to allow credit for testing and repair of failed SIS components.  The question, as posed, is deceptively simple.  There are slots in your standard PFD equations for test interval and repair time, so it would seem obvious that you always take credit for them.  In reality, I have determined that this is not always the case.  What you really need to consider is when and why the failure evidenced itself, and what action is being taken in response to the failure to return to a safe state.  While I can’t share the details of the specific project I am working on, I will provide you with an analogy where the right and wrong things to do are much more obvious.

There are numerous human interactions with the SIS that are considered when performing SIL verification calculations.  But there is a limit to the beneficial effect that can be credited to human involvement.  The requirements for performing SIL verification calculations are presented in IEC 61511 in clause 11.9 – SIF Probability of Failure.  This clause begins with sub-clause 11.9.1 which states, “The probability of failure on demand of each safety instrumented function shall be equal to, or less than, the target failure measure as specified in the safety requirements specifications.  This shall be verified by calculations”.  This is the clause that essentially says that a calculation must be done, and that calculation must achieve the specified target.  The next clause lists the things that must be considered when performing the calculations, and states, “The calculated probability of failure of each safety instrumented function due to hardware failures shall take into account”, and then proceeds to give a list of attributes of the hardware design that should be considered.  What is important to note here is that the clause stresses that the probability of failure on demand is a function of the “random hardware failures” and does not take into consideration human actions other than human actions that cause the SIF to fail.

When performing SIL verification calculations it is customary and proper to consider testing and maintenance activity, and the beneficial effect it has on the availability of the shutdown function.  When a test is performed before a demand, and that test reveals a failure which is then repaired, the probability of the SIF being operational when the actual demand comes is higher.  What is not customary is to consider the manual response to a failed SIF to be part of the SIF with respect to calculations.  What we calculate with SIL is the probability that the hardware system will operate.  What is not included is the probability that the operator will manually get the process to a safe state even if the SIF fails.

Let me give an example that I hope will help to illustrate when human intervention can and cannot be considered when calculating the SIL.  Consider an oil separator on an oil production platform that separates oil from gas.  Let’s say that there is a high level switch in the separator that will close an inlet shutoff valve to the separator to prevent overfilling the separator.  Let’s also assume that the inlet shutoff valve has a limit switch to determine if the valve closed upon command.  If this high level shutoff is tested on an annual basis, and any failures are repaired, then this testing and repair activity are considered when performing SIL verification calculations.  The reason is that these actions increase the probability that the SIF will work when commanded.  If, on the other hand, there is a high level situation in the vessel which commands the valve to go closed, but it doesn’t, then the SIF has failed.  At this point in time, it is reasonable to expect that the operator will see the limit switch alarm on the shutoff valve and take a manual action to close a separator control valve or even call out to the field to close a manual shutoff valve.  While these actions will in fact reduce risk, and can be reasonably expected to occur – they are NOT to be considered as part of the SIF.  In this example the SIF (specifically, the hardware that comprises the SIF) has failed!  Just because there is an operator action that can have the same effect as the SIF does not mean that the SIF did not fail.

While the above statements may seem obvious for the particular situation that I have presented above, not all SIF are so simple and obvious.  The overall general rule that I would present at this time is that credit for testing and repair should only be taken if the testing which evidenced the failure resulted in a repair of the SIF, and that the SIF hardware will return the process to the safe state when required.  Manual actions that can have the same effect as the SIF that are taken should not be considered when calculating the achieved SIL of a SIF.
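To put numbers on the separator example, here is a minimal sketch of the simplified single-device approximation, PFDavg ≈ L(DU) * T / 2, showing the credit earned by the annual proof test.  The failure rate is an invented illustrative value, not data from any project, and the function name is hypothetical.  Notice that nothing in the calculation represents the operator closing a valve by hand after the SIF has failed.

```python
HOURS_PER_YEAR = 8760.0

def pfd_avg_1oo1(lambda_du_per_hour: float, test_interval_hours: float) -> float:
    """Simplified average PFD for a single device: L(DU) * T / 2."""
    return lambda_du_per_hour * test_interval_hours / 2.0

# Hypothetical dangerous undetected failure rate for the high level shutoff.
lambda_du = 2.0e-6  # failures per hour (assumed for illustration only)

annual = pfd_avg_1oo1(lambda_du, 1 * HOURS_PER_YEAR)
five_yearly = pfd_avg_1oo1(lambda_du, 5 * HOURS_PER_YEAR)

print(f"PFDavg, tested yearly:        {annual:.2e}")
print(f"PFDavg, tested every 5 years: {five_yearly:.2e}")
```

Testing and repair change the result because they shorten the time a hidden hardware failure can persist.  A manual response after the demand does not change the hardware's probability of failure, so it has no term here.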

How Many Alarm IPL Can I Credit?

I recently received an enquiry from a customer.  He asked if I know of any technical resource that specifies a limitation on the number of alarms that can be considered as an IPL.  The rationale being that there comes a point where an operator is exposed to so many IPL alarms that he will not be able to properly respond to them.  This was quite a thought provoking question because I had never considered the issue simply limited to IPL alarms.  As a member of the ISA 18 committee that developed the ISA 18.02 standard, which is currently in the process of being reviewed and adopted as an IEC standard, I had always considered the total number of alarms that the operator is exposed to, and not just the ones that have the additional classification of being an IPL.  As I explained, all alarms need to be responded to, and it is very likely that some non-IPL alarms will actually have a higher criticality than IPL alarms due to a short time available for response.  My reply is paraphrased below.

 

I’m not aware of any national or international standard, recommended practice, or technical report that will specifically set a limitation on only IPL alarms.  Generally speaking, limiting the consideration of alarm loading to only IPL alarms would not tell the whole story, as EVERY alarm needs to be responded to, not just the IPL alarms.  Furthermore, I suspect that many of the IPL alarms would actually get a lower priority than non-IPL alarms due to the ample time that is usually available to respond to IPL alarms, whereas other non-IPL alarms may have very short time frames available in which to perform a mitigative action.

With respect to metrics for alarm performance in general (not just IPL alarms) an operator can be considered ‘overloaded’ if they are receiving an alarm more than once every 5-10 minutes.  In general 5-10 minutes is what is required to respond to an alarm, so if alarms are being activated at a rate that is faster than the operator’s ability to respond, the operator will not be finished with the first alarm response before the second alarm comes in, and so on.  This then leads to an overall out of control situation.  Metrics for operator loading can be found in EEMUA 191 and ISA 18.02 (which is on its way to becoming IEC 62682).

Another “design phase” problem is that there is really no way of knowing what the alarm annunciation rate will be ahead of time.  You can guess, based on the number of configured alarms, but the number and priority distribution of configured alarms almost never is consistent with actual annunciations from a running plant.  What is required, during operation, is the tracking of alarm activations and comparison against metrics to ensure that the alarm system is behaving as designed, and that the operator can handle the “real” load that is being presented to him.
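As a rough illustration of that tracking, the sketch below computes an average activation rate from a list of alarm timestamps and compares it with a threshold of one alarm per 10 minutes.  The threshold is simply my reading of the 5-10 minute response time mentioned above, not a figure quoted from EEMUA 191 or ISA 18.02, and the log and function name are invented for the example.

```python
from datetime import datetime, timedelta

def alarms_per_10_min(timestamps: list[datetime]) -> float:
    """Average number of alarm activations per 10-minute period."""
    if len(timestamps) < 2:
        return 0.0
    span = max(timestamps) - min(timestamps)
    periods = span / timedelta(minutes=10)  # timedelta / timedelta gives a float
    return len(timestamps) / periods if periods > 0 else float("inf")

# Hypothetical log: one activation every 4 minutes over two hours.
start = datetime(2024, 1, 1, 8, 0)
log = [start + timedelta(minutes=4 * i) for i in range(31)]

rate = alarms_per_10_min(log)
overloaded = rate > 1.0  # assumed threshold: more than one alarm per 10 minutes
print(f"{rate:.2f} alarms per 10 min, overloaded: {overloaded}")
```

In practice the same comparison would be run separately for normal operation and for upsets, since the two loads can differ greatly.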

As far as I’m concerned whether an alarm is IPL or not is not really relevant with respect to the design of the alarm system and decision making with respect to operator loading.  What is of most importance is that the operator can respond to the actual alarm rate that he is presented with, during normal operation and during upsets.  Establishing this is a matter of measuring alarm activations and comparing against metrics in EEMUA 191 and ISA 18.02.

I think I might have too many fire and gas detectors…

In my consulting I often hear engineers say, “I think I have too many detectors”.  They use their prescriptive rules, other rules of thumb, and basic engineering judgment to perform an initial layout and when the design is complete they look at the thousand or so detectors they’ve placed and determine that it just doesn’t feel right.

Unfortunately, until recently, and in many cases still today, design practices have not given these engineers any objective criterion with which to reduce the detector count, or even a rational yardstick to determine whether or not their count of detectors is reasonable given the type of facility that they are operating.

That has all changed.  Kenexis has been assisting our customers with performance based fire and gas designs for over 5 years now, and we have built up a significant library of studies from which we can pull statistics on what is typical of a “well done” design.  We have also now developed a methodology to review a facility very rapidly to develop a coarse (+/- 30%) count of detectors that are expected to be required if the facility were designed using the latest state-of-the-art performance based fire and gas detection techniques.  Our process uses shortcut risk analysis methods that consider each major equipment item in terms of type, operating temperature, operating process, and composition of material contained in the equipment item, and determine whether or not we would expect that equipment item to require protection from an FGS.  We then apply a scaling factor (based on past experience) that ratios the number of expected detectors to the number of equipment items that are expected to require coverage by FGS.
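The arithmetic behind such an estimate is simple, and a sketch may help show its shape.  Everything below is hypothetical: the equipment list, the screening results, and the scaling factor of 4 detectors per covered item are invented for illustration and are not Kenexis statistics.

```python
from dataclasses import dataclass

@dataclass
class EquipmentItem:
    tag: str
    needs_fgs: bool  # outcome of the shortcut risk screening (assumed input)

def estimate_detectors(items, detectors_per_item, tolerance=0.30):
    """Return (low, expected, high) detector counts for a coarse estimate."""
    covered = sum(1 for item in items if item.needs_fgs)
    expected = covered * detectors_per_item
    return expected * (1 - tolerance), expected, expected * (1 + tolerance)

# Hypothetical screening results for three equipment items.
items = [
    EquipmentItem("V-101 separator", True),
    EquipmentItem("P-201 crude pump", True),
    EquipmentItem("T-301 water tank", False),
]

low, expected, high = estimate_detectors(items, detectors_per_item=4.0)
print(f"Expected detectors: {expected:.0f} (range {low:.1f} to {high:.1f})")
```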

Using this method we can quickly estimate the expected number of detectors of each type for a facility of almost any size, usually for less than $10,000 USD.  So whether you need an estimated number of detectors for FEED stage cost estimation purposes, a review of the design of an existing facility, or a sanity check of a proposed design of a new facility, we can help in a quick and efficient manner.  Let us know how we can help you.

Applicability of NFPA 72 to Process Areas

I received the following questions about design of fire and gas detection and suppression systems in process industry applications from a client. After answering I figured that I would share the information, as these questions are quite frequently asked.

(1)    Is the NFPA 72 a fire standard/guideline applicable to “only buildings”? Does any certifying body check for compliance with NFPA-72 in process areas (or for gas detection)?

The NFPA 72 standard, or “The National Fire Alarm Code”, is a general purpose standard that is primarily written for fire alarm equipment in occupied buildings.  It is really up to “the authority having jurisdiction” whether or not it is applicable to a certain facility.  In the US, that “authority” is the local Fire Marshal.  The Fire Marshal is a person that is associated with local (city-level, possibly state-level) fire first responders.  Fire Marshals typically randomly audit facilities in their jurisdiction to determine whether or not they are in compliance with standards that have been incorporated by reference by local law – and the National Fire Alarm Code is typically one of those.  For instance, a Fire Marshal recently stopped by the Kenexis building and checked our fire extinguishers.  In my experience, I have never seen a Fire Marshal apply the requirements of NFPA 72 to anything other than an occupied building.  Open process areas do not fall into the category of “occupied buildings” and I am not aware of any instance where anyone was cited by any authority for not following NFPA 72 in open process areas that are outdoors.  On the other hand, I am aware of Fire Marshals that check for NFPA compliance in process areas that are located inside a building.  Even in these cases, only the fire detection system is in the scope of audit, and not gas detection, which is not covered by the NFPA 72 standard.

The segregation between what is indoors and what is outdoors is so important that it is not uncommon for an offshore platform or FPSO to have two separate FGS systems, one for the occupied buildings that is NFPA 72 compliant and a separate one for the open process areas that is NOT NFPA 72 compliant.

(2)     Does a fire alarm HMI panel need to comply with NFPA-72 and be certified by FM or UL listed?

If the specific alarms in question indicate the presence of a fire in an occupied building, then yes.  These alarms would need to adhere to the requirements of the “Notification Appliances” sections of the NFPA 72 standard.  Alarms that do not fit the description of indicating the presence of a fire in an occupied building do not.

(3)   Which NFPA standards set requirements for how deluge is triggered? (i.e., automatic from the  local fire alarm panel? Independent deluge fire alarm panel? By manual activation? By mechanical breakage (sprinkler or fusible plug) activation?)

NFPA 16 – Standard on Deluge Foam-Water Sprinkler and Foam-Water Spray Systems provides information regarding the implementation of these systems once it has been determined that they are required to be installed, but does not provide much detail on risk analysis for the determination of their requirement.  NFPA standards are also available that set requirements for the use of sprinkler systems in occupied buildings, most of these being automatically triggered by frangible bulbs activated by high temperatures.  For process plants, especially open area process plants, there are no prescriptive requirements from NFPA with regards to when they are required and whether they should be manually activated, or automatically activated.  These decisions are left up to the individual operating companies and are either prescriptively set in a company’s FGS Philosophy documents, or determined on a case-by-case basis using risk analysis techniques.

The IEC 61511 Standard Doesn’t Understand What a Specification Is…

Due to a set of recent project work, I have again been looking into best practices for safety requirements specifications. I have blogged about this topic in the past, along with giving presentations on the subject, but the state of SRS development is getting worse, with more and more people adopting a “safety case” style of SRS development that results in wasted time, poor designs, significant rework, and more difficulty in achieving tolerable risk. I think that the situation has gotten to the point where I will have to write a book explaining how SRS should be written, complete with prominent examples of how they should not be written.

I believe that the standards committee bears much of the blame, as the SRS section of the standard places requirements for SRS documentation that are clearly not specifications and should never be placed in a specification document. In order to understand the problem, I think that we need to go to the root source – a dictionary – and understand exactly what a specification is.

Specification

The definition above comes from Dictionary.com.

The definition clearly states that specifications are detailed descriptions of a proposed object or system. Specifications define the required attributes of a system that does not yet exist, and form the basis for how that system will be built.

Most of us who have work experience writing specifications (as opposed to academics whose first exposure to specifications was reading clause 10 of IEC 61511) know that specifications ARE attributes of a proposed piece of equipment and ARE NOT operating instructions for that piece of equipment, maintenance instructions for that piece of equipment, or data that was used to size and select that piece of equipment.

Given our re-affirmed knowledge of what a specification is, let’s go back through IEC 61511, clause 10.3 and see which items in the list are truly specifications, and which are not.

1. a description of all SIF – Specification, explains how the system is to operate
2. requirements to identify and take account of common cause failures – NOT a specification, identification is part of the engineering process, this item is essentially documenting information required for auditing
3. definition of combinations of safe states that are dangerous – NOT a specification, result of hazards analysis task
4. assumed sources of demand – NOT a specification, documentation of source data utilized for “sizing” calculation
5. proof testing interval – NOT a specification, test intervals are operation/maintenance instructions
6. response time requirements – Specification
7. SIL and mode of operation – NOT a specification, detail of analysis used to develop the specification
8. SIS measurements and trip points – Specification
9. Description of outputs – Specification
10. Requirements for Manual Shutdown – Listing a manual shutdown IS a specification, but requiring documentation to explain why a manual shutdown has been implemented or not is NOT a specification
11. De-energize to trip or energize to trip – Specification
12. Requirements for resetting – Specification
13. Max allowable spurious trip rate – Specification (but usually not required for safety)
14. failure modes and desired response – Desired response to failure is a specification, but a listing of failure modes is NOT
15. Procedures for startup/re-startup – NOT! This clause has “procedures” right in its description. Procedures are not specifications!
16. Interfaces – Specification
17. description of modes of operation – NOT a specification, this is an operational description unless different behavior is expected in different modes
18. Application software requirements – Specification
19. Bypasses – Specification
20. Actions to achieve safe state in presence of fault – usually NOT a specification. If the action is automatic and accomplished by the SIS, then it is a specification, otherwise it is an operating instruction.
21. Mean Time to Repair – NOT a specification, operating/maintenance instruction
22. Identification of dangerous output combinations – NOT a specification, result of PHA task
23. extremes of environmental conditions – Specification, but not a SAFETY specification (that’s an entirely different topic)
24. Identification of abnormal modes of operation – NOT a specification, a process description. But, if different functionality is required in different states, it must be specified.
25. Survivability requirements – Specification

As you can see, much of what is “required” in the SRS section of the standard is in fact not specifications, but operating instructions, maintenance instructions, documentation of risk analysis tasks, and documentation of auditing tasks. While there is great value in having documentation of all of these topics, a single linear document called “the safety requirements specification” is not the place for it.

More on this topic in the future – possibly to the degree of an entire book.

Partial Stroke Test Coverage Assignment Recommendations

During the process of SIL verification calculations, analysts are often required to assess the effectiveness of performing partial stroke testing.  Partial stroke testing is a type of diagnostic.  Diagnostics have a beneficial effect on the failure probability of a component because if a failure occurs, it can be rapidly detected and repaired, leaving the component in an unavailable state for a much shorter period of time.

The verification of the achievement of specified Safety Integrity Level (SIL) targets requires calculation of the achieved Average Probability of Failure on Demand (PFDavg) of each Safety Instrumented Function (SIF), in accordance with IEC 61511 (Clause 11.9.2):

11.9.2   The calculated probability of failure of each safety instrumented function due to hardware failures shall take into account

Inter alia

b) the estimated rate of failure of each subsystem, due to random hardware faults, in any modes which would cause a dangerous failure of the SIS but which are detected by diagnostics tests;

c) the estimated rate of failure of each subsystem, due to random hardware faults, in any modes which would cause a dangerous failure of the SIS which are undetected by diagnostics tests;

Clearly, the above requires the consideration of diagnostic testing in the calculation of PFDavg.

In many cases, end users of SIS elect to employ partial stroke testing (PST) as a means for diagnosing faults in valve systems (potentially including actuators, positioners, and solenoid valves).  There are myriad ways of accomplishing PST, but ultimately, the valve is caused to move a small portion of its total travel, and then measurements are made to determine whether or not the movement has occurred.

In the performance of PFDavg calculations, the beneficial effect of partial stroke testing is considered mathematically by dividing the dangerous failure rate into a component that is detectable, and a component that is not detectable.  The dangerous detected portion is calculated by multiplying the dangerous failure rate by the diagnostic coverage of the partial stroke test.  Similarly, the dangerous undetected portion is calculated by multiplying the dangerous failure rate by one minus the diagnostic coverage.
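The split is easy to sketch.  In the example below the valve failure rate is an invented number and the function name is hypothetical; the 70% coverage is used only as a sample input.

```python
def split_dangerous_rate(lambda_d: float, coverage: float) -> tuple[float, float]:
    """Split a dangerous failure rate into (detected, undetected) parts."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    lambda_dd = lambda_d * coverage          # detected by the partial stroke test
    lambda_du = lambda_d * (1.0 - coverage)  # left for the full proof test
    return lambda_dd, lambda_du

lambda_d = 4.0e-6  # hypothetical valve dangerous failure rate, per hour
lambda_dd, lambda_du = split_dangerous_rate(lambda_d, coverage=0.70)

print(f"detected by PST:   {lambda_dd:.2e} per hour")
print(f"undetected by PST: {lambda_du:.2e} per hour")
```

The two parts always add back up to the original dangerous failure rate, which is a useful sanity check on any spreadsheet that performs this step.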

While the dangerous failure rate of valves can be determined from a standard source of failure rate data, such as the Kenexis Failure Rate Database, the diagnostic coverage of the partial stroke test is not so easy to ascertain.  Since valve performance is strongly impacted by the environment in which the valve is placed, one cannot assume that the diagnostic coverage is solely an attribute of the valve which can be simply listed in a data book or published by the vendor.  Due to the difficulty in obtaining partial stroke test diagnostic coverages, rules of thumb are typically applied that are based on analysis and experience.

When performing PFDavg calculations on an actual project, the following table can be considered as typical values to be used for the selection of an appropriate PST diagnostic coverage.  These numbers can be used as the first pass, unless a more detailed analysis is desired by the asset owner or the results from using the numbers in the table below are not satisfactory.

Situation | PST Diagnostic Coverage
Typical Service | 70%
Severe Service – e.g., depositing, coating, corrosive, flashing, high differential pressure | 60%
Very Clean Service – e.g., ambient temperature utility purchased natural gas | 80%

 

In some cases, a more detailed analysis of the PST coverage is required.  In these cases, the analyst will be required to perform a failure modes, effects, and diagnostics analysis (FMEDA).  Obviously, this can only be done in situations where substantial and detailed failure data which is broken down by individual failure modes is available.  In these cases, the analyst can use the Kenexis Failure Modes and Effects Analysis Worksheet Template (download from http://www.kenexis.com/resources/).  Using the template, all of the failure modes from the data source are entered.  Each failure mode will be assessed to determine if it is safe, dangerous, or no effect.  For the dangerous failure modes, the project engineer will assess whether or not the means of partial stroke testing under consideration will detect the failure mode.  If so, it is marked with a “1” in the detectability column.  The PST diagnostic coverage is then automatically calculated as the rate of detected dangerous failures divided by the total dangerous failure rate.
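The coverage calculation at the end of that worksheet can be sketched as follows.  The failure modes, rates, and column layout are invented for illustration and are not taken from the Kenexis template or any failure rate database.

```python
# Each row: (failure mode, rate per hour, effect, detected by PST: 1 or 0)
# All rows are hypothetical.
failure_modes = [
    ("stem seized",            1.5e-6, "dangerous", 1),
    ("actuator spring broken", 0.5e-6, "dangerous", 1),
    ("seat leakage",           1.0e-6, "dangerous", 0),
    ("spurious closure",       2.0e-6, "safe",      0),
]

def pst_coverage(rows) -> float:
    """Detected dangerous failure rate divided by total dangerous failure rate."""
    dangerous = [(rate, det) for _, rate, effect, det in rows if effect == "dangerous"]
    total = sum(rate for rate, _ in dangerous)
    detected = sum(rate * det for rate, det in dangerous)
    return detected / total if total else 0.0

coverage = pst_coverage(failure_modes)
print(f"PST diagnostic coverage: {coverage:.0%}")
```

Safe and no-effect modes are excluded from both the numerator and the denominator, so they do not dilute the coverage figure.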

                       

Forgotten Pieces of PFD Equations for SIL Verification

After teaching the ISA EC 52 course – Advanced Design and SIL Verification, I am reminded of the lack of knowledge regarding SIL verification calculations that many practitioners have.  This ignorance is often compounded by the use of quick and easy software tools that result in improper calculations because the users of the software input information that is not reflective of their actual operation (or the software doesn’t ask for the proper information because it is not capable of properly processing it).  This problem extends to the ISA TR84.00.02 technical report that provides calculations that are loaded with assumptions that are not necessarily true in practice.

In order to shed some light on the topic, let me explain ALL of the parts of a SIL verification equation for a 1oo1 voting arrangement, and compare that against what is typically modeled.  First, a listing of all of the components in verbal terms:

  1. Unreliability due to dangerous undetected failures
  2. Unavailability due to dangerous detected failures
  3. Unreliability due to common cause failures
  4. Unreliability due to never detected failures
  5. Unavailability due to on line testing

I’m not going to re-hash the difference between unreliability and unavailability, as I’ve discussed it in another blog.  Suffice it to say that different equations are used, in the form of L*T/2 for unreliability and L*MTTR for unavailability, where L is the failure rate, T is the test interval, and MTTR is the mean time to repair (or mean time of being in a dangerous state).

Although you see five terms listed above, if you go to the ISA 84 TR 84.00.02 technical report, only one term is shown – L(DU) * T / 2.  This term is item 1 in the list above, or the unreliability due to dangerous undetected failures.  Why are all of the rest of the terms ignored? Because a lot of assumptions have been made about how you will use the device in question, and these assumptions may not be correct.

The unavailability of dangerous detected failures is essentially the fraction of time when a SIS component is unavailable to perform its action because it has failed in a way that is diagnosed, but the device has not been repaired yet.  The equation for this term is L(DD) * (MTTR + TI(A)/2), where L(DD) is the dangerous detected failure rate, MTTR is the mean time to repair, and TI(A) is the “automatic” test interval – or the interval at which the diagnostic tests occur.  This term is frequently dropped from the calculation for PFD because it is not relevant if a detected failure results in a vote to trip.  If, for instance, a failure in a pressure sensor is detected by diagnostics, and this failure propagates into an immediate automatic shutdown of the plant, it is essentially immediately converted into a safe failure.  In this paradigm, the SIS component is never unavailable (as the result of a dangerous detected failure) because a shutdown of the plant immediately follows the failure.  But, if the system is configured for a diagnosed failure to result in an alarm, and the process continues to operate in the presence of the failed component, then the unavailability of the system must be included by adding in the unavailability of dangerous detected failures term.  This fact is often overlooked by practitioners.
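A small sketch shows how the configuration choice switches this term on or off.  The rates and times are invented, and the function and its flag are hypothetical names used only for this illustration.

```python
def dd_unavailability(lambda_dd: float, mttr_hours: float,
                      auto_test_interval_hours: float,
                      detected_failure_trips: bool) -> float:
    """Unavailability from dangerous detected failures: L(DD) * (MTTR + TI(A)/2)."""
    if detected_failure_trips:
        # A diagnosed failure votes to trip, so it acts as a safe failure.
        return 0.0
    return lambda_dd * (mttr_hours + auto_test_interval_hours / 2.0)

lambda_dd = 3.0e-6  # hypothetical dangerous detected rate, per hour

alarm_only = dd_unavailability(lambda_dd, 72.0, 1.0, detected_failure_trips=False)
vote_to_trip = dd_unavailability(lambda_dd, 72.0, 1.0, detected_failure_trips=True)

print(f"diagnosed failure raises an alarm: {alarm_only:.2e}")
print(f"diagnosed failure votes to trip:   {vote_to_trip:.2e}")
```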

The unreliability of common cause failures term is fairly commonly used by practitioners, but there are still some who erroneously do not include it because they feel that common cause failures are not credible in a well designed system.  History has shown that this mindset is ill-advised and dangerous.  The common cause failure unreliability is not relevant, though, to a 1oo1 voting arrangement because there is only one component.

Unreliability due to never detected failures is another term that is commonly ignored, and perhaps should not be.  If this term is ignored, there is an assumption that each proof test will identify 100% of dangerous failures.  While this should be the objective of test plans and test plan development, it is optimistic in practice.  In reality, some devices (such as level sensors that are directly inserted into vessels) cannot be tested in situ, requiring them to be removed and tested in an instrument shop.  This type of test can fail to identify vessel related or process connection failures.  Another good example is high pressure drop shutoff valves.  Testing usually occurs when the plant is offline and not facing the stress of the pressure drop.  It can happen that a valve can stroke fine while there is no pressure drop, but not be able to stroke while actually in service.  Unless you can justify 100% manual test coverage, the unreliability due to never detected failures should be included.  It is L(DN) * Life / 2, where L(DN) is the failure rate of never detected failures, and Life is either the useful life of the component or the amount of time between major overhaul or rebuild.
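To see how much this term can matter, the sketch below splits a dangerous undetected rate by an assumed proof test coverage.  Deriving L(DN) from a coverage percentage is my assumption for the example, as are the rate, the 90% coverage, and the 15 year life.

```python
HOURS_PER_YEAR = 8760.0

def pfd_with_imperfect_test(lambda_du: float, proof_coverage: float,
                            test_interval_h: float, life_h: float) -> float:
    """Sum of L * TI / 2 for revealed failures and L(DN) * Life / 2 for the rest."""
    lambda_found = lambda_du * proof_coverage       # revealed by the proof test
    lambda_dn = lambda_du * (1.0 - proof_coverage)  # never detected, L(DN)
    return lambda_found * test_interval_h / 2.0 + lambda_dn * life_h / 2.0

lambda_du = 2.0e-6  # hypothetical, per hour

perfect = pfd_with_imperfect_test(lambda_du, 1.0, HOURS_PER_YEAR, 15 * HOURS_PER_YEAR)
imperfect = pfd_with_imperfect_test(lambda_du, 0.9, HOURS_PER_YEAR, 15 * HOURS_PER_YEAR)

print(f"100% proof test coverage: {perfect:.2e}")
print(f" 90% proof test coverage: {imperfect:.2e}")
```

With these made-up inputs, the small never detected fraction contributes more than the rest of the calculation, because it accumulates over the whole life rather than over one test interval.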

Finally, the unavailability due to online testing term is often ignored because all testing of a component occurs while the plant is offline for a turnaround.  If this is not the case, and the component must be tested while the plant is online, then it is unavailable to perform its shutdown action while it is in bypass for the testing to occur.  The term is a simple unavailability, TD/TI(M), where TD is the duration of the test and TI(M) is the manual proof test interval.  It is important to note that TD is the total test duration during a manual proof test interval.  As such, if multiple on-line tests occur during a single manual proof test interval, they all need to be added up to calculate the total TD.
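As a quick illustration of adding up TD, the sketch below assumes four on-line tests of two hours each within a one year manual proof test interval.  The durations and the function name are hypothetical.

```python
def online_test_unavailability(test_durations_hours, proof_test_interval_hours):
    """TD / TI(M), where TD is the total bypass time within one TI(M)."""
    total_td = sum(test_durations_hours)
    return total_td / proof_test_interval_hours

# Hypothetical: four on-line tests, each with the component bypassed for 2 hours.
durations = [2.0, 2.0, 2.0, 2.0]
unavailability = online_test_unavailability(durations, 8760.0)

print(f"Unavailability due to on-line testing: {unavailability:.2e}")
```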

As you can see, calculation of PFD for SIL verification may include multiple terms that are commonly ignored.  Deciding whether or not these terms are relevant requires knowledge of how the system is configured, operated, and tested.  A SIL verification software tool does not fill this information in for you, and often makes assumptions about how you operate your plant that may not be true.  As a result, you need to be vigilant in your calculations to make sure that they are accurate.
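Pulling the five terms together for a 1oo1 arrangement might look like the sketch below.  It assumes the terms are simply added, which is the usual approximation when each is small, and every input value is invented.  The class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Sif1oo1:
    lambda_du: float              # dangerous undetected rate, per hour
    lambda_dd: float              # dangerous detected rate, per hour
    lambda_dn: float              # never detected rate, per hour
    proof_test_interval_h: float  # TI(M)
    auto_test_interval_h: float   # TI(A)
    mttr_h: float                 # mean time to repair
    life_h: float                 # useful life or time between overhauls
    online_test_hours: float      # total TD within one TI(M)
    detected_failure_trips: bool  # True if a diagnosed failure votes to trip

    def pfd_terms(self) -> dict:
        dd = 0.0 if self.detected_failure_trips else (
            self.lambda_dd * (self.mttr_h + self.auto_test_interval_h / 2.0))
        return {
            "dangerous undetected": self.lambda_du * self.proof_test_interval_h / 2.0,
            "dangerous detected": dd,
            "common cause": 0.0,  # not relevant to a single 1oo1 device
            "never detected": self.lambda_dn * self.life_h / 2.0,
            "on-line testing": self.online_test_hours / self.proof_test_interval_h,
        }

sif = Sif1oo1(lambda_du=1.8e-6, lambda_dd=3.0e-6, lambda_dn=0.2e-6,
              proof_test_interval_h=8760.0, auto_test_interval_h=1.0,
              mttr_h=72.0, life_h=15 * 8760.0, online_test_hours=8.0,
              detected_failure_trips=False)

terms = sif.pfd_terms()
total = sum(terms.values())
for name, value in terms.items():
    print(f"{name:22s} {value:.2e}")
print(f"{'total PFDavg':22s} {total:.2e}")
```

Listing the terms separately, rather than returning only the total, makes it obvious which assumptions about configuration and testing are driving the result.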

 
