CARA FaultTree
Jørn Vatn
http://folk.ntnu.no/jvatn/ppt/CARAFaultTree.pptx

Content

Day 1
- Review of fault tree analysis as a method (definition of the TOP event, construction, minimal cut sets, qualitative and quantitative analysis)
- Construction of fault trees in CARA FaultTree
- Entering reliability data
- Import/export of reliability data
- Minimal cut sets
- Frequency and probability of the TOP event
- Measures of reliability importance

Day 2
- Choice of analysis approaches (simulation, numerical integration, etc.)
- Settings/options
- Limitations of fault tree analysis
- Coupling of fault tree analysis and simple Markov analysis
- Coupling of fault tree and event tree analysis
- Use of Excel together with CARA FaultTree

What is a fault tree?
- A fault tree is a logic diagram that displays the relationships between a potential critical event (accident) in a system and the reasons for this event
- The reasons may be environmental conditions, human errors, normal events (events which are expected to occur during the life span of the system) and specific component failures
- A properly constructed fault tree provides a good illustration of the various combinations of failures and other events which can lead to a specified critical event
- The fault tree is easy to explain to engineers without prior experience of fault tree analysis

What is a fault tree analysis (FTA)?
- A fault tree analysis may be qualitative, quantitative or both, depending on the objectives of the analysis
- Possible results from the analysis include:
  - A list of the possible combinations of environmental factors, human errors, normal events and component failures that can result in a critical event in the system
  - The probability that the critical event will occur during a specified time interval
  - The most critical components in the system

FTA procedure
1. Definition of the problem and the boundary conditions.
2. Construction of the fault tree.
3. Identification of minimal cut sets.
4. Qualitative analysis of the fault tree.
5. Quantitative analysis of the fault tree.

1 - Definition of the Problem and the Boundary Conditions
- Definition of the critical event (the accident) to be analysed. This event is normally called the TOP event.
- Definition of the boundary conditions for the analysis, i.e., what to include (external power, sabotage, etc.)
- To help define the TOP event:
  - What: Describes what type of critical event (accident) is occurring, e.g. collision between two trains.
  - Where: Describes where the critical event occurs, e.g. on a single track section.

  - When: Describes when the critical event occurs, e.g. during normal operation.

Example system
[Figure: sketch of the example system. Labels: Servo motors, Main distributing valve, Servo valves, Water inlet, Turbine runner, IPC and PLC (two of each), Governing system, Guide vanes, Pu, Draft tube, Oil pressure system]

System description
- In order to control the frequency of the turbine runner (TR), both servo motors (SM) have to function to put the guide vanes (GV) in the correct position
- The main distributing valve (MDV) is controlled by two servo valves (SV)
- Each servo valve is in turn controlled by a programmable logic controller (PLC) via an input card (IPC)
- It is sufficient that one servo valve with its IPC and PLC is functioning for the main distributing valve to operate
- The oil pressure system (OPS) comprises both an oil tank and an oil pump

Definition of the TOP event
- What: Guide vanes (Norwegian: ledeskovler) in wrong position
- Where: At the turbine runner
- When: Under normal operation, i.e., no system out for maintenance

Fault tree symbols

Logic gates
- OR gate: The output event A occurs if any of the input events $E_i$ occurs.
- AND gate: The output event A occurs only when all the input events $E_i$ occur simultaneously.
- KooN gate (K/N): The output event A occurs if K or more of the N input events $E_i$ occur.
- Inhibit gate: The output event A occurs if both the conditional event $E_1$ and the input event $E_2$ occur.

Input events
- BASIC event: Represents a basic equipment fault or failure that requires no further development into more basic faults or failures.
- HOUSE event: Represents a condition or an event which is TRUE (ON) or FALSE (OFF).
- UNDEVELOPED event: Represents a fault event that is not examined further because information is unavailable or because its consequence is insignificant.

Description of state
- COMMENT rectangle: For supplementary information.

Transfer symbols
- TRANSFER down / TRANSFER up: The Transfer down symbol indicates that the fault tree is developed further at the occurrence of the corresponding Transfer up symbol.

2 - Construction of the Fault Tree
1. Start with the TOP event
2. Identify all fault events which are the immediate, necessary and sufficient causes that result in the TOP event (What are the (direct) reasons for...?)
3. Connect to the TOP event via a logic gate (AND or OR gate)
4. Proceed in the same manner with the gates until the basic event level is reached

Construction, Step 1

- TOP = Guide vanes in wrong position
- Direct causes:
  - Guide vanes stuck
  - Insufficient power from servo motors
- Combine with an OR gate

In CARA FaultTree
[Figure: CARA FaultTree screenshot. TOP event "Guide vanes in wrong position" (TOP) with the inputs "Insufficient power from servo motors" (Servo) and "Guide vanes stuck" (GV)]

Exercise
- Complete the analysis for the example case
- Only qualitative input
  - No data
  - No analysis
- Always ask: what are the reasons for...?

CARA FaultTree pages
Use several pages if
- The fault tree becomes large

- The same system is included in several places in the fault tree (physically the same)
Procedure
- Add a Transfer Down
- Right click | New Page
- Change the name of the new page

Naming conventions
- Two symbols with the same name represent the same component
- If two components are identical (same failure rate etc.) but are two physical components, they should have different names, e.g., PLC1 and PLC2
- When you rename a component to the name of an existing component, you are prompted to verify this. The current component then gets its data from the existing component

3 - Identification of minimal cut sets
- A fault tree provides valuable information about possible combinations of fault events which can result in a critical failure (TOP event) of the system
- A cut set in a fault tree is a set of Basic events whose (simultaneous) occurrence ensures that the TOP event occurs
- A cut set is said to be minimal if the set cannot be reduced without losing its status as a cut set

4 - Qualitative Evaluation of the Fault Tree
- A qualitative evaluation of the fault tree may be carried out on the basis of the minimal cut sets
- The importance of a cut set depends on the number of Basic events in the cut set
- The number of different Basic events in a minimal cut set is called the order of the cut set
- A cut set of order one is usually more critical than a cut set of order two or higher

4 - Qualitative Evaluation of the Fault Tree, cont.
- Another important factor is the type of Basic events in a minimal cut set
- We may rank the criticality of the various cut sets according to the following ranking of the Basic events:
1. Human error
2. Failure of active equipment
3. Failure of passive equipment
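The identification of minimal cut sets and their ranking by order can be sketched in a few lines. This is a minimal illustration, not the algorithm used by CARA FaultTree: it expands AND/OR gates recursively and then discards every cut set that contains a smaller one. The tree below is one hypothetical decomposition of the guide vane example (the event names SM1, SM2, SV1, IPC1, PLC1, Tank, Pump and so on are assumptions made for illustration, not the author's solution to the exercise).

```python
from itertools import product

# A node is a basic event name (str) or a gate tuple: ("AND" | "OR", [inputs]).


def cut_sets(node):
    """Return every cut set of `node` as a set of frozensets of basic events."""
    if isinstance(node, str):
        return {frozenset([node])}
    gate, inputs = node
    input_sets = [cut_sets(child) for child in inputs]
    if gate == "OR":
        # Any cut set of any input is a cut set of the output.
        return set().union(*input_sets)
    if gate == "AND":
        # One cut set from every input must occur at the same time.
        return {frozenset().union(*combo) for combo in product(*input_sets)}
    raise ValueError(f"unknown gate type: {gate!r}")


def minimal_cut_sets(node):
    """Keep only cut sets that do not strictly contain another cut set."""
    all_sets = cut_sets(node)
    minimal = [s for s in all_sets if not any(other < s for other in all_sets)]
    # Rank by order (number of basic events), then alphabetically.
    return sorted(minimal, key=lambda s: (len(s), sorted(s)))


# Hypothetical tree for "Guide vanes in wrong position" (assumed structure).
channel_1 = ("OR", ["SV1", "IPC1", "PLC1"])   # one control channel fails
channel_2 = ("OR", ["SV2", "IPC2", "PLC2"])
mdv_fails = ("OR", ["MDV", ("AND", [channel_1, channel_2])])
ops_fails = ("OR", ["Tank", "Pump"])
servo = ("OR", ["SM1", "SM2", mdv_fails, ops_fails])
top = ("OR", ["GV", servo])

mcs = minimal_cut_sets(top)
for cs in mcs:
    print(len(cs), sorted(cs))
```

With this assumed structure the single failures (order one) are listed first, followed by the order-two combinations of one failure in each control channel, which matches the qualitative ranking by order described above.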

5 - Quantitative Analysis of the Fault Tree
- When reliability data for each of the basic events are available, it is possible to carry out a quantitative evaluation of the fault tree
- Different system reliability measures may be of interest:
  - $Q_0(t)$: The probability that the TOP event occurs at time $t$
  - $R_0(t)$: The probability that the TOP event does not occur in $[0,t)$
  - $MTTF_0$: Mean time to first system failure
  - $F_0$: TOP event frequency
  - $I(i,t)$: Importance of component $i$ at time $t$

Input data to the FTA
- Quantitative results from a fault tree cannot be obtained unless failure data for each basic event are defined
- Various types of data are used depending on the situation:
  - Frequency
  - On demand probability
  - Test interval
  - Repairable unit
  - Non repairable unit

Input data

Category of failure data   Reliability parameters
Frequency                  $f$ = Frequency 1)
On demand probability      $q$ = Probability
Test interval              $\tau^*$ = Test interval 2), $\tau$ = Mean time to repair (MTTR) 2) and $\lambda$ = Failure rate 3)
Repairable unit            $\tau$ = Mean time to repair 2) and $\lambda$ = Failure rate 3)
Non repairable unit        $\lambda$ = Failure rate 3)

1) Expected number of occurrences per $10^6$ hours.
2) Given in hours.
3) Expected number of failures per $10^6$ hours.

Frequency
- This category is used to describe events occurring now and then, but with no duration. Thus the probability that the event is occurring at time $t$ is $q_i(t) = 0$.
- Note! If the event has a duration, it should be described as a repairable unit, where the failure rate equals the frequency of the event and the mean down time equals the duration

On demand probability
- This category is usually used to describe components which are not activated during normal operation
- The component is demanded only now and then
- The reliability data represent the probability that the component is not able to perform its function upon request
- In safety systems, the operator is often modelled by an on demand probability, for example: Operator fails to activate manual shut-down system

Test interval
- This category is used to describe components which are tested periodically with test interval $\tau$
- A failure may occur anywhere in the test interval

- The failure will, however, not be detected until the test is carried out or the component is needed
- Note that the formula used in CARA FaultTree is only valid if each component is tested independently
- For simultaneous testing, correction factors are required, i.e., failure rates should be multiplied by a correction factor that depends on N, the number of components tested simultaneously

Repairable and non repairable units
- Repairable unit: The component is repaired when a failure occurs
- Non repairable unit: It is not possible to repair the unit when a failure occurs (at least not within the analysis period)

Sharing data
- In some situations many components are considered identical, i.e., they have the same failure rate, MTTR, etc.
- We may then define event classes to specify look-up data

Export data
- Select File | Export data to export data to a text file
- The file can be imported into MS Excel

CARA FaultTree reliability database
- Since CARA FaultTree can import and export reliability data, it is possible to establish a reliability database which can easily be imported into a fault tree
- Generic reliability data should be stored as [...]
- When the fault tree is completed qualitatively, one may import all reliability data from the database, given an efficient naming convention

$Q_0(t)$ - The probability that the TOP event occurs at time $t$
- The TOP event probability $Q_0$ depends on the structure of the fault tree (minimal cut sets) and the probabilities that the various basic events occur
- In order to calculate $Q_0$ we use an approximation formula. The idea is to sum the contributions from each cut set
- Let the minimal cut sets be denoted $K_1, K_2, \ldots, K_k$, and assume that the basic events are independent
- Then the probability that minimal cut set $K_j$ occurs is given by the product of the basic event occurrence probabilities:
  $Q_j(t) = \prod_{i \in K_j} q_i(t)$

$Q_0(t)$, cont.
- Summing the contributions for each cut set gives
  $Q_0(t) \approx \sum_{j=1}^{k} Q_j(t)$
- Example [Figure: block diagram between a and b with components 1 to 5; minimal cut sets {1, 2, 3}, {4} and {5, 2}]:
  $Q_1 = q_1 q_2 q_3$, $Q_2 = q_4$, $Q_3 = q_5 q_2$, $Q_0 \approx Q_1 + Q_2 + Q_3$

$Q_0(t)$ - Upper bound
- A slightly better approximation is given by the upper bound approximation
  $Q_0(t) \le 1 - \prod_{j=1}^{k} \left(1 - Q_j(t)\right)$
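The two approximations can be compared numerically for the example with minimal cut sets {1, 2, 3}, {4} and {5, 2}. The sketch below assumes the usual form of the upper bound, 1 minus the product of (1 - Q_j), and independent basic events; the basic event probabilities are made-up values for illustration. An exact value, obtained by enumerating all component states, is included as a reference.

```python
from itertools import product
from math import prod

CUT_SETS = [{1, 2, 3}, {4}, {5, 2}]                 # from the example
q = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.01, 5: 0.1}       # hypothetical probabilities


def cut_set_probability(cut_set, q):
    """Q_j: product of the basic event probabilities in one cut set."""
    return prod(q[i] for i in cut_set)


def q0_sum(cut_sets, q):
    """Sum of the cut set contributions."""
    return sum(cut_set_probability(cs, q) for cs in cut_sets)


def q0_upper_bound(cut_sets, q):
    """Upper bound approximation: 1 - prod_j (1 - Q_j)."""
    return 1.0 - prod(1.0 - cut_set_probability(cs, q) for cs in cut_sets)


def q0_exact(cut_sets, q):
    """Exact TOP probability by enumerating all basic event states."""
    events = sorted(q)
    total = 0.0
    for state in product([0, 1], repeat=len(events)):
        failed = {e for e, x in zip(events, state) if x}
        if any(cs <= failed for cs in cut_sets):
            total += prod(q[e] if x else 1.0 - q[e]
                          for e, x in zip(events, state))
    return total


approx_sum = q0_sum(CUT_SETS, q)
upper = q0_upper_bound(CUT_SETS, q)
exact = q0_exact(CUT_SETS, q)
print(f"sum: {approx_sum:.7f}  upper bound: {upper:.7f}  exact: {exact:.7f}")
```

For small probabilities the three values are close; the enumeration is only practical for small trees, which is why approximation formulas are used in the first place.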

$F_0$ - TOP event frequency
- We will demonstrate the method for calculating the TOP event frequency, $F_0$, in the following situation for each minimal cut set:
  - One and only one basic event is of the type frequency, with occurrence rate $f$
  - The remaining basic events are of the type barrier/on demand probability, with barrier probability $q$

$F_0$ - Formulas
  $F_0 \approx \sum_{j=1}^{k} f_{k_j} \prod_{i \in K_j,\ i \ne k_j} q_i$
where
- $f_{k_j}$ is the rate of the basic event of type frequency in cut set $K_j$
- $q_i$ is the probability that basic event $i$ in cut set $K_j$ occurs

Measures of Importance
- The reliability importance of a component in a system will generally depend on the location of the component in the system and the reliability of the component
- A number of different measures are available in CARA FaultTree:
  - Vesely-Fussell's measure of reliability importance
  - Birnbaum's measure of reliability importance
  - Improvement potential
  - Criticality importance
  - Order of smallest cut set
  - Birnbaum's measure of structural importance

Vesely-Fussell
- Vesely-Fussell's measure of reliability importance for component $i$ is defined by: $I^{VF}(i|t)$ = the conditional probability that at least one minimal cut set containing input event no. $i$ is failed at time $t$, given that the system fails at time $t$
- $I^{VF}(i|t)$ is rather simple to calculate

Birnbaum's Measure
- Birnbaum's measure of reliability importance for component $i$ is defined as the partial derivative of $Q_0(t)$ with respect to $q_i(t)$: $I^{B}(i|t) = \partial Q_0(t) / \partial q_i(t)$
- It can be shown that: $I^{B}(i|t) = \Pr(\text{TOP event occurs at } t \mid q_i(t)=1) - \Pr(\text{TOP event occurs at } t \mid q_i(t)=0)$
- It can also be shown that $I^{B}(i|t)$ = the probability that component $i$ is critical

Improvement potential
- The improvement potential reliability measure for component $i$ is defined by: $I^{IP}(i|t)$ = the increase in system reliability if component $i$ is replaced with a perfect component at time $t$

Criticality Importance
- The criticality importance reliability measure for component $i$ is defined by: $I^{CR}(i|t)$ = the probability that component $i$ is critical for the system and is failed at time $t$, given that the system is failed at time $t$.
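The importance measures defined so far can be evaluated directly from a function that returns the TOP event probability. The sketch below is an illustration under stated assumptions, not CARA FaultTree's implementation: it uses the upper bound approximation for the TOP event probability, the textbook expressions I_CR = I_B * q_i / Q0 and I_VF ≈ (1 - product over cut sets containing i of (1 - Q_j)) / Q0, and made-up basic event probabilities for the cut sets {1, 2, 3}, {4} and {5, 2}.

```python
from math import prod

CUT_SETS = [{1, 2, 3}, {4}, {5, 2}]
Q = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.01, 5: 0.1}   # hypothetical probabilities


def q0(q, cut_sets=CUT_SETS):
    """TOP event probability, upper bound approximation (assumed here)."""
    return 1.0 - prod(1.0 - prod(q[i] for i in cs) for cs in cut_sets)


def with_value(q, i, value):
    """Copy of q where basic event i has probability `value`."""
    return {**q, i: value}


def birnbaum(i, q):
    """Pr(TOP | q_i = 1) - Pr(TOP | q_i = 0)."""
    return q0(with_value(q, i, 1.0)) - q0(with_value(q, i, 0.0))


def improvement_potential(i, q):
    """Reduction in Q0 if component i is replaced by a perfect one."""
    return q0(q) - q0(with_value(q, i, 0.0))


def criticality(i, q):
    """Pr(i is critical and failed | system failed) = I_B * q_i / Q0."""
    return birnbaum(i, q) * q[i] / q0(q)


def vesely_fussell(i, q):
    """Approximate Pr(a cut set containing i is failed | system failed)."""
    containing = [cs for cs in CUT_SETS if i in cs]
    failed = 1.0 - prod(1.0 - prod(q[k] for k in cs) for cs in containing)
    return failed / q0(q)


for i in sorted(Q):
    print(i,
          round(birnbaum(i, Q), 5),
          round(improvement_potential(i, Q), 5),
          round(criticality(i, Q), 5),
          round(vesely_fussell(i, Q), 5))
```

Component 4 forms a cut set of order one, so it gets by far the largest Birnbaum measure, while component 2, which appears in two cut sets, scores highest on the Vesely-Fussell measure with these example numbers.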

Order of smallest cut set
- The order of smallest cut set importance measure is defined by: $I^{O}(i)$ = the order of the smallest cut set containing component $i$

Systems with buffers
- CARA FaultTree is not designed to treat buffer systems
- Some tricks exist
- Consider the oil pressure system

Buffers, cont.
- Assume the oil pressure tank has 8 hours capacity when fully charged
- Upon a failure of the pump, the system will continue to supply oil pressure for 8 hours
- If a repair is completed within 8 hours, no system disturbance will occur
- What is the rate of failure of the oil pressure system, and what is the MTTR?

Buffers, cont.
- Assume that MTTR for the pump is 4 hours
- Assume that $\lambda = 400$ (per million hours)
- Let $p = \Pr(TTR > \text{buffer capacity}) = e^{-8/4}$ (TTR = Time To Repair)
- The oil pressure system may now be modelled by:
  - $\lambda_{System} = \lambda_{Pump} \cdot p$
  - $MTTR_{System} = MTTR_{Pump}$

UPS-Buffer
- Assume a UPS is installed as a backup in case of loss of external power
- The mean repair time of external power is $MTTR_E$
- The capacity of the UPS is $B$
- The UPS may also fail due to technical failures, and the mean repair time is $MTTR_U$
- Probability that the battery capacity is exceeded is $p = e^{-B/MTTR_E}$
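The buffer trick is easy to check with a few lines of arithmetic. The sketch below assumes exponentially distributed repair times, which is what the expression exp(-capacity/MTTR) implies; under that assumption the remaining repair time after the buffer is exhausted again has mean MTTR, which is why the MTTR is carried over unchanged. The function name is an illustrative choice; the pump figures are those given above, and the UPS figures (24 hours capacity, 10 hours mean repair time of external power) are those used in the accompanying UPS fault tree example.

```python
from math import exp


def buffered_unit(failure_rate, mttr, buffer_hours):
    """Return (p, effective failure rate, effective MTTR) for a buffered unit.

    p is the probability that a repair takes longer than the buffer lasts,
    assuming exponentially distributed repair times.
    """
    p = exp(-buffer_hours / mttr)
    return p, failure_rate * p, mttr


# Oil pressure system: pump with 400 failures per 10^6 hours, MTTR 4 h, 8 h buffer.
p_pump, lam_system, mttr_system = buffered_unit(400e-6, 4.0, 8.0)
print(f"p = {p_pump:.4f}, system failure rate = {lam_system * 1e6:.1f} per 10^6 hours")

# UPS: 24 h battery capacity, mean repair time of external power 10 h.
p_ups, _, _ = buffered_unit(1e-4, 10.0, 24.0)
print(f"Pr(battery capacity exceeded) = {p_ups:.4f}")
```

The buffer reduces the pump's contribution to the system failure rate from 400 to roughly 54 per million hours.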

Fault tree
[Figure: CARA FaultTree screenshot of the UPS example
- TOP event: No power to end user (TOP)
  - Loss of external power (LEXP): Lambda = 0.0001, MTTR = 10
  - UPS does not provide power (UPS)
    - Battery capacity of 24h exceeded (UPSBattery): Failure probability = exp(-24/10) = 0.0907, entered as Probability = 0.091
    - Technical failure of UPS (USBTechnical): Lambda = 1e-005, MTTR = 24]

TOP event figures
- Frequency of TOP event (TOP): 9.10398e-006 [occurrences per hour]
- Unavailability [Qo(t)]: 9.0849e-005

Markov Analysis

Introduction
- Markov analysis is used to model systems which have many different states

- These states range from perfect function to a total fault state
- The migration between the different states may often be described by a so-called Markov model
- The possible transitions between the states may further be described by a Markov diagram

Purpose
- Markov analysis is well suited for determining reliability characteristics of a system
- The method is especially well suited for small systems with complicated maintenance strategies
- In a Markov analysis the following topics will be of interest:
  - Estimating the average time the system is in each state. These numbers may further form a basis for economic considerations.
  - Estimating how frequently the system on average visits the various states. This information may further be used to estimate the need for spare parts and maintenance personnel.
  - Estimating the mean time until the system enters one specific state, for example a critical state.

Markov Analysis procedure
1. Make a sketch of the system
2. Define the system states
3. Draw the Markov diagram with the transition rates
4. Quantitative assessment
5. Compilation and presentation of the result from the analysis

Make a sketch of the system
- Pump system with an active pump and a spare pump in standby
[Figure: sketch with an active pump and a standby pump]

Definition of system states
- $x_1$ = state of active pump
- $x_2$ = state of standby pump
- $x_i = 1$ if component $i$ is functioning, $x_i = 0$ if component $i$ is in a fault state

System state $x_S$   $x_1$   $x_2$   Comments
2                    1       1       Both pumps functioning
1                    0       1       The active pump is in a fault state, the standby pump is functioning
0                    0       0       Both pumps in a fault state

State transitions
- For this system we have assumed that if the active pump fails, the standby pump can always be started
- Further we assume that if both pumps have failed, they will both be repaired before the system is put into service again
- The following transition rates are defined:
  - $\lambda_1$ = failure rate of the active pump
  - $\lambda_2$ = failure rate of the standby pump (while running, $\lambda_2 = 0$ in standby position)
  - $\mu_1$ = repair rate of the active pump ($1/\mu_1$ = Mean Down Time when the active pump has failed)
  - $\mu_B$ = repair rate when both pumps are in a fault state. That is, if the active pump has failed and a repair with repair rate $\mu_1$ has been started, the repair starts over with repair rate $\mu_B$ if the standby pump also fails, regardless of how much of the repair of the active pump had been completed.

Markov state space diagram
- The circles represent the system states, and the arrows represent the transition rates between the different system states
- The Markov diagram and the description of states represent the total qualitative description of the system
[Figure: Markov diagram with states 2, 1 and 0 and the transitions 2 → 1 ($\lambda_1$), 1 → 0 ($\lambda_2$), 1 → 2 ($\mu_1$) and 0 → 2 ($\mu_B$)]

Quantitative assessment
- We want to assess the following quantities:
  - The average time the system remains in the various system states
  - The visiting frequencies to each system state

Transition matrix
  $A = [a_{ij}] = \begin{bmatrix} a_{00} & a_{01} & \cdots & a_{0r} \\ a_{10} & a_{11} & \cdots & a_{1r} \\ \vdots & \vdots & \ddots & \vdots \\ a_{r0} & a_{r1} & \cdots & a_{rr} \end{bmatrix}$
- The indexing starts at 0 and runs to $r$, i.e., there are $r+1$ system states
- Each cell in the matrix has two indices, where the first (row index) represents the from state and the second (column index) represents the to state
- The cells represent transition rates from one state to another; $a_{ij}$ is thus the transition rate from state $i$ to state $j$
- The diagonal elements are a kind of dummy elements, which are filled in at the end and shall fulfil the condition that all cells in a row add up to zero

Example transition matrix (rows: from state, columns: to state, both in the order 0, 1, 2):
  $A = \begin{bmatrix} -\mu_B & 0 & \mu_B \\ \lambda_2 & -(\lambda_2+\mu_1) & \mu_1 \\ 0 & \lambda_1 & -\lambda_1 \end{bmatrix}$

State probabilities
- Let $P_i(t)$ represent the probability that the system is in state $i$ at time $t$
- Now introduce vector notation, i.e. $P(t) = [P_0(t), P_1(t), \ldots, P_r(t)]$
- From the definition of the transition matrix it may be shown that the Markov state equations are given by: $P(t) A = dP(t)/dt$
- These equations may be used to establish both the steady state probabilities and the time dependent solution

Steady state probabilities
- Let the vector $P = [P_0, P_1, \ldots, P_r]$ represent the average fraction of time the system is in the various system states in the long run
- For example, $P_0$ is the average fraction of the time the system is in state 0, and $P_1$ is the average fraction of the time the system is in state 1
- The elements of $P$ are also denoted steady state probabilities to indicate that in the stationary situation $P_i$ represents the probability that the system is in state $i$

The steady state solution
- In the long run, when the system has stabilized, we must have $dP(t)/dt = 0$, hence $P A = 0$
- The equations in this system are linearly dependent (each row of $A$ sums to zero), hence we may delete one column and replace it with the condition $P_0 + P_1 + \cdots + P_r = 1$
- Hence, we have $P A_1 = b$, where
  $A_1 = \begin{bmatrix} a_{00} & a_{01} & \cdots & 1 \\ a_{10} & a_{11} & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ a_{r0} & a_{r1} & \cdots & 1 \end{bmatrix}$ and $b = [0, 0, \ldots, 0, 1]$

Example
  $[P_0, P_1, P_2] \begin{bmatrix} -\mu_B & 0 & 1 \\ \lambda_2 & -(\lambda_2+\mu_1) & 1 \\ 0 & \lambda_1 & 1 \end{bmatrix} = [0, 0, 1]$
which gives
  $P_0 = \frac{\lambda_1 \lambda_2}{(\lambda_2+\mu_B)\lambda_1 + (\lambda_2+\mu_1)\mu_B}$
  $P_1 = \frac{\mu_B \lambda_1}{(\lambda_2+\mu_B)\lambda_1 + (\lambda_2+\mu_1)\mu_B}$
  $P_2 = \frac{\mu_B (\lambda_2+\mu_1)}{(\lambda_2+\mu_B)\lambda_1 + (\lambda_2+\mu_1)\mu_B}$

Numerical solution
- Solving the steady state equations $P A_1 = b$ by hand is a tedious task
- We therefore often solve these equations by numerical methods
- The Markov.xls program does this, where we have to:
  - Define the transition rates
  - Assign numerical values to the transition rates
  - Specify the Markov state space matrix
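The numerical route can be sketched without a spreadsheet. The code below is an illustration only and does not reproduce how Markov.xls works internally: it fills in the diagonal of the transition matrix so that each row sums to zero, replaces the last column with ones, and solves P A1 = b by Gaussian elimination. The function names are illustrative choices; the rate values are those of the two-pump example (lambda1 = 1.00E-03, lambda2 = 5.00E-03, mu1 = 0.125, muB = 0.0416667 per hour).

```python
def generator_matrix(rates, n_states):
    """Build A from off-diagonal rates {(from_state, to_state): rate}."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for (i, j), rate in rates.items():
        A[i][j] = rate
    for i in range(n_states):
        # Diagonal "dummy" element: makes the row sum to zero.
        A[i][i] = -sum(A[i][j] for j in range(n_states) if j != i)
    return A


def solve_linear(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[k]] for k, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def steady_state(A):
    """Solve P A1 = b, where A1 is A with its last column replaced by ones."""
    n = len(A)
    A1 = [row[:-1] + [1.0] for row in A]
    b = [0.0] * (n - 1) + [1.0]
    # P A1 = b is equivalent to transpose(A1) P^T = b^T.
    A1_T = [[A1[i][j] for i in range(n)] for j in range(n)]
    return solve_linear(A1_T, b)


lam1, lam2, mu1, muB = 1.0e-3, 5.0e-3, 0.125, 0.0416667
rates = {(2, 1): lam1, (1, 0): lam2, (1, 2): mu1, (0, 2): muB}
A = generator_matrix(rates, 3)
P = steady_state(A)
print([round(p, 6) for p in P])
```

The result can be compared with the closed-form expressions for P0, P1 and P2; for larger models only the dictionary of rates changes.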

Program for simple Markov analysis
[Figure: screenshot of the Markov.xls spreadsheet; the recoverable content is listed below. The parameter cells are named ("Give the cells names") so that the transition matrix can refer to them.]

Parameter      Value
Dim            3
Init           2
SystFail       0
$\lambda_1$    1.00E-03
$\lambda_2$    5.00E-03
$\mu_1$        0.125
$\mu_B$        0.0416667

Steady state probabilities: $P_0$ = 0.000915, $P_1$ = 0.007627, $P_2$ = 0.991458
Visit frequencies: $\nu_0$ = 3.81317E-05, $\nu_1$ = 0.000991458, $\nu_2$ = 0.000991458
MTTFS = 26200.02

Transition matrix (numeric values)
          To 0       To 1     To 2
From 0    -0.04167   0        0.041667
From 1    0.005      -0.13    0.125
From 2    0          0.001    -0.001
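The spreadsheet figures for the two-pump example can be cross-checked by hand. The sketch below is an independent check, not a description of Markov.xls: it uses the closed-form steady state probabilities, takes the visit frequency of a state as its probability times its total departure rate, and obtains the mean time to first system failure (starting in state 2, with state 0 as the failed state) from a first-step argument, T2 = 1/lambda1 + T1 and T1 = 1/(lambda2 + mu1) + mu1/(lambda2 + mu1) * T2. Treating MTTFS as that quantity is an assumption based on the Init = 2 and SystFail = 0 settings.

```python
lam1, lam2, mu1, muB = 1.0e-3, 5.0e-3, 0.125, 0.0416667

# Closed-form steady state probabilities for states 0, 1 and 2.
D = (lam2 + muB) * lam1 + (lam2 + mu1) * muB
P = [lam1 * lam2 / D, muB * lam1 / D, muB * (lam2 + mu1) / D]

# Total departure rate of each state (the negated diagonal of the matrix).
departure = [muB, lam2 + mu1, lam1]

# Visit frequency: probability of being in the state times its departure rate.
nu = [p * d for p, d in zip(P, departure)]

# Mean time to first system failure from state 2, solving the two
# first-step equations for T2 gives (lam1 + lam2 + mu1) / (lam1 * lam2).
mttf_s = (lam1 + lam2 + mu1) / (lam1 * lam2)

print("P    =", [round(p, 6) for p in P])
print("nu   =", [f"{v:.5e}" for v in nu])
print("MTTF =", round(mttf_s, 2))
```

These values agree with the spreadsheet output to about three or four significant digits; small differences are to be expected from rounding of the displayed inputs.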

Visiting frequencies
- Often we are interested in how many times the system enters the various states, i.e. the visiting frequencies
- The visiting frequency for state $j$ is denoted $\nu_j$ and can be obtained by: $\nu_j = -P_j a_{jj}$
- From our example we obtain the system failure rate
  $\nu_0 = -P_0 a_{00} = \frac{\mu_B \lambda_1 \lambda_2}{(\lambda_2+\mu_B)\lambda_1 + (\lambda_2+\mu_1)\mu_B}$

Time dependent solution
- Up to now we have investigated the steady state situation
- In some situations we also want to investigate the time dependent solution, i.e. the probability that the system is in e.g. state 0 at time $t$
- We now let $P_i(t)$ be the probability that the system is in state $i$ at time $t$
- The time dependent solution may be found from: $P(t) A = dP(t)/dt$
- This can be solved by Laplace methods or by numerical methods
- For numerical methods we apply Markov.xls

Standby generator
- Consider a system fed by a public net
- The failure rate and repair rate of the net are $\lambda_N$ and $\mu_N$
- A generator is installed as a cold backup in case of failure of the public net
- In passive mode the generator has a failure rate $\lambda_{G,0}$; in active mode the failure rate is $\lambda_G$ and the repair rate is $\mu_G$
- In standby mode the generator is tested at intervals of length $\tau$ to reveal hidden fail-to-start failures

Simplified Markov diagram
[Figure: Markov diagram with states 2, 1 and 0; transition labels $\lambda_N$, $\mu_N$, $p\lambda_N$, $\lambda_G$ and $\mu_G$]

From Markov analysis
- $\nu_0$ = Visiting frequency to state 0
- $\nu_0$ is to be used as failure rate ($\lambda$) in the FTA
- $P_0$ = Steady state probability of state 0
- $MDT_0$ = Mean sojourn time in state 0
- $MDT_0 = P_0/\nu_0$ is to be used as MTTR in the FTA
- http://folk.ntnu.no/jvatn/ComputerPrograms/MarkovStandbyGenerator.xls

Exact Markov diagram
[Figure: Markov diagram with states 3, 2, 1 and 0; transition labels $\lambda_N$, $p\lambda_N$, $\mu_N$, $\lambda_G$ and $\mu_G$. State 3 = Net restored before standby generator is repaired]

The net is more than one component
- Assume that the external net is modelled by e.g. a fault tree
- This means that we need to find $\lambda_N$ and $\mu_N$ from the FTA
- We then use $\lambda_N = F_0$ and $\mu_N = F_0/Q_0$, since $Q_0 \approx \lambda_N/\mu_N$
- $\lambda_N$ and $\mu_N$ are then input to the Markov model, which in turn is input to the detailed FTA on the site
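The exchange of parameters between the fault tree and the Markov model amounts to two small conversions, sketched below. The function names are illustrative choices, and the input numbers are borrowed from earlier examples in this material purely to have something to compute with: the TOP event figures of the UPS fault tree (F0 = 9.10398e-006 per hour, Q0 = 9.0849e-005) and the steady state output of the two-pump Markov example (P0 = 0.000915, visit frequency 3.81317E-05 per hour).

```python
def fta_to_markov(F0, Q0):
    """Failure rate and repair rate for a Markov model from FTA results.

    Uses lambda_N = F0 and mu_N = F0 / Q0, which follows from the
    approximation Q0 ≈ lambda_N / mu_N.
    """
    return F0, F0 / Q0


def markov_to_fta(P0, nu0):
    """Failure rate and MTTR for a repairable unit in the FTA.

    The visit frequency nu0 is used as failure rate, and the mean
    sojourn time MDT0 = P0 / nu0 is used as MTTR.
    """
    return nu0, P0 / nu0


lam_N, mu_N = fta_to_markov(F0=9.10398e-6, Q0=9.0849e-5)
print(f"lambda_N = {lam_N:.3e} per hour, mu_N = {mu_N:.4f} per hour "
      f"(mean down time {1 / mu_N:.2f} h)")

rate, mttr = markov_to_fta(P0=0.000915, nu0=3.81317e-5)
print(f"FTA input: failure rate = {rate:.3e} per hour, MTTR = {mttr:.1f} h")
```

For the two-pump example the resulting MTTR is close to 24 hours, i.e. the mean repair time 1/muB of the failed state, which is a useful sanity check on the conversion.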