ILC Global Control System
John Carwardine, ANL
Fermilab ILC School, July 07

ILC Accelerator overview
Major accelerator systems:
  Polarized PC gun electron source and undulator-based positron source.
  5-GeV electron and positron damping rings, 6.7 km circumference.
  Beam transport from damping rings to bunch compressors.
  Two 11-km-long 250-GeV linacs with 15,000+ cavities and ~600 RF units.
  A 4.5-km beam delivery system with a single interaction point.
(J. Bagger)

Control System Requirements and Challenges
General requirements are largely similar to those of any large-scale experimental physics machine, but there are some challenges:
Scalability
  100,000 devices, several million control points.
  Large geographic scale: 31 km end to end.
  Multi-region, multi-lab development team.
Support ILC accelerator availability goals of 85%.
  Intrinsic control system availability of 99% by design.
  Cannot rely on a fix-in-place approach.
  May require 99.999% (five nines) availability from each crate.
  Functionality to help minimize overall accelerator downtime.

Requirements and Challenges (2)
Precision timing & synchronization
  Distribute precision timing and RF phase references to many technical systems throughout the accelerator complex.
  Requirements consistent with LLRF requirements of 0.1% amplitude and 0.1 degree phase stability.
Support remote operations / remote access (GAN / GDN)
  Allow collaborators to participate in machine commissioning, operation, optimization, and troubleshooting.
  At the technical equipment level there is little difference between on-site and off-site access: the Control Room is already remote.
  There are both technical and sociological challenges.
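
As a rough illustration of what the 0.1 degree phase stability requirement means for timing distribution, the sketch below converts a phase tolerance into a time tolerance. It assumes the 1.3 GHz reference frequency quoted for the phase reference link in the component counts; the function name is illustrative only and not part of any control system API.

```python
def phase_tolerance_to_seconds(phase_deg: float, freq_hz: float) -> float:
    """Return the time interval that corresponds to phase_deg of one RF cycle."""
    period_s = 1.0 / freq_hz
    return (phase_deg / 360.0) * period_s


# Assumption: the phase reference runs at 1.3 GHz.
jitter_s = phase_tolerance_to_seconds(0.1, 1.3e9)
print(f"0.1 degree at 1.3 GHz is about {jitter_s * 1e15:.0f} fs")
```

Under that assumption the tolerance is a little over 200 femtoseconds, which is why phase reference distribution is treated as a requirement in its own right.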

Requirements and Challenges (3)
Extensive reliance on machine automation
  Manage accelerator operations of the many accelerator systems, e.g. 15,000+ cavities, 600+ RF units.
  Automate machine startup, cavity conditioning, tuning, etc.
Extensive reliance on beam-based feedback
  Multiple beam-based feedback loops at 5 Hz, e.g.
    Trajectory control, orbit control
    Dispersion measurement & control
    Beam energies
    Emittance correction

Control System Functional Model
Client Tier
  GUIs
  Scripting
Services Tier
  Business logic
  Device abstraction
  Feedback engine
  State machines
  Online models
Front-End Tier
  Technical systems interfaces
  Control-point level

Physical Model as applied to main linac (Front-end)

Some representative component counts

Component                 Description                                                            Quantity
1U Switch                 Initial aggregator of network connections from technical systems       8356
Controls Shelf            Standard chassis for front-end processing and instrumentation cards    1195
Aggregator Switch         High-density connection aggregator for 2 sectors of equipment          71
Controls Backbone Switch  Backbone networking switch for controls network                        126
Phase Ref. Link           Redundant fiber transmission of 1.3 GHz phase reference                68
Controls Rack             Standard rack populated with one to three controls shelves             753
LLRF Station              Two racks per station for signal processing and motor/piezo drives     668
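
To see why a per-crate figure as high as five nines comes up, here is a minimal sketch. It assumes a pure series model: every one of the 1195 controls shelves must be up for the control system to count as up, and failures are independent. That is a deliberately pessimistic simplification for illustration, not the ILC availability model, and it ignores redundancy.

```python
def series_availability(unit_availability: float, n_units: int) -> float:
    """Availability of n independent units that must all be up at once."""
    return unit_availability ** n_units


N_SHELVES = 1195  # controls shelves, from the component counts

for per_shelf in (0.999, 0.9999, 0.99999):
    system = series_availability(per_shelf, N_SHELVES)
    print(f"per-shelf {per_shelf}: system {system:.3f}")
```

In this simplified model, three nines per shelf gives a system availability of roughly 0.30, four nines roughly 0.89, and five nines roughly 0.988.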

Which Control System?
Established accelerator control system? EPICS, DOOCS, TANGO, ACNET, ...
Development from scratch?
Commercial solution?
It is too early to down-select for ILC, and there are benefits to not down-selecting during the R&D phase.

Availability Design Philosophy for the ILC
Design for availability up front.
Budget 15% downtime total. Keep an extra 10% as contingency.
Try to get high availability for the minimum cost.
Will need to iterate as the design progresses.
  Quantities are not final.
  Engineering studies may show that the cost minimum would be attained by moving some of the unavailability budget from one item to another.
  This means some MTBFs may be allowed to go down, but others will have to go up.
Availability/reliability modeling (Availsim)

Availability budgets by system (percentage of total downtime)

System             Share of downtime
RF power sources   12%
Cryo               8%
Vacuum             16%
PS + controllers   17%
Magnets            5%
AC power           7%
Water system       11%
RF structure       7%
Diagnostic         0%
Controls           17%

MTBF/MTTR requirements from Availsim

Column A: improvement factor that gives 17% downtime for 2 tunnel undulator e+ source.
Column B: downtime (%) due to these devices for 2 tunnel undulator e+ source with strong keep_alive.

Device                            A    B     Nominal MTBF (hours)   Nominal MTTR (hours)
magnets - water cooled            20   0.4   1,000,000
power supply controllers          10   0.6   100,000
flow switches                     10   0.5   250,000
water instrumentation near pump   10   0.2   30,000
power supplies                    5    0.2   200,000
kicker pulser                     5    0.3   100,000
klystron - linac                       0.8   40,000                 8
coupler interlock electronics          0.4   1,000,000              1
vacuum pumps                           0.9   10,000,000             4
controls backbone                      0.8   300,000                1

(Nominal MTTR for the first six devices survives only as the values 8, 4, 4, which cannot be assigned to rows.)

High Availability primer
Availability A = MTBF / (MTBF + MTTR)
  MTBF = Mean Time Between Failures
  MTTR = Mean Time To Repair
If MTBF approaches infinity, A approaches 1.
If MTTR approaches zero, A approaches 1.
Both are impossible on a unit basis.
Both are possible on a system basis.
Key features for HA, i.e. A approaching 1:
  Modular design
  Built-in 1/n redundancy
  Hot-standby systems
  Hot-swap capable at subsystem unit or subunit level
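
The availability formula and the effect of built-in redundancy can be sketched as follows. The MTBF and MTTR values are the klystron and controls backbone rows of the Availsim table; the redundancy example (a unit availability of 0.99, with 10 units required and one spare) is a hypothetical chosen for illustration, and the k-of-n formula assumes independent, identical units.

```python
from math import comb


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def k_of_n_availability(unit_a: float, k: int, n: int) -> float:
    """Probability that at least k of n independent identical units are up."""
    return sum(
        comb(n, i) * unit_a**i * (1.0 - unit_a) ** (n - i)
        for i in range(k, n + 1)
    )


klystron = availability(40_000, 8)    # klystron - linac
backbone = availability(300_000, 1)   # controls backbone
print(f"klystron A = {klystron:.6f}, backbone A = {backbone:.7f}")

# Hypothetical: 10 units needed, each with A = 0.99.
no_spare = k_of_n_availability(0.99, 10, 10)
one_spare = k_of_n_availability(0.99, 10, 11)
print(f"10 of 10: {no_spare:.4f}, 10 of 11: {one_spare:.4f}")
```

With these illustrative numbers a single spare raises the group availability from about 0.904 to about 0.995, which is the sense in which A approaching 1 is possible on a system basis even though it is not on a unit basis.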

Systems That Never Shut Down
Any large telecom system will have a few redundant shelves, so loss of a whole unit does not bring down the system, like the RF system in the linac.
Load is auto-rerouted to a hot spare, again like the linac.
Key: all equipment always accessible for hot swap.
Other features:
  Open system, non-proprietary: very important for non-telecom customers like ILC.
  Developed by an industry consortium of major companies sharing in a $100B market.
  A market 20X larger than that of any of the old standards, including VME, leads to competitive prices.
PICMG: PCI Industrial Computer Manufacturers Group

Controls Cluster
Features:
  Dual Star 1/N redundant backplanes
  Redundant fabric switches
  Dual Star / Loop / Mesh serial links
  Dual Star serial links to/from Level 2 sector nodes
[Diagram labels: Dual Star / Loop / Mesh; Applications Modules; Dual Fabric Switches; Dual Star to/from Sector Nodes]

HA Concept: DR Kicker Systems
Approx. 50 unit drivers
n/N redundancy at system level (extra kickers)
n/N redundancy at unit level (extra cards)
Diagnostics on each card, networked, local wireless

Physical Model as applied to main linac (Front-end)

High Availability Control System
Control system itself must be highly available
  Redundant and hot-swap hardware platform (baseline ATCA).
  Redundancy functionality in control system software.
In many cases, redundancy and hot-swap/hot-reconfigure can only be implemented at the accelerator system level, e.g.
  Rebalance RF systems if a klystron fails.
  Modify control algorithm on loss of a critical sensor.
Control system will provide high-availability functionality at the accelerator system level.
Technical systems must provide a high level of diagnostics to support remote troubleshooting and re-configuration.

ATCA as a reference platform
5-Slot Crate w/ Shelf Manager
  Fabric Switch
  Dual IOC Processors
  4 Hot-Swappable Fans
16-Slot Dual Star Backplane
  Shelf Manager
  Dual 48VDC Power Interface
  Rear view: Dual IOCs, Fabric Switch
(R. Larsen)

ATCA as reference platform for Front-end electronics
Representative of the breadth of high-availability functions needed
  Hot-swappable components: circuit boards, fans, power supplies, ...
  Remote power management: power on/off each circuit board
  Supports redundancy: processors, comms links, power supplies, ...
  Remote resource management through Shelf Manager
µTCA offers lower cost but with a reduced feature set.
There is growing interest in the physics community in exploring ATCA for instrumentation and DAQ applications.
As candidate technology for the ILC, ATCA/µTCA have strong potential; currently it is an emerging standard.

Read Out evolution LHC --> ILC
[Diagram labels: Subdetector; Digital Buffer; CCTA; Read Out Crate (VME 9U); Read Out Driver; 92; AMC; SLink; 400; ROBin (PCI); ROS (150 PCs); Read Out Buffer (3 ROBin); Gbit link to GbE switch (60 PCs); ATCA Module; ATCA Crate]

Cost/Benefit Analysis of HA Techniques
[Chart: availability (benefit) versus cost (some effort laden, some materials laden)]
  13. Automatic failover
  12. Model-based automated diagnosis
  11. Manual failover (e.g. bad memory, live patching)
  10. Hot swap hardware
  9. Application design (error code checking, etc.)
  8. Development methodology (testing, standards, patterns)
  7. Adaptive machine control (detect failed BPM, modify feedback)
  6. Model-based configuration management (change management)
  5. Extensive monitoring (hardware and software)
  4. COTS redundancy (switches, routers, NFS, RAID disks, database, etc.)
  3. Automation (supporting RF tune-up, magnet conditioning, etc.)
  2. Disk volume management
  1. Good administrative practices

HA R&D objectives
Learn about HA (High Availability) in the context of accelerator controls
  Bring in expertise (RTES, training, NASA, military, ...)
Develop (adopt) a methodology for examining control system failures
  Fault tree analysis
  FMEA or scenario-based FMEA
  Supporting software (CAFTA, SAPPHIRE, ...)
  Others?
Develop policies for detecting and managing identified failure modes
  Development and testing methodology
  Workaround
  Redundancy
Develop a full vertical prototype implementation
  I.e. how we might implement the above policies
  Integrate portions of vertical prototype with test stands (LLRF)
  Feed some software-oriented data to SLAC availability simulation?

High Availability Software
What are the most common and critical failure modes in control system software?
  Mis-configuration
  Network buffer overruns
  Ungraceful handling of failed sensors/actuators
  Application logic bugs
  Flying blind (lack of monitoring)
  Task deadlock
  Introduction of untested features
  Accepting conflicting commands
  More
How do we mitigate these, and what is the cost/benefit?
[Diagram: Availability = MTBF / (MTBF + MTTR), surrounded by contributing techniques: Software QA; Development Methodology; Conflict Avoidance; Configuration Management; Infrastructure Monitoring; Software Runtime Lifecycle Management; Shelf Management; Automation]

Sample of Techniques: Shelf Management
[Diagram labels: Client Tier; Services Tier; Front-end tier; IPMI, HPI, SNMP, others; Controls Protocol; Custom I/O; CPU1; CPU2; SM; sensor]
Shelf Manager:
  Identify all boards on shelf
  Power cycle boards (individually)
  Reset boards
  Monitor voltages/temps
  Manage Hot-Swap LED state
  Switch to backup flash mem bank
  More
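
A few of the shelf manager functions listed above can be sketched as a toy in-memory model. Everything here is hypothetical: the class names, the temperature threshold, and the board data are invented for illustration, and the sketch does not talk to real hardware or use IPMI, HPI, or SNMP.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Board:
    slot: int
    powered: bool = True
    temperature_c: float = 40.0
    power_cycles: int = 0


@dataclass
class ShelfManager:
    boards: Dict[int, Board] = field(default_factory=dict)
    max_temperature_c: float = 70.0  # hypothetical alarm threshold

    def add_board(self, board: Board) -> None:
        self.boards[board.slot] = board

    def identify_boards(self) -> List[int]:
        """Identify all boards on the shelf, by slot number."""
        return sorted(self.boards)

    def power_cycle(self, slot: int) -> None:
        """Power cycle one board without touching the others."""
        board = self.boards[slot]
        board.powered = False
        board.powered = True
        board.power_cycles += 1

    def over_temperature_slots(self) -> List[int]:
        """Monitor temperatures and report slots above the threshold."""
        return [
            slot
            for slot, board in sorted(self.boards.items())
            if board.temperature_c > self.max_temperature_c
        ]


shelf = ShelfManager()
shelf.add_board(Board(slot=1))
shelf.add_board(Board(slot=2, temperature_c=85.0))
print("boards:", shelf.identify_boards())
print("over temperature:", shelf.over_temperature_slots())
shelf.power_cycle(1)
```

The point of the sketch is that each board is individually addressable, so a fault on one slot can be diagnosed and acted on remotely without taking down the rest of the shelf.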

SAF Availability Management Framework
A simple example of software component runtime lifecycle management
[Diagram: AMF logical entities. Labels: Node U; Node V; Service Unit (one per node, each containing Components); active; standby; Service Instance; Service Group]
Service Instance is work assigned to a Service Unit.
Service Unit administrative states: Unlocked, Locked, Shutting down, Locked-Instantiation
  1. Service unit starts out un-instantiated.
  2. State changed to locked, meaning software is instantiated on node, but not assigned work.
  3. State changed to unlocked, meaning software is assigned work (Service Instance).

SAF (Service Availability Forum) Specifications
[Diagram courtesy of Service Availability Forum. Layer labels: HA Applications; Other Middleware and Application Services; AIS Middleware; HPI Middleware; Carrier Grade Operating System; Managed Hardware Platform]
Application Interface Specification: CLM, AMF, IMM, NTF, LOG, CKPT, MSG, EVT, LCK
Hardware Platform Interface: Sensor, Control, Annunciator, Inventory, Watchdog, Hotswap, Power, Reset, Event, Config

SAF Availability Management Framework
AMF: Availability Management Framework
  Manages software runtime lifecycle, fault reporting, failover policies, etc.
  Works in combination with a collection of well-defined services to provide a powerful environment for application software components.
    CLM: Cluster Membership Service
    LOG: Log Service
    CKPT: Checkpoint Service
    EVT: Event Service
    LCK: Lock Service
    More
An open standard from the telecom industry geared towards supporting a highly available, highly distributed system.
Potential application to critical core control system software such as IOCs, device servers, gateways, nameservers, data reduction, etc.
  Know exactly what software is running where.
  Be able to gracefully restart components, or manage state while hot-swapping underlying hardware.
  Uniform diagnostics to troubleshoot problems.

An HA software framework is just the start
SAF (Service Availability Forum) implementations won't solve the HA problem.
You still have to determine what you want to do and encode it in the framework; this is where the work lies:
  1. What are the failures?
  2. How to identify a failure
  3. How to compensate (failover, adaptation, hot-swap)
Is the resultant software complexity manageable?
  Potential fix worse than the problem.
Always evaluate: am I actually improving availability?

R&D Engineering Design (EDR) Phase
Main focus of R&D efforts is on high availability
  Gain experience with high availability tools & techniques to be able to make value-based judgments of cost versus benefit.
Four broad categories
  Control system failure mode analysis
  High-availability electronics platforms (ATCA)
  High-availability integrated control systems
    Conflict avoidance & failover, model-based resource monitoring.
  Control system as a tool for implementing system-level HA
    Fault detection methods, failure modes & effects

HA means doing things differently
ILC must apply techniques not typically used at an accelerator, particularly in software
  Development culture must be different this time.
  Cannot build ad-hoc with in-situ testing.
  Build modeling, simulation, testing, and monitoring into hardware and software methodology up front.
Reliable hardware
  Instrumentation electronics to servers and disks.
  Redundancy where feasible, otherwise adapt in software.
  Modeling and simulation (T. Himel).
Reliable software
  Equally important.
  Software has many more internal states, which are difficult to predict.
  Modeling and simulation needed here for networking and software.

Controls topic areas
LLRF algorithms
RF phase & timing distribution, synchronization
Machine automation, beam-based feedback
ATCA evaluation as front-end instrumentation platform
ATCA evaluation for control system integration
HA integrated control system
Integrated control system as a tool for system-level HA
Remote access, remote operations (GAN/GDN)
Failure modes analysis
Lots of opportunities to get involved