COLLECTIVE INTELLIGENT BRICKS
  • "COLLECTIVE INTELLIGENT BRICKSIn this prototype system, a total of nine Gigabit Ethernet connections to external applicationservers was deemed sufficient. Industrial control modules are mounted in the base. They are usedto sense critical system voltages and temperatures, detect moisture, and measure the coolant flowrates in the nine cold rail loops; under control of an external management processor, they candetect the presence of individual bricks and turn them on or off. The control modules utilize apoint-to-point signaling path to perform the presence-detect and power-control functions. This isimplemented very robustly by directly controlling the dc/dc converters without any assistance orinterference from the brick processors or software. Thus, an errant brick (e.g., one that floods thenetwork with packets and does not respond to any inputs) can always be shut down or reset. Thepoint-to-point path uses additional pins on the floating connectors of the dc power distributionsystem. System reliability-deferred maintenance and fail-in-place 8.1. Overview The brick architecture as defined herein requires that, if any brick fails, the system will continueto function without noticeable impact. This is particularly important for 3D systems, in whichreplacement of bricks is impractical. Very large systems must use a failure-tolerant architecture. Petascale systems contain millions ofelectronic components, and the chance that they are all working at the same time is negligible[13]. Once a system is sufficiently failure-resilient, maintenance may be deferred. This feature ishighly valued by customers because it eliminates the cost of ongoing maintenance and avoids thepotential for human error when performing the maintenance. According to IBM internal studies,the latter has become a very significant source of system outages. The ultimate goal is to build asystem that requires no maintenance during its entire lifetime. Whether or not this goal can beachieved depends on both the reliability of system components and the correctness of thesoftware managing it. A companion paper [3] in this issue presents a detailed analysis of fail-in-place and deferredmaintenance. This section presents only a summary of key results. DEPT OF IT, PDCE 2009-2010 Page 31 COLLECTIVE INTELLIGENT BRICKSPreventing data loss via dRAID The implementation of deferred maintenance for storage servers requires that data be replicatedover multiple bricks so that the failure of one (or more) disks or bricks does not lead to loss ofuser data. The algorithm for placement of redundant copies of data over multiple bricks is calleddistributed RAID (dRAID) [14]. The following example illustrates the basic idea of dRAID. It assumes that a file is mirrored onbrick A and brick B. If brick A fails, the file can still be retrieved from B. However, if B fails,the data will be lost. Therefore, after the failure of A, system software is required to copy the file(while maintaining coherency in the face of concurrent updates to the data) from good brick B toa third good brick C. There is a time window of vulnerability after the failure of A and before thecopy operation from B to C is completed. If the file is very valuable, it can be stored on morethan two bricks, thus further reducing the probability of data loss. A simple N-way mirroring (i.e., more than two copies of data are created) is not a good tradeoffbetween cost and reliability, but performs very well when writing. 
Various RAID schemes are in wide use today, in particular RAID 5, in which the exclusive-OR of all data on N disks is written onto an additional disk [15]; this allows reconstruction of the data if one of the N + 1 disks fails. The brick architecture, with its distributed computing power and high-bandwidth interconnect, enables many versions of dRAID. Its four goals are to accommodate the addition or loss of bricks, to allow the storage administrator or user to determine the tradeoff between cost and probability of data loss, to maintain acceptable performance, and to use system-wide spare capacity efficiently. It is possible to achieve extraordinarily low probabilities of data-loss events with capacity penalties comparable to those for two-way mirroring, even while assuming the high failure rates of commodity disk drives [16]. The target reliability is only two data-loss events per exabyte-year due to multiple failures.

The high-bandwidth interconnect of the IceCube mesh is the key enabler for these dRAID algorithms. It provides the flexibility to use spare capacity from anywhere in the cube. This prevents a common problem with conventional RAID controllers: the statically preallocated spare disk capacity for each disk array is small and fixed, making it very difficult to achieve above-target reliability unless failed disks are quickly replaced. In such a high-pressure repair situation, there is also a significant probability of replacing the wrong disk, which itself is a leading cause of data loss.

Brick failures in a 3D mesh

This section presents a summary of the effect of brick failures; a detailed analysis is found in [3]. The analysis ignores failures in the IceCube base and errors in the storage software.

The system can tolerate the failure of numerous bricks; only after 40% of all bricks have failed is a nonlinear degradation of performance (bandwidths, usable capacity, and I/O) observed. These findings are surprisingly insensitive to the details of the system studied, such as the overall size of the system and the level of storage redundancy chosen. For optimum performance, a 3D system should be operated with about 70% of its bricks operational. Another conclusion from the detailed analysis is that external application server systems should be connected to multiple surface bricks, either directly or through an external switch.

This section also discusses the number of spare bricks that must be provided in order to keep a system operational for a given period of time without requiring maintenance (adding more bricks). This depends on the hardware failure rate of the brick electronics and disks. The results, assuming that the system requires a five-nines (0.99999) probability of being available, are shown in Figure 5; the parameters of the curves are percentages of overprovisioning. If one assumes a rather typical hardware failure rate of 4.5% per storage brick per year, equally split between electronics and disks, a system requires an overprovisioning of 25% to support a maintenance-free lifetime of 2.5 years. While 25% overprovisioning is high, more realistic scenarios call for a much lower number. If a customer buys spare bricks on demand (that is, only when the actual free capacity in the system falls below a certain level), the total number of spare bricks found in the system after a few years is much lower, because the newer bricks have a higher storage capacity.

Figure 5. Required maintenance intervals as a function of brick reliability.
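As a rough illustration of how such an overprovisioning figure can be estimated, the sketch below assumes independent brick failures modeled as a Poisson process and searches for the smallest number of spares that meets a five-nines target over the maintenance-free period. The brick count, the failure model, and therefore the resulting percentage are illustrative assumptions only and will not reproduce the 25% figure above, which comes from the more detailed analysis in [3].

    # Back-of-the-envelope estimate of required overprovisioning.
    # The Poisson model and the brick count are illustrative assumptions.
    import math

    def spares_needed(n_bricks, annual_fail_rate, years, target=0.99999):
        """Smallest number of spares s such that P(at most s failures over
        the whole period) >= target, assuming independent failures."""
        lam = n_bricks * annual_fail_rate * years   # expected brick failures
        term = math.exp(-lam)                       # P(exactly 0 failures)
        cdf, spares = term, 0
        while cdf < target:
            spares += 1
            term *= lam / spares                    # Poisson recurrence
            cdf += term
        return spares

    n = 960                                         # assumed brick count
    s = spares_needed(n, annual_fail_rate=0.045, years=2.5)
    print(s, "spare bricks, about", round(100 * s / n), "% overprovisioning")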
8.2. Software

The software currently running on IceCube provides a distributed, scale-out file system. The long-term goals are to provide a reliable high-performance file service, decrease administrative costs, and increase system efficiency.

For the software discussion, it is assumed that the system is used as a storage server; the present state and near-term plans for the software are discussed. In its role as a storage server, no end-user application software is expected to run on the brick processors; only a restricted set of open-source and IBM-owned programs is used. End-user applications run on compute-oriented servers, such as Blue Gene/L [17], which are collectively called application servers.

Major components

The software includes four distinct elements: a distributed file system for storing data in the bricks, a monitoring and control system for safety and power control of the hardware modules, a self-management system for analyzing system state and performing recovery actions for hardware and software problems, and a user interface for configuration and reporting.

The system software residing on IceCube and its attached application servers is shown in Figure 6. It includes software running on a management processor connected to the IceCube base in addition to the software running in each of the IceCube bricks. The OS, the mesh routing protocol software, and a thin layer of IceCube-specific low-level modules that provide hardware monitoring and control together form the operating environment.

Two major pieces of user-level code reside on top of the operating environment: the IBM General Parallel File System (GPFS) [18] and Kybos (the companion research project building the software for the IceCube prototype). GPFS is a clustered file system product developed originally for the IBM SP family of parallel supercomputers and has been deployed on SP systems with up to 2,000 nodes. Kybos provides the self-management software and a management interface to the administrator's Web browser.

Operating environment

The operating environment provides the low-level software platform on which distributed self-managing storage services run. All executable software is stored in a flash memory inside each brick. Each flash contains two different versions of the software; thus, if a new code load renders a brick inoperative, one can revert to the previous version. For performance reasons, the executable software is copied onto disk during boot.

Figure 6. Software stack implemented on IceCube.
Figure 7. Block diagram of the IceCube base.

Linux is used for both the bricks and the management processor. Certain drivers for the specific brick hardware (that is, the NIC and the eight-port switch) have been built into the Linux kernel (v2.4) used in the bricks. As shown in Figure 7, the management processor interfaces directly with the industrial control modules in the base, which connect to thermal, water-flow, and brick-presence sensors and also drive power-control signals to each brick. Because of this direct connection, the management processor has a powerful tool for restoring system stability after failures are detected.
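The kind of safety action this direct connection enables can be sketched as follows. All class and method names here (ControlModule, power_off, power_cycle) are hypothetical stand-ins for the point-to-point power-control path, not the real interface; the temperature and heartbeat thresholds are likewise assumed values.

    # Sketch of safety logic the management processor could apply through
    # the industrial control modules; names and thresholds are hypothetical.
    import time

    class ControlModule:
        """Stand-in for the point-to-point power-control path to a brick."""
        def power_off(self, brick_id):
            print("brick", brick_id, ": dc/dc converters disabled")
        def power_cycle(self, brick_id):
            print("brick", brick_id, ": power cycled")

    def enforce_safety(brick_states, control, max_temp_c=60.0, heartbeat_timeout_s=5.0):
        """brick_states maps brick_id -> {'temp_c': ..., 'last_heartbeat': epoch seconds},
        as recorded once per second in the monitoring database."""
        now = time.time()
        for brick_id, state in brick_states.items():
            if state["temp_c"] > max_temp_c:
                control.power_off(brick_id)        # over-temperature: shut down
            elif now - state["last_heartbeat"] > heartbeat_timeout_s:
                control.power_cycle(brick_id)      # unresponsive (errant) brick: reset

    # Example with three hypothetical bricks.
    states = {
        "2/3/1": {"temp_c": 47.0, "last_heartbeat": time.time()},       # healthy
        "1/1/2": {"temp_c": 71.5, "last_heartbeat": time.time()},       # over temperature
        "0/2/3": {"temp_c": 45.0, "last_heartbeat": time.time() - 30},  # unresponsive
    }
    enforce_safety(states, ControlModule())

Because the power path is independent of the brick processors and software, these actions succeed even when a brick no longer responds to any network input.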
Each brick individually monitors processor and drive temperatures and reports the data back to the management processor using a Linux cluster management tool called Ganglia [19]. The management processor runs software that monitors and records the brick states in a database. This happens approximately once per second and also serves as a heartbeat detector for the bricks. The database is read by the Kybos software, which takes appropriate action (as described below in the Kybos section) and displays the current state of the bricks, as shown in Figure 8. The main part of the screen shows performance statistics gathered from the bricks. The window in the left corner shows overall status and parameters for the cube, such as voltages, on/off status, and coolant flow rates. Sensor data from within one brick, which is selected and highlighted with an orange circle and labeled "2/3/1," is also displayed.

Figure 8. Control panel for the IceCube operating environment.

Network communications

The switch chip within each brick must be supplied with a routing table. The routing table depends on the topology of the system, including the cube itself as well as external switches and application servers. This is accomplished with the IceCube mesh routing control software. The commonly used spanning-tree algorithm is unsuitable for a highly connected 3D topology with loops. Instead, an algorithm determines possible communication paths between pairs of nodes and selects one with the minimum hop count. The selected paths remain in use until the topology of the mesh changes through the addition or failure of a brick. Possible improvements include the use of several shortest-distance paths in parallel. Note that there is a danger of forming loops if there are connections to an external switch; special care is taken to ensure that messages that enter the cube and are not targeted at a specific brick (such as broadcast messages) are intercepted before they leave the cube through another brick.

Ethernet connects the management processor directly to a brick in one corner, which serves as the origin of the internal Cartesian coordinate system of the cube. A position server process in the management processor provides that service, and each brick can determine its position with reference to the origin.

A Dynamic Host Configuration Protocol (DHCP) server in the management processor issues IP addresses to all bricks, enabling TCP/IP and UDP (User Datagram Protocol) communication between bricks and between bricks and application servers.

8.3. GPFS

GPFS provides the basic I/O service layer for the system. As seen in Figure 6, the IceCube GPFS cluster has services running in the management processor and in the bricks. The management processor GPFS node serves as the sole quorum node for the IceCube GPFS cluster; GPFS services running in the bricks are therefore not required for a quorum (i.e., most bricks could fail and the GPFS cluster would continue to operate). The major GPFS function running on all of the bricks is the network shared disk (NSD) service. This GPFS service exports the disks in the bricks as logically shared disks, instructing GPFS that all I/O requests for those disks should be directed to the NSD service exporting the disk.
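The notion of a logically shared disk whose I/O is redirected to the brick exporting it can be pictured with a small sketch. This is not GPFS code; the mapping, the disk and brick names, and the function name below are hypothetical illustrations of the idea only.

    # Simplified illustration of the NSD idea: every disk is logically visible
    # to all nodes, but I/O for a disk is forwarded to the brick exporting it.
    # nsd_map and forward_io are hypothetical names, not GPFS interfaces.

    # Logical shared-disk name -> brick that physically exports it.
    nsd_map = {
        "nsd01": "brick-0-0-0",
        "nsd02": "brick-0-1-0",
        "nsd03": "brick-2-3-1",
    }

    def forward_io(disk, offset, data):
        """Send a write for a logically shared disk to the exporting brick."""
        server = nsd_map[disk]          # look up which brick exports the disk
        print("forwarding", len(data), "bytes at offset", offset,
              "on", disk, "to the NSD service on", server)

    # Any node in the cluster can issue I/O against any logical disk.
    forward_io("nsd03", offset=4096, data=b"\x00" * 512)

Because the quorum is held solely by the management processor node, this data path keeps working even when many bricks have failed, as described above.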
