Lecture 2: Distributed Systems
What Is Different in Distributed Systems?
- Higher inter-CPU communication latency
- Individual nodes need to act more autonomously
- Different nodes can be heterogeneous (by function, location…)
- System reliability is much harder to maintain

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." -- Leslie Lamport

Reliability Demands: Support Partial Failure
- The total system must degrade gracefully in application performance rather than come to a full halt

Reliability Demands: Data Recoverability
- If components fail, their workload must be picked up by still-functioning units

Reliability Demands: Individual Recoverability
- Nodes that fail and restart must be able to rejoin the group activity without a full group restart

Reliability Demands: Scalability
- Adding load to the system should cause a graceful decline in performance, not outright failure
- Adding resources should support a proportional increase in load capacity

Reliability Demands: Security
- The entire system should be impervious to unauthorized access
- Requires considering many more attack vectors than single-machine systems

"Failure is the defining difference between distributed and local programming." -- Ken Arnold, CORBA designer

Component Failure
- Individual nodes simply stop

Data Failure
- Packets omitted by an overtaxed router, or dropped by a full receive buffer in the kernel
- Corrupt data retrieved from disk or network (a checksum sketch follows the slides)

Network Failure
- External and internal links can die
- Some failures can be routed around in a ring or mesh topology
- A star topology may cause individual nodes to appear to halt
- A tree topology may cause a "split"
- Messages may be sent multiple times, not at all, or in corrupted form…

Timing Failure
- Temporal properties may be violated
- Lack of a "heartbeat" message may be interpreted as a component halt (see the failure-detector sketch after the slides)
- Clock skew between nodes may confuse version-aware data readers

Byzantine Failure
- Difficult-to-reason-about circumstances arise
- Commands sent to a foreign node are not confirmed: was the command executed, or not? (see the retry sketch after the slides)
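The Data Failure slide notes that corrupt data can come back from disk or the network. A minimal sketch of end-to-end checksum verification in Python; hashlib is standard library, but the digest-passing convention here is an illustrative assumption, not part of the lecture:

```python
import hashlib

def checksum(payload: bytes) -> str:
    """SHA-256 digest of a payload, computed by the writer/sender."""
    return hashlib.sha256(payload).hexdigest()

def verify(payload: bytes, expected: str) -> bytes:
    """Recompute the digest on read/receive; corruption on disk or on the
    wire surfaces as an error instead of being consumed silently."""
    if checksum(payload) != expected:
        raise IOError("corrupt data: checksum mismatch")
    return payload
```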
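For the Timing Failure slide: a minimal heartbeat-based failure detector, assuming a hypothetical FAILURE_TIMEOUT tuning value. It illustrates why a missed heartbeat can only ever yield *suspicion* of a halt, never certainty:

```python
import threading
import time

# Hypothetical tuning value: seconds of silence before a node is suspected dead.
FAILURE_TIMEOUT = 3.0

class FailureDetector:
    def __init__(self):
        self._last_seen = {}            # node id -> time of last heartbeat
        self._lock = threading.Lock()

    def record_heartbeat(self, node_id):
        """Call whenever a heartbeat message arrives from node_id."""
        with self._lock:
            self._last_seen[node_id] = time.monotonic()

    def suspected(self):
        """Nodes silent longer than FAILURE_TIMEOUT are only *suspected* dead:
        a slow network and a crashed node look identical from here, which is
        exactly the timing-failure ambiguity the slide describes."""
        now = time.monotonic()
        with self._lock:
            return [n for n, t in self._last_seen.items()
                    if now - t > FAILURE_TIMEOUT]
```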
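For the Byzantine Failure slide's unconfirmed command: a sketch of at-least-once retry paired with server-side deduplication. The names channel.call and execute are hypothetical stand-ins for a real RPC layer and the actual state change; the pattern, not the API, is the point:

```python
import uuid

# Client side: retry an unconfirmed command. Without a confirmation we cannot
# distinguish "lost before execution" from "executed, but the ack was lost" --
# so the same command may be delivered more than once.
def send_with_retry(channel, command, max_attempts=3):
    request_id = str(uuid.uuid4())      # unique id lets the server deduplicate
    for _ in range(max_attempts):
        try:
            return channel.call(request_id, command)   # hypothetical RPC
        except TimeoutError:
            continue                    # executed or not? we cannot know yet
    raise RuntimeError("command unconfirmed after retries")

# Server side: remember handled request ids so a duplicate delivery replays
# the cached result instead of re-executing the command.
_processed = {}

def handle(request_id, command, execute):
    if request_id in _processed:
        return _processed[request_id]
    result = execute(command)           # the actual state change (caller-supplied)
    _processed[request_id] = result
    return result
```

Deduplicating by request id makes retried commands idempotent, so the "messages may be sent multiple times" case from the Network Failure slide becomes harmless.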