Lecture Notes
Dr. Tong Lai Yu, March 2010
    0. Review and Overview
    1. An Introduction to Distributed Systems
    2. Deadlocks
    3. Distributed Systems Architecture
    4. Processes
    5. Communication
    6. Distributed OS Theories
        7. Distributed Mutual Exclusions
    8. Agreement Protocols
    9. Distributed Scheduling
    10. Distributed Resource Management
    11. Recovery and Fault Tolerance
    12. Security and Protection
    In a few hundred years, when the history of our time
    is written from a long-term perspective, it is likely
    that the most important event those historians will see
    is not technology, not the Internet, not e-commerce. It
    is an unprecedented change in human condition.  For the
    first time, they will have to manage themselves.
    					Peter Drucker
    Distributed OS Theories
    1. Inherent Limitations of a Distributed System

    2. Absence of Global clock
      • difficult to establish the temporal order of events
      • difficult to collect up-to-date information on the state of the entire system

    3. Absence of Shared Memory
      • no individual process has an up-to-date state of the entire system, as there is no shared memory
      • coherent view -- all observations of different processes ( computers ) are made at the same physical time; a process can obtain either a coherent but partial view of the system or
        a complete but incoherent view of the system
      • complete view ( global state ) -- local views ( local states ) + messages in transit;
        it is difficult to obtain a coherent global state

    4. Clock Synchronization

      Physical Clocks

      Sometimes we simply need the exact time, not just an ordering.

      Universal Coordinated Time (UTC):

    5. Based on the number of transitions per second of the cesium 133 atom
      (pretty accurate).
    6. At present, the real time is taken as the average of some 50
      cesium-clocks around the world.
    7. A leap second is introduced from time to time to compensate for the fact that days are
      getting longer.

    8. Note
      UTC is broadcast through short wave radio and satellite. Satellites can give
      an accuracy of about ±0.5 ms.

      Suppose we have a distributed system with a UTC-receiver
      somewhere in it => we still have to distribute its time to each machine.

      Basic principle

    9. Every machine has a timer that generates an interrupt H times per second.
    10. There is a clock in machine p that ticks on each timer interrupt.
      Denote the value of that clock by Cp(t), where t is UTC time.
    11. Ideally, we have that for each machine p, Cp(t) = t, or, in other
      words, dC/dt = 1.

      In practice: 1 - r ≤ dC / dt ≤ 1 + r.

      Never let two clocks in any system differ by more than δ time units =>
      synchronize at least every δ/(2r) seconds.
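The resynchronization bound can be checked with a few lines of arithmetic. This is a minimal sketch: the function name and the sample values of δ and r are assumptions chosen for illustration, not figures from these notes.

```python
def resync_interval(delta: float, r: float) -> float:
    """Longest time (in seconds) between synchronizations such that two clocks,
    each with drift rate at most r, never differ by more than delta seconds."""
    if delta <= 0 or r <= 0:
        raise ValueError("delta and r must be positive")
    # Worst case: one clock runs at rate 1 + r and the other at 1 - r,
    # so the two drift apart by 2 * r seconds per second of UTC.
    return delta / (2 * r)


# Hypothetical numbers: clocks must agree within 1 ms, drift rate is 1e-5.
interval = resync_interval(delta=1e-3, r=1e-5)
print(f"resynchronize at least every {interval:.1f} s")
```

The factor 2 appears because the bound must hold for a pair of clocks drifting in opposite directions.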

    12. Global positioning system

      Basic idea
      You can get an accurate account of time as a side-effect of GPS.

      Assuming that the clocks of the satellites are accurate and synchronized:

    13. It takes a while before a signal reaches the receiver
    14. The receiver's clock is definitely out of synch with the satellite
    15. Principal operation

    16. Δr : unknown deviation of the receiver's clock.
    17. xr , yr , zr : unknown coordinates of the receiver.
    18. Ti : timestamp on a message from satellite i.
    19. Δi = ( Tnow - Ti ) + Δr : measured delay of the message sent by satellite i.
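    20. Measured distance to satellite i: c × Δi
      ( c is the speed of light )
    21. Real distance is
      di = c × Δi - c × Δr = sqrt( ( xi - xr )² + ( yi - yr )² + ( zi - zr )² )
      where xi , yi , zi are the known coordinates of satellite i.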
    22. Observation
      4 satellites => 4 equations in 4 unknowns ( with Δr as one of them )

      Clock Synchronization Principle

      Principle I
      Every machine asks a time server for the accurate time at least once
      every δ/(2r) seconds (Network Time Protocol).

      Okay, but you need an accurate measure of round trip delay, including
      interrupt handling and processing incoming messages.

      Principle II
      Let the time server scan all machines periodically, calculate an
      average, and inform each machine how it should adjust its time relative
      to its present time.

      Okay, you'll probably get every machine in sync. You don't even need
      to propagate UTC time.

      You'll have to take into account that setting the time back is never
      allowed => smooth adjustments.
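Principle II can be sketched as follows. The function name, the machine names and the clock readings are hypothetical; the sketch only computes the adjustments and leaves out polling and round-trip delay.

```python
def berkeley_adjustments(server_time, reported):
    """Principle II: average all clocks (server included) and return, for each
    machine, the amount it should add to its present time."""
    clocks = {"server": server_time, **reported}
    average = sum(clocks.values()) / len(clocks)
    return {name: average - value for name, value in clocks.items()}


# Hypothetical readings, in minutes past midnight: 3:00, 3:25 and 2:50.
adjust = berkeley_adjustments(180.0, {"m1": 205.0, "m2": 170.0})
for name, delta in adjust.items():
    print(name, f"{delta:+.0f} min")
```

A negative adjustment must not be applied by setting the clock back; the machine would instead slow its clock until the difference is absorbed.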

    23. Lamport's Logical Clock

      We first need to introduce a notion of ordering before we can order anything.

      The happened before → relation

    24. a → b , if a and b are events in the same process and a occurred before b
    25. a → b , if a is the event of sending a message m in a process and b is the event of receipt of the same message m by another process
    26. if a → b and b → c, then a → c ( transitive )

    27. event a causally affects b if a → b
    28. concurrent: a || b if !( a → b ) and !( b → a )
    29. for any two events in a system, either a → b or b → a or a || b
      e11 → e12   , e12 → e22
      e21 → e13   , e14 || e24
    30. Realization

      To realize the relation → we need a clock Ci at each
      process Pi in the system, and adjust the clock according
      to the following rules.

      Ci(a) -- timestamp of event a at Pi
      if a → b, then C(a) < C(b)

      Condition requirements:

      1. for any two events a and b in a process Pi,
        if a occurs before b, then Ci(a) < Ci(b)
      2. if a is the event of sending a message m in Pi
        and b is the event of receiving the same message m
        at process Pj, then
        Ci(a) < Cj(b)

      Implementation rules:

      1. if a and b are two successive events in Pi and a → b, then Ci(b) = Ci(a) + d ( d > 0 );
        i.e. Ci is incremented between any two successive events in Pi: Ci = Ci + d
      2. if event a is the sending of message m by process Pi, then m is assigned the
        timestamp tm = Ci(a); on receiving m, process Pj sets Cj = max ( Cj, tm + d )    ( d > 0 )

        → is irreflexive and defines a partial order among events

        Totally ordering relation ( => ) can be defined by ( on top of the above )

        a is any event in process Pi
        b is any event in process Pj
        a => b iff
          either Ci(a) < Cj(b)
          or Ci(a) = Cj(b) and Pi ≺ Pj ( ≺ is any relation that totally orders the processes, e.g. Pi ≺ Pj if i ≤ j, to break ties )

    31. Limitation of Lamport's Clocks
      if a → b then C(a) < C(b)
      but C(a) < C(b) does not necessarily imply a → b
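The implementation rules and the limitation can be seen in a small sketch. The class and method names are assumptions made for illustration. On receipt the sketch uses max(Cj, tm) + d, i.e. rule 2 combined with rule 1, so that the receive event is also later than the previous local event at Pj.

```python
class LamportClock:
    """Logical clock Ci of one process Pi (d is the increment, d > 0)."""

    def __init__(self, pid, d=1):
        self.pid = pid
        self.d = d
        self.time = 0

    def local_event(self):
        """Rule 1: successive events in Pi differ by d."""
        self.time += self.d
        return self.time

    def send(self):
        """A send is an event; its timestamp tm travels with the message."""
        return self.local_event()

    def receive(self, tm):
        """Rule 2 applied together with rule 1: Cj = max(Cj, tm) + d."""
        self.time = max(self.time, tm) + self.d
        return self.time


def totally_before(a, b):
    """Total order => on (timestamp, pid) pairs; process ids break ties."""
    return a < b  # tuples compare timestamp first, then pid


p1, p2, p3 = LamportClock(1), LamportClock(2), LamportClock(3)
a = p1.send()          # event a at P1
b = p2.receive(a)      # event b at P2, so a → b and C(a) < C(b)
e = p3.local_event()   # event e at P3, concurrent with both a and b
print(a, b, e)
```

Here C(e) < C(b) even though e and b are concurrent, which is exactly the limitation stated above: the timestamps alone cannot tell whether two events are causally related.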

    32. Positioning of Lamport's logical clocks in distributed systems:

      Example: Totally Ordered Multicasting

        See Figure of inconsistent database update below.

    33. Vector Clocks

      n = number of processes in a distributed system
      Each event in process Pi ~ vector clock Ci ( integer vector of length n )

      Ci = ( Ci[1], Ci[2], ..., Ci[n] )
    34. Ci[i] ~ Pi's own logical clock
    35. Ci[j] ~ Pi's best guess of logical time at Pj. More precisely, the time of occurrence of the last event at Pj which "happened before" the current point in time at Pi
    36. Ci(a) is referred to as the timestamp of event a at Pi
    37. Comparing two vector timestamps of events a and b

      Equal                        ta = tb    iff   all i,   ta[i] = tb[i]
      Not Equal                    ta ≠ tb    iff   some i,  ta[i] ≠ tb[i]
      Less Than or Equal           ta ≤ tb    iff   all i,   ta[i] ≤ tb[i]
      Not Less Than or Equal To    ta ≰ tb    iff   some i,  ta[i] > tb[i]
      Less Than                    ta < tb    iff   ( ta ≤ tb and ta ≠ tb )
      Not Less Than                ta ≮ tb    iff   !( ta ≤ tb and ta ≠ tb )
      Concurrent                   ta || tb   iff   ta ≮ tb and tb ≮ ta

      Implementation Rules:

      1. two successive events a, b in process Pi: Ci(b)[i] = Ci(a)[i] + d    ( d > 0 )
      2. event a at Pi sends message m to process Pj, where b is the receiving event; the vector timestamp tm = Ci(a) is assigned to m; on receiving m, Pj updates Cj as follows: all k, Cj(b)[k] = max ( Cj(b)[k], tm[k] )

      At any instant:   all i, all j :  Ci[i] ≥ Cj[i]
      ( no process knows more about the events at Pi than Pi itself )

      Events are causally related if ta < tb or tb < ta

      Now, a → b   iff   ta < tb
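A minimal sketch of vector clocks and the comparison rules, assuming d = 1 by default and processes numbered from 0 (both choices, and all names, are assumptions for illustration).

```python
class VectorClock:
    """Vector clock Ci of process pid in a system of n processes."""

    def __init__(self, pid, n, d=1):
        self.pid = pid
        self.d = d
        self.c = [0] * n

    def event(self):
        """Rule 1: advance the process's own component."""
        self.c[self.pid] += self.d
        return tuple(self.c)

    def send(self):
        """A send is an event; the returned vector is the timestamp tm."""
        return self.event()

    def receive(self, tm):
        """Rule 1 for the receive event, then rule 2: component-wise max."""
        self.c[self.pid] += self.d
        self.c = [max(x, y) for x, y in zip(self.c, tm)]
        return tuple(self.c)


def leq(ta, tb):
    return all(x <= y for x, y in zip(ta, tb))


def less(ta, tb):
    return leq(ta, tb) and ta != tb


def concurrent(ta, tb):
    return not less(ta, tb) and not less(tb, ta)


p0, p1, p2 = (VectorClock(i, 3) for i in range(3))
a = p0.send()        # (1, 0, 0)
b = p1.receive(a)    # (1, 1, 0)
e = p2.event()       # (0, 0, 1)
print(less(a, b), concurrent(b, e))
```

Unlike Lamport timestamps, the comparison recovers causality: a → b shows up as ta < tb, and the unrelated event e is reported as concurrent with b.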

    38. Global State
    39. no global clock, no global memory
    40. To determine a global system state, a process p must enlist the cooperation of other processes that must record their states and send the recorded local states to p

    41. processes cannot record their local states at precisely the same instant unless they have access to a common clock

    42. the global-state-detection algorithm is to be superimposed on the underlying computation; it must run concurrently with but not alter the underlying computation

    43. Distributed system: a finite set of processes and
      a finite set of channels

      process state, channel state

      Example: Updating a replicated database and leaving it in an inconsistent state.

    44. Update 1 : Add $100 to $1000
    45. Update 2 : Calculate interest
    46. At San Francisco ( Update 1 first ): Add $100 to $1000, then calculate interest.
    47. At New York ( Update 2 first ): Calculate interest on $1000, then add $100.
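The inconsistency is easy to reproduce numerically. The notes do not give an interest rate, so the 1% used here is an assumption, as are the function names.

```python
from fractions import Fraction

RATE = Fraction(1, 100)   # assumed interest rate of 1%; not given in the notes


def add_100(balance):
    """Update 1: add $100."""
    return balance + 100


def add_interest(balance):
    """Update 2: calculate interest at the assumed rate."""
    return balance * (1 + RATE)


start = Fraction(1000)
san_francisco = add_interest(add_100(start))   # Update 1, then Update 2
new_york = add_100(add_interest(start))        # Update 2, then Update 1
print(san_francisco, new_york)                 # the two replicas now disagree
```

The two updates do not commute, so the replicas stay consistent only if every site applies them in the same order.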
    48. Some definitions

    49. LSi -- local state of Si ( site ), i.e. the collection of events that have occurred at Si

    50. events -- send( mij ), recv( mij )

    51. time ( x ) -- time at which state x was recorded
      e.g. time ( LSi )

    52. send ( mij ) ∈ LSi iff time ( send ( mij ) ) < time ( LSi )

    53. recv ( mij ) ∈ LSj iff time ( recv ( mij ) ) < time ( LSj )

    54. transit ( LSi, LSj ) = { mij | send( mij ) ∈ LSi ∧ recv( mij ) ∉ LSj }
      i.e. messages in the channel

    55. inconsistent ( LSi, LSj ) = { mij | send( mij ) ∉ LSi ∧ recv( mij ) ∈ LSj }

    56. Global State GS = { LS1, LS2, ..., LSn }
      i.e. collection of local states ( may be consistent or inconsistent )

    57. Consistent Global State: A global state GS = { LS1, LS2, ..., LSn } is consistent iff all i, all j: 1 ≤ i, j ≤ n :: inconsistent( LSi, LSj ) = ∅

    58. Transitless global state: A global state is transitless iff all i, all j: 1 ≤ i, j ≤ n :: transit( LSi, LSj ) = ∅

    59. Strongly consistent global state: A global state is strongly consistent if it is consistent and transitless.
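The definitions above translate directly into set operations. In this sketch a local state is a set of ("send", m) and ("recv", m) events and a message m is a (sender, receiver, label) tuple; this representation and the function names are assumptions made for illustration.

```python
def transit(ls, i, j):
    """Messages from Si to Sj recorded as sent in LSi but not received in LSj."""
    return {m for kind, m in ls[i]
            if kind == "send" and m[1] == j and ("recv", m) not in ls[j]}


def inconsistent(ls, i, j):
    """Messages from Si recorded as received in LSj but not sent in LSi."""
    return {m for kind, m in ls[j]
            if kind == "recv" and m[0] == i and ("send", m) not in ls[i]}


def classify(ls):
    """Classify a global state given as {site id: local state}."""
    pairs = [(i, j) for i in ls for j in ls if i != j]
    if any(inconsistent(ls, i, j) for i, j in pairs):
        return "inconsistent"
    if any(transit(ls, i, j) for i, j in pairs):
        return "consistent"
    return "strongly consistent"


m12 = (1, 2, "m")   # hypothetical message from S1 to S2
print(classify({1: {("send", m12)}, 2: {("recv", m12)}}))   # strongly consistent
print(classify({1: {("send", m12)}, 2: set()}))             # consistent
print(classify({1: set(), 2: {("recv", m12)}}))             # inconsistent
```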

    60. Causal Ordering of Messages

      if Send( M1 ) → Send( M2 )
      then the recipient should receive M1 before M2

      i.e. Send( M1 ) → Send( M2 ) requires Receive( M1 ) → Receive( M2 )

      Figure: Violation of causal ordering of messages

    61. Applications: database replication management, monitoring distributed computations, simplifying distributed algorithms,...

    62. Solution idea: upon arrival of a message at a process, buffer (delay delivery) the message until the message immediately preceding it is delivered

    63. Birman-Schiper-Stephenson Protocol: Enforcing Causal Ordering of Messages

      Assumes broadcast communication channels that do not lose or corrupt messages. ( i.e. everyone talks to everyone ). Use vector clocks to "count" number of messages ( i.e. set d = 1 ). n processes.

      Vector Time:

      1. When Pi begins to execute, Ci is initialized to zeros.
      2. For each event send( m ) at Pi, Ci[i] is incremented by 1.
      3. Time stamp tm = Ci is sent along with m.
      4. When process Pj delivers a message m from Pi, Pj updates its vector clock: all k ∈ {1, 2, ..n} : Cj[k] = max ( Cj[k], tm[k] ) ( Note: Recv ( m ) -> Deliver ( m ) )

      The Protocol:

      1. Process Pi updates vector time Ci and broadcasts message m with timestamp tm = Ci.
        So Ci[i] - 1 is the number of messages sent before m.

        (Note: A process updates its value of the vector clock only when it sends a message.
        It doesn't update its own value when receiving a message; it adjusts the vector clock when it delivers the message. )
      2. Upon receiving message m with timestamp tm from Pi, process Pj ( j ≠ i ) buffers the message until
        • all messages sent by Pi preceding m have arrived, i.e. Cj[i] = tm[i] - 1

        • Pj has received all messages that Pi had received before sending m, i.e. Cj[k] ≥ tm[k]    k = 1, 2, .. n, k ≠ i
      3. When the message is finally delivered at Pj, vector time Cj is adjusted according to vector clock rule 2. Do not use rule 1 here.
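The buffering rule of the protocol can be sketched as below. Class and method names are assumptions, processes are numbered from 0, and the "network" is simply the order in which receive is called.

```python
class BSSProcess:
    """One process Pj running the causal-ordering protocol above (sketch)."""

    def __init__(self, pid, n):
        self.pid = pid
        self.clock = [0] * n      # vector time Cj, counts delivered messages
        self.buffer = []          # received but not yet deliverable
        self.delivered = []       # payloads in delivery order

    def broadcast(self, payload):
        """Step 1: increment own entry and stamp the message with tm = Ci."""
        self.clock[self.pid] += 1
        return (self.pid, tuple(self.clock), payload)

    def _deliverable(self, sender, tm):
        """Step 2: both buffering conditions must hold."""
        return (self.clock[sender] == tm[sender] - 1 and
                all(self.clock[k] >= tm[k]
                    for k in range(len(tm)) if k != sender))

    def receive(self, msg):
        """Buffer msg, then deliver everything that has become deliverable."""
        self.buffer.append(msg)
        progress = True
        while progress:
            progress = False
            for m in list(self.buffer):
                sender, tm, payload = m
                if self._deliverable(sender, tm):
                    self.buffer.remove(m)
                    # Step 3: vector clock rule 2 only (component-wise max).
                    self.clock = [max(a, b) for a, b in zip(self.clock, tm)]
                    self.delivered.append(payload)
                    progress = True


p0, p1, p2 = (BSSProcess(i, 3) for i in range(3))
m1 = p0.broadcast("m1")
p1.receive(m1)             # P1 delivers m1 ...
m2 = p1.broadcast("m2")    # ... and then sends m2, so Send(m1) → Send(m2)
p2.receive(m2)             # arrives first at P2: buffered
p2.receive(m1)             # m1 is delivered, which releases m2
print(p2.delivered)
```

P2 receives m2 before m1 but delivers them in causal order, because m2's timestamp shows that its sender had already delivered one message from P0.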


      Schiper-Eggli-Sandoz were able to solve the problem without broadcasting channels

    64. Global-State-Detection Algorithm: send a special message called a marker

      Chandy-Lamport Global State Recording Protocol ( Snapshot Algorithm )

      The goal of this distributed algorithm is to capture a consistent global state. It assumes all communication channels are FIFO. It uses a distinguished message called a marker to start the algorithm.

    65. Pi sends marker
      1. Pi records its local state
      2. For each channel Cij on which Pi has not already sent a marker, Pi sends a marker before sending other messages.

    66. Pj receives marker from Pi
      1. If Pj has not recorded its state:
        • a) Records the state of Cij as empty
        • b) Sends the marker as described above ( Note: it records local state before sending out marker )

      2. If Pj has already recorded its local state LSj:
        • a) Records the state of Cij as the sequence of messages received on Cij between the recording of LSj and the arrival of the marker on Cij.


      In this example, all processes are connected by communications channels Cij. Messages being sent over the channels are represented by arrows between the processes.

      Snapshot s1:

      • P1 records LS1, sends markers on C12 and C13
      • P2 receives marker from P1 on C12; it records its state LS2, records state of C12 as empty, and sends markers on C21 and C23
      • P3 receives marker from P1 on C13; it records its state LS3, records state of C13 as empty, and sends markers on C31 and C32.
      • P1 receives marker from P2 on C21; as LS1 is recorded, it records the state of C21 as empty.
      • P1 receives marker from P3 on C31; as LS1 is recorded, it records the state of C31 as empty.
      • P2 receives marker from P3 on C32; as LS2 is recorded, it records the state of C32 as empty.
      • P3 receives marker from P2 on C23; as LS3 is recorded, it records the state of C23 as empty.

      Snapshot s2: now a message is in transit on C12 and C21.

      • P1 records LS1, sends markers on C12 and C13
      • P2 receives marker from P1 on C12 after the message from P1 arrives; it records its state LS2, records state of C12 as empty, and sends markers on C21 and C23
      • P3 receives marker from P1 on C13; it records its state LS3, records state of C13 as empty, and sends markers on C31 and C32.
      • P1 receives marker from P2 on C21; as LS1 is recorded, and a message has arrived since LS1 was recorded, it records the state of C21 as containing that message.
      • P1 receives marker from P3 on C31; as LS1 is recorded, it records the state of C31 as empty.
      • P2 receives marker from P3 on C32; as LS2 is recorded, it records the state of C32 as empty.
      • P3 receives marker from P2 on C23; as LS3 is recorded, it records the state of C23 as empty.

      The recorded process states and channel states must be collected and assembled to form the global state. ( e.g. send G.S. to all processes in finite time )

      each process must ensure that

      • no marker remains forever in an incident input channel
      • it records its state within finite time of initiation of the algorithm
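The marker rules can be exercised in a small single-threaded simulation. Everything concrete here is an assumption for illustration: the application state is an account balance, computation messages are money transfers, FIFO channels are deques, and messages are delivered round-robin over the channels. The scenario loosely follows snapshot s2, with one message in transit on C12 and one on C21 when P1 starts the algorithm.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, pid, peers, balance):
        self.pid = pid
        self.peers = peers          # ids of the other processes
        self.balance = balance      # application state
        self.recorded = None        # LSi, once recorded
        self.channel_state = {}     # sender id -> messages recorded for C(sender, pid)
        self.recording = set()      # incoming channels still being recorded

    def record_and_send_markers(self, net):
        """Record the local state, then send a marker on every outgoing channel."""
        self.recorded = self.balance
        self.channel_state = {p: [] for p in self.peers}
        self.recording = set(self.peers)
        for p in self.peers:
            net[(self.pid, p)].append(MARKER)

    def on_message(self, sender, msg, net):
        if msg == MARKER:
            if self.recorded is None:
                self.record_and_send_markers(net)   # channel from sender stays empty
            self.recording.discard(sender)          # stop recording that channel
        else:
            self.balance += msg                     # underlying computation
            if self.recorded is not None and sender in self.recording:
                self.channel_state[sender].append(msg)


def transfer(procs, net, src, dst, amount):
    procs[src].balance -= amount
    net[(src, dst)].append(amount)


def deliver_all(net, procs):
    """Deliver the head of each non-empty channel in turn until all are empty."""
    while any(net.values()):
        for (src, dst) in sorted(net):
            if net[(src, dst)]:
                procs[dst].on_message(src, net[(src, dst)].popleft(), net)


ids = [1, 2, 3]
procs = {i: Process(i, [j for j in ids if j != i], 100) for i in ids}
net = {(i, j): deque() for i in ids for j in ids if i != j}
transfer(procs, net, 1, 2, 10)            # in transit on C12
transfer(procs, net, 2, 1, 20)            # in transit on C21
procs[1].record_and_send_markers(net)     # P1 initiates the snapshot
deliver_all(net, procs)

states = {i: p.recorded for i, p in procs.items()}
in_transit = {(s, i): msgs for i, p in procs.items()
              for s, msgs in p.channel_state.items() if msgs}
print(states, in_transit)
```

The recorded local states plus the recorded channel contents add up to the initial total of 300, which is what a consistent global state should give for this computation.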
    67. Cuts of a distributed Computation

      Graphical representation of GS

      C = { c1, c2, ... ,cn }
      ci -- cut event, local state of site ( or process ) Si at that instant

    68. Consistent Cut:

      all Si, all Sj, no ei, no ej such that

        ( ei → ej ) and ( ej → cj ) and !( ei → ci )

        i.e. every message received before a cut event was sent before the cut event
        at the sender site in the cut.

      Inconsistent Cut

    69. Theorem
        A cut C = { c1, c2, ... ,cn } is a consistent cut iff no two cut events are causally related. ( i.e. every pair of cut events is concurrent )

      Time of a cut

        C = { c1, c2, ... ,cn }

        Ci -- vector clock of ci

        TC = sup ( C1, C2, ... , Cn )

        TC[k] = max ( C1[k], C2[k], ... , Cn[k] )

    70. Theorem
      if C = { c1, c2, ... ,cn } is a cut with vector time TC, then the cut is consistent iff
      TC = ( C1[1], C2[2], ... , Cn[n] )
        -------------- (1)
      If C is a consistent cut, then all its events are concurrent. Thus Ci[i] ≥ Cj[i] for all i, j and hence
      TC = sup ( C1, C2, ... , Cn ) = ( C1[1], C2[2], ... , Cn[n] )

      On the other hand if (1) is true
      we have Ci[i] ≥ Cj[i] for all i, j. This implies that the events ci are concurrent and the cut is consistent.
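Condition (1) gives a direct test for consistency. A minimal sketch, with hypothetical function names and sample vector clocks; indices start at 0 in the code, so clocks[i] plays the role of the vector clock of the cut event at site i + 1.

```python
def cut_time(clocks):
    """TC = sup(C1, ..., Cn): the component-wise maximum of the cut events' clocks."""
    return tuple(max(column) for column in zip(*clocks))


def is_consistent_cut(clocks):
    """Condition (1): TC must equal (C1[1], C2[2], ..., Cn[n])."""
    diagonal = tuple(clocks[i][i] for i in range(len(clocks)))
    return cut_time(clocks) == diagonal


# Hypothetical cuts over three sites.
good = [(2, 0, 0), (1, 3, 0), (0, 0, 1)]
bad = [(1, 0, 0), (2, 3, 0), (0, 0, 1)]   # site 2 has seen an event of site 1 outside the cut
print(is_consistent_cut(good), is_consistent_cut(bad))
```

In the second cut, the second site's clock shows knowledge of two events at the first site, while the first site's cut event is only its first, so a message was received inside the cut but sent outside it.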

    71. Termination Detection

      System Model

    72. A process may be in either an active or an idle state.
    73. An idle process becomes active upon receiving a computation message.
    74. If all processes are idle and no messages are in transit => the computation has terminated.
    75. Huang's Termination Detection Protocol:

    76. The goal of this protocol is to detect when a distributed computation terminates.
    77. n processes
    78. Pi: process i; without loss of generality, let P0 be the controlling agent
    79. Wi: weight of process Pi; initially, W0 = 1 and Wi = 0 for all other i.
    80. B(W): computation message with assigned weight W
    81. C(W): control message sent from a process to the controlling agent with assigned weight W
    82. Protocol

    83. an active process Pi sends a computation message to Pj
      1. Set Wi' and Wij to values such that Wi' + Wij = Wi,
        Wi' > 0, Wij > 0. (Wi' is the new weight of Pi.)
      2. Send B(Wij) to Pj
    84. Pj receives a computation message B(Wij) from Pi
      1. Wj = Wj + Wij
      2. If Pj is idle, Pj becomes active

    85. Pi becomes idle by:
      1. Send C(Wi) to P0 ( or to another Process )
      2. Wi = 0
      3. Pi becomes idle

    86. Pi receives a control message C(W):
      1. Wi = Wi + W
      2. If Wi = 1, the computation has completed.
    87. Example

    88. The picture shows a process P0, designated the controlling agent, with W0 = 1. It asks P1 and P2 to do some computation. It sets
        W01 = 0.2
        W02 = 0.3
        W0 = 0.5
    89. P2 in turn asks P3 and P4 to do some computations. It sets
        W23 = 0.1
        W24 = 0.1

    90. When P3 terminates, it sends C(W3) = C(0.1) to P2, which changes W2 to 0.1 + 0.1 = 0.2.

    91. When P2 terminates, it sends C(W2) = C(0.2) to P0, which changes W0 to 0.5 + 0.2 = 0.7.

    92. When P4 terminates, it sends C(W4) = C(0.1) to P0, which changes W0 to 0.7 + 0.1 = 0.8.

    93. When P1 terminates, it sends C(W1) = C(0.2) to P0, which changes W0 to 0.8 + 0.2 = 1.

    94. P0 thereupon concludes that the computation is finished.

      Total number of messages passed: 8 (one to start each computation, one to return the weight).
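The example can be replayed in code. This is a sketch under simplifying assumptions: messages are delivered instantly, all names are hypothetical, and weights are kept as exact fractions because floating-point sums such as 0.5 + 0.2 + 0.1 + 0.2 need not compare equal to 1.

```python
from fractions import Fraction


class Process:
    def __init__(self, name, weight=0):
        self.name = name
        self.weight = Fraction(weight)
        self.active = self.weight > 0


messages = []   # log of every B(W) and C(W) sent


def send_computation(src, dst, share):
    """Split Wi into Wi' + Wij (both > 0) and send B(Wij) to Pj."""
    share = Fraction(share)
    assert src.active and 0 < share < src.weight
    src.weight -= share
    messages.append(("B", src.name, dst.name, share))
    dst.weight += share       # Pj receives B(Wij)
    dst.active = True


def become_idle(src, dst):
    """Send C(Wi) to dst, then set Wi = 0 and become idle."""
    messages.append(("C", src.name, dst.name, src.weight))
    dst.weight += src.weight  # dst receives C(W)
    src.weight = Fraction(0)
    src.active = False


p = [Process(f"P{i}", 1 if i == 0 else 0) for i in range(5)]
send_computation(p[0], p[1], "1/5")    # W01 = 0.2
send_computation(p[0], p[2], "3/10")   # W02 = 0.3, leaving W0 = 0.5
send_computation(p[2], p[3], "1/10")   # W23 = 0.1
send_computation(p[2], p[4], "1/10")   # W24 = 0.1
become_idle(p[3], p[2])                # W2 becomes 0.2
become_idle(p[2], p[0])                # W0 becomes 0.7
become_idle(p[4], p[0])                # W0 becomes 0.8
become_idle(p[1], p[0])                # W0 becomes 1
terminated = p[0].weight == 1
print(terminated, len(messages))
```

The invariant behind the protocol is that the weights held by processes and carried by messages always sum to 1, so W0 = 1 can only occur when every other process is idle and no message is in transit.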