Unreliable Broadcast Service

  1. Introduction
  2. Design Philosophy

  • Autonomous Clients
  • Split Data Paths
  • TCP
  • UDP Multicast
  • Buffering
  • Retransmission (NAK)
  • Client HeartBeat (ACK)
  • Server HeartBeat
  • Synchronisation of UDP/TCP data streams
  • Security
  • Implementation

  • Language
  • Event Driven
  • Shared Data
  • Timing
  • Data Paths (skirmish level)
  • Data Structures
  • Protocol
  • Supporting functions
  • Forward Strategy
  • Glossary

    1. Introduction

    2. Requirement

      "A communications sub-system is required
      to provide sequenced bi-directional data transfer
      between the Central Server of a Trading System
      and each of the outlying Traders' workstations."


      In this Service, the unit of transfer is the Message.
      In order to limit the bandwidth needed to provide the required service, common information may be sent to all Traders by means of multicast messages via UDP.
      The Unreliable Broadcast Service was developed in the clear understanding
      that UDP communications are inherently unreliable:
      therefore a user-level protocol layer is needed to help meet the Project Requirements of guaranteed delivery and fairness to all Clients.
      The sub-system allows access to registered users & monitors the integrity of connections.


      Compliance People - Please note:
      "If a communications line is poor, then you cannot guarantee delivery; and fairness is totally unachievable".


    3. Design Philosophy


      1. Autonomous Clients

      2. Each Client is autonomous, effectively behaving as if there were no other Client.
        A Client may notice a WindBack which was requested by another Client. This should be ignored, unless it contains any message(s) needed by this Client.
        Several Clients may share a WAN comms line. In this case the bandwidth available should not be exceeded by the aggregate data volume sent, in either direction.

      3. Split Data Paths

      4. The approach taken is to use UDP multicasts from UB server to all UB Clients for public messages and to use TCP unicasts from Server to a Client or Client to Server for "private" messages.

      5. TCP

      6. This protocol automatically guarantees delivery & sequency for unicasts in each direction. At least, up to the limitations of TCP/IP stream responsiveness.

      7. UDP Multicast

      8. UDP multicasts are unreliable.
        UB Server/Clients must bolster-up the delivery & sequency performance of UDP.
        Messages may be undelivered or delivered twice (or more) or delivered out-of-sequence.
        Therefore "Sequence Numbers" are added to UDP messages by the Ubs so that the receiving Ubc's can re-establish sequency, ignore repetitions & request re-transmission of any messages not received.

      9. Buffering

      10. Unicast messages are stored (in a transmit Q) until they are sent.
        When the write() sys-call returns satisfactorily, the message buffer is free()'d (because TCP guarantees delivery, unless the channel dies completely).
        Multicast messages are stored until all connected Clients have received them.
        Of course, if any Client gives up and disconnects we no longer need to store any messages for that Client.
        Received messages are stored in a receive Q until they can be forwarded appropriately.

      11. Retransmission (NAK)

      12. If a Client realises that it has missed one or more messages it should request retransmission of the appropriate message(s).
        This is the Negative AcKnowledgement part of the protocol.
        When the UBServer receives a NAK it initiates a re-transmission procedure.
        UB Client should re-request re-transmission if a WindBack is not noticed within a reasonable time after each request.

      13. Client HeartBeat (ACK)

      14. Periodically, each Client sends a HeartBeat message to UBServer to indicate the latest UDP message received (with no prior missing messages).
        This is the positive ACKnowledgement part of the protocol.
        UB Server uses this information to free space in the Multicast Q (Purge operation).

      15. Server HeartBeat (belt & braces)

      16. As the ACK information is not used to request retransmission, the UB Server periodically issues a HeartBeat to allow a dozy Client another chance to realise that it has missed something.
        If any Client does not receive the Server HeartBeats, it probably would not receive re-transmitted data & so eventually it will either be disconnected, because the TCP stream fails or it will be tailed off by the MultQ filling up.

      17. Synchronisation of UDP/TCP data streams

      18. It is necessary to re-synchronise the arrival of unicast & multicast messages at each UB Client.
        For this, the Sync part of the protocol is used.
        When the UB Server notices that an incoming message from CS is of a different mode to the previous message from CS, the UB Server sends a Sync message on the current stream before sending the new message. If the current mode is Unicast, then the message must be sent to all connected Clients.
        At the UB Client, receipt of Sync message indicates that the WEP output stream should switch to the other input mode and, if necessary, wait for further messages on that stream.

      19. Security

        1. Authentication
        2. A Ubc will initiate communications by performing a TCP connect to one of its Ubs host machines.
          A Ubs which knows it is a Slave should direct the Client to the current Master.
          Minimal authentication is performed by the Ubs - merely checking the IP address of the caller.
          An invader should be disconnected.
          It is left to the Trading system to perform, say, a DNA check on the Trader.

        3. Eavesdropping
        4. It is possible for messages to be received by an unauthorised host on the network.
          This would provide a data-feed from the Trading System. If this is a serious problem, the message data may be encrypted/decrypted.
          Alternatively, or additionally, part of the Multicast Receive Address may be specified pseudo-randomly by the Ubs.
          However, the Multicast Receive Address should not be changed during Operation an certainly not during Trading.
          Suggestion: use byte-2 of IP address as Trading System Identifier,
          byte-3 as randomly chosen at Ubs startup time
          and byte-4 as low byte of Ubs host IP address.
          Note that only the low 23 bits of IP address are mapped to Ethernet multicast address.


    4. Implementation


      1. Language
        1. Coded in C, with // comments for clarity
        2. The project uses ANSI-C to ease maintainability
        3. whilst avoiding the need for substantial staff re-training.

        4. Compiled with `xlc -g -qlanglvl=extended -qcpluscmt' (the local C compiler) to allow use of // -style comments.
        5. This apparently upset "insight", so David S. changed the comment style to /*...*/, i.e. K&R C.

      2. Event Driven
        1. The UB Server & Client are each completely event-driven.
        2. Events are handed to the UB Server/Client user-level program by the select() sys-call.
        3. Server/Client deals with events one-at-a-time.
        4. When an event occurs:
          • checks are made to establish what can be done
          • appropriate non-blocking operations do as much as they can
          • if necessary, request further action.

        5. Input/output is strictly non-blocking, to allow other data streams in the Server/Client to continue independently.
        6. Interaction between different parts of the Server/Client is by means of shared data areas, message queues & timers.

      3. Shared Data
        1. Clients[] array of parameters on a per-Client basis.
        2. poll_a, poll_b, poll_c fd_set's for select() & transmit control.
        3. MultQ, the multicast circular buffer of size MULT_SIZE.
        4. Multicast static pointers.

      4. Timing
        1. Timing Sub-system provides a one-shot timer for each connected Client in Clients[].
        2. The timer may be used as:
          • A periodic clock (e.g. for heartbeats) by re-triggering the timer with the required period.
          • A "choke" timer (used to prevent writing to WAN when router buffer space may be insufficient).
          • A "time-'til-end" timer (used to propagate knowledge of timing from one write attempt to next write attempt on that stream).

        3. Timers are only used by write() & sendto() related operations.
          Timers are not used for read() & recvfrom() operations. (They are asynchronous, with RxData interrupts directed to application by select() sys-call.)
        4. Timing is not critical, provided no delay is shortened.
          Reduction of time delays is likely to cause catastrophic fall in efficiency of circuit use.
          Relaxation of timing merely reduces the efficiency of circuit use, on a pro-rata basis.
        5. Multiple channel timeouts are handled by using a list of incremental timeouts and by only dealing with the first in the list.
        6. The First relative timing is handled by the select() timeout parameter.
        7. Sometimes the timer will be interrupted by `ready for i/o', so the timer list has to be updated to the "time remaining" from the select() sys-call and restored after the i/o has been performed.

      5. Data Paths (skirmish level)
        1. Setup/tear down of comms channel
          1. Listening for incoming connections
            • Ubs listens for TCP connections from Ubc's and FallBack Ubs.
            • Ubc listens for TCP connection from WEP.

          2. Accepting incoming connections
            • Ubs accepts incoming TCP connections and checks for FallBack Ubs.
            • All other connections are treated as Ubc's and are validated by IP address
            • Ubs sends messages to Central Server to indicate new user ready.

          3. Connecting to remote hosts
            • Ubc attempts to connect to Ubs.
            • If it fails to connect, Ubc retries to connect after a delay.

          4. Disconnecting from remote hosts
            • Disconnection is performed implicitly by TCP.
            • The other end of the connection is notified and passes on the bad news.
              • If CS dies, Ubs closes the Ubc ports.
              • If a Ubc dies, the Ubs sends a User disconnect message to Central Server.
              • If a WEP dies, Ubc closes the comms port to Ubs.
              • If the Ubs port dies, Ubc should wait for cutover period, then close the WEP port if no Usurp attempt has been received.

        2. Unicast data path
          1. Read data from local TCP
          2. Check that message is destined for a valid IP address (UBServer end only)
          3. Q to appropriate remote TCP
          4. Transmit TCP via WAN
          5. Read data from remote TCP
          6. Q data to local TCP port
          7. Transmit TCP to local port

        3. Multicast data path (CS to all connected WEP's)
          1. Read data from local port as TCP message (CS)
          2. Q message to multicast (MultQ)
          3. Send as UDP multicast message via WAN
          4. All Ubc's read data from UDP receive port
          5. Q data to local TCP port (WEP)
          6. Transmit as TCP message to local port

        4. Messy Details
          1. New Client Connection (Con)
            • Ubc connects to Ubs
            • Ubs adds new Client to quota share, if on shared circuit
            • Wts requests info from CS
            • CS responds, but Ubs prepends an InitSync message to Ubc
            • Ubs adds new Client into the MultQ backmarker list
            • Ubc uses InitSync info to select unicast source & set the initial m/c seqno

          2. Client Disconnect (Disc)
            • Ubs removes Client from quota share, if on shared circuit
            • Ubs removes Client from the MultQ backmarker list
            • Ubs notifies Central Server of user death

          3. Limiting data rate via WAN (Choke)
            • Algorithm
              • The amount of data sent on a given circuit is monitored by the Ubs so that it can assess when the available bandwidth may be exceeded.
              • As the effects of sending more than the router can cope with are so catastrophic, we tend to err on the safe side.
              • The implementation relies on maintaining a timer (per channel) to indicate when the data which has been sent to the router so far will have been transmitted on the local Ethernet.
              • The unicast path is allowed only half the nominal bandwidth, in order to allow a (very comfortable) overhead for the TCP protocol

            • Quotas for circuit shared by 2 or more Clients
              • Where more than one Client share a given WAN circuit, the available bandwidth is shared on the assumption that all may send simultaneously. Clearly this is the worst case and a better algorithm could be used. However, it is safe and easy to understand.

          4. Retransmission (NAK)
            • Retran Request in UB Client
              • When Ubc notices that it has missed a multicast message, it should:
                • Send a Retran request to the Ubs
                • Start its Retran timer (in case it misses the retransmission)

              • WindBack in UB Server
                • When Ubs receives a Retran Request and it has not got any outstanding retransmissions, it should enter a WindWait state and start its Retran timer appropriately. WindWait is a period during which the Ubs continues sending multicasts and accepts Retran Requests from other Clients.
                • At the end of WindWait the Ubs goes into the state WindIdle and sets its Retran timer. WindIdle is a period when the Ubs stops sending multicasts to allow the WAN to cool off. (There may be a long burst of noise or jabber)
                • After WindIdle the Ubs winds back the MultQ and starts transmitting from the earliest message missed. This is back in the normal transmitting state.

            • Client HeartBeats (ACK)
              • Generation
                • Heartbeats are sent on a regular basis to indicate which multicast messages have been received without any missing.

              • Reception in UB Server
                • When a Heartbeat is received from a Client, the sequence number is used to update that Client's position in the Ubs backmarker list. This is a doubly-linked list of all Clients ordered by ACKnowledged sequence number.
                • If a Client which was the "Oldest" sequence number has progressed, then the MultQ may now be purged as far as the new "Oldest".

            • Server HeartBeats (B&B)
              • Generation
                • The Ubs sends heartbeats regularly to allow a Client to notice that it is not up-to-date with the multicasts.
                • This caters for the situation where the Client lost the last (few) message(s) sent before a hiatus in trading. In such a case, that Client would potentially have out-of-date information until some event caused the CS to send more multicasts.

              • Reception in UB Client
                • Server heartbeat is received as normal, possibly provoking a Retran request to the Ubs.
                • It is not forwarded to the WEP!

            • Synchronising UDP & TCP streams (Sync)
              • Generation in UB Server
                • The Ubs sends a Sync message whenever the next incoming message from CS changes from unicast to multicast, or vice versa.
                • When indicating the change from unicast to multicast the message must be sent to all connected Clients, even those to whom nothing was unicast.

              • Reception in UB Client
                • When Sync messages are received they are Q'd in position with the normal messages.
                • They are only acted upon when they come to be dQ'd in TxLoc(), i.e. ready to send to the WEP. At this point they are used to switch the source stream from multicast to unicast or vice versa, as appropriate.
                • Sync is used to allow each Client to recover the original sequency of multicast messages and unicast messages. InitSync is sent to indicate which multicast message is the first relevent message for a given Client.

      6. Data Structures
        1. Clients[]
          • Clients[] is an array of structures (one per potential Client).
          • The structure contains information which needs to be kept for that Client from one event to the next.
          •                     typedef struct Client           // used for connected Clients
                                {
                                    struct Client   *next;      // next Client in TxTimer list
                                    u_long          in_addr;    // inet address (in host form)
                                    u_short         fd;         // file descriptor used for this Client
                                    u_short         delta;      // incremental time until TxTimer due
                                    u_short         quota;      // number of Clients sharing BW
                                    long            rewind;     // Multicast seq_no to replay from for this Client, or (-1)
                                    u_short         state;      // sub-state for this fd/stream
                                    u_short         diag;       // per-stream diagnostic mask
                                    qio             r;          // read parameters
                                    qio             w;          // write parameters
                                    rcvd            recv;       // recv Client list header
                                } Client;
            					

        2. poll_a,b,c

          • poll_a, poll_b & poll_c are used as driving arrays for the select() sys-call.
          • poll_a specifies active file descriptors.
          • poll_b specifies busy-write file descriptors.
          • poll_c specifies connected, cleared-Q, or completed-write file descriptors.

          Tx state on any stream is identified by the combination of poll_b & poll_c for that stream:

          State poll_bpoll_cchangedown change up
          Idle clr set Rx new msg -
          Proddedset set Tx selectedTx q empty
          Sendingset clr msg too bigend of msg sent
          Choked clr clr - choke timeout

          Only need to Prod if Idle; Must not Prod if Choked
          May re-Prod if Prodded or Sending without ill effects.
          Use the Prod() macro to achieve this:

                                  # define Prod( fd )                       \
                                      do                                    \
                                      {                                     \
                                          if ( FD_ISSET ( fd, &poll_c ))    \
                                              FD_SET ( fd, &poll_b );       \
                                      } while (0)
          				

        3. Q's
          • Most message Q's are linked lists, with new messages usually added at the tail pointer & messages removed from the head pointer.
          • All messages are read() / recvfrom()'ed into a buffer which has been added to the appropriate Q.
          • In some cases it is necessary to add messages at the head of the Q, in order to change the sequence of sending messages.

        4. MultQ
          • Multicast Q's are circular buffers for UDP multicast messages.
          • They are indexed by multicast message sequence number, to facilitate rapid access.
          • UB Server
            • Ubs MultQ has messages added sequentially from the Central Server which are required to be delivered to all connected Clients.
            • Ubs may need to access the messages stored in the MultQ in non-sequential order, if a WindBack is requested.
            • Ubs will (eventually) purge old messages from the MultQ (only when all connected Clients have acknowledged receipt of them).

          • UB Client
            • The MultQ is used to store incoming UDP multicast messages, which may arrive out-of-sequence.
            • Each Normal message is forwarded to the Client when there are no previous missing messages.
            • When a Sync message is the next to be removed from MultQ, the data source is switched to the TCP unicast stream.
            • When messages are moved to the TxLoc Q (for sending to the WEP), the slot they occupied in MultQ is released.

      7. Protocol
      8. The message data passed are TSI messages with the following abbreviated structure:

        		struct TSI_MASSAGE
        		{
        			u_short tsi_msg_no;         // normally known as a "message type"
        			u_short tsi_msg_len;        // length of message (including header)
        			u_short tsi_user_no;        // user number
        			u_char  tsi_system_id;      // Destination system id
        			u_long  tsi_inet_addr;      // inet address of recipient
        			...
        		};
        
        		Comms messages conform to the following structure:
        
        		typedef struct msg
        		{
        			u_short type;       // comms message type
        			u_short len;        // message length (including header)
        			u_short seqno;      // last-used m/c sequence number
        			u_char  msg[ 0 ];   // buffer for unmolested TSI message
        		} msg;
        		

        All messages are read in two chunks:

        1. Hdr: read header (6 bytes) & malloc a buffer.
        2. Msg: read remainder of message based on (length found in header) - 6.

        Message Types

        			0   *** Normal   Normal TSI data message.
        
        			1    ** Ctrl     UBS to all UBC's (CS up/down, New UBS);
        							 & UBC heartbeat to UBS.
        			2   * * Re-tran  Re-transmission request
        							 (or reply to one Client).
        			3   **  Sync     Mode change between unicast
        							 & multicast from CS.
        			4   *   SyncInit Initialise synchronisation at UBC.
        
        			Pro tut         // Protocol: UDP or TCP stream.
        			Fr: ssc         // From...
        			To: ccs         // To...
        		

        Server to Client

        			Multicast
        			0   Normal TSI message to all WEP's.
        			1   Control message to all UBC's.
        					HeartBeats
        					CS up/down.
        					New UBS.
        			3   Sync to indicate change mode to unicasts from UBS.
        
        			Unicast
        			0   Normal private TSI's from CS to this WEP.
        			2   Re-tran for this UBC only; no Sync/Ack required.
        			3   Sync to indicate change mode to multicasts from UBS.
        			4   SyncInit to start synchronising operations at Ubc
        		

        Client to Server

        			Unicast
        			0   Normal TSI message from WEP to CS.
        			2   Retran request for missed m/c message.
        		

        Heartbeats are sent by all connected Ubc's to Ubs at regular intervals (5s default). Hearts are prodded by HeartFd timeout. Heartbeat message is a Ctrl message with length of 6 (i.e. just the comms header). The latest m/c seq_no is included in the comms header.

        Ubs sends heartbeats, to keep every Ubc up-to-date. Ubs heartbeats are not strictly necessary for this protocol; however, they'll keep Tom happy.

        When Ubc detects a missed multicast message, it requests a Retran & sets a timer (RetranFd). If a WindBack is not noticed before the Retran timer matures, then another Retran is requested and the timer is again primed. This caters for the case of a retran not being received (cos the WAN was still poorly) and no further multicasts have been sent since.

        Unicast messages are self-sequenced by TCP. This completely achieves sequency for Ubc to Ubs direction. Also, any unicasts for a given Ubc will appear in sequence.

        To ensure the correct interleaving of m/c & u/c messages, unicast messages carry the last m/c sequence number.

        Retain TCPq messages until the appropriate m/c Sync has been established in RxRemQ.

        Ubs should send a m/c Sync message to indicate that next message from CS is a unicast. This prevents a Ubc from zooming past its unicast which TCP is having trouble pushing through.

        When CS next sends a multicast, the Ubs should indicate the mode change by sending a unicast Sync on all connected Ubc's first.

        This will take about 10ms of circuit bw (in parallel: any number of Ubc's). But it may take .5s to 3s to tell 2000 Ubc's (sequential TCP access to router via ethernet).

        Sync message is essentially like "Over" in a R/T call: i.e. when a change of mode is necessary a Sync message is sent on the old mode.

        On the TCP stream socket, unicasts following a Sync on TCP cannot arrive before the Sync. Thus we can guarantee one or more multicasts appear in the correct sequence between unicasts.

        On the other hand, embedded unicasts in multicasts are guaranteed sequency because the Sync on multicast is allocated a seq_no & the following multicasts are not forwarded to WEP until all previous messages have been received.

        Re-transmission request from Ubc to Ubs includes the seq_no after the missing message(s). This is sent in network byte order as a u_short in msg[ 0 ].

        28 byte UDP/IP header

    5. Supporting functions

      1. Non-blocking input/output
        1. Re-startable read/write routines
          • The sys-calls read() & write() on TCP protocol and on unix pipes can be interrupted by signals or scheduling.
          • Use is made of read_n() & write_n() to read or write "n" bytes of data respectively.
          • Essentially, this means that every unicast read/write operation is run as a thread which may be restarted.
          • The state of each thread is stored in the per-Client read/write parameters.

        2. Unix signals are blocked except within the select() sys-call.
          • This localises the region in which signals are allowed to have an effect on operations.
          • Signal handlers may set flags etc., secure in the knowledge that nothing much else is happening (so we don't need to use semaphores).

        3. sendto()
          • Sendto is used for UDP multicast output and is inherently non-blocking, provided the message to be sent is not too large ( < 8000 bytes is a good rule-of-thumb ).
          • If select() indicates that the multicast file descriptor is ready for write then the system should allow an uninterrupted send to complete.

        4. recvfrom()
          • Recvfrom is used for UDP multicast input and is inherently non-blocking.
          • We use a MSG_PEEK to look at the comms header (6 bytes) to establish the message length, then perform a normal recvfrom() to gather the whole message, once we've malloc()'ed the correct sized buffer.

      2. Disconnection

        • Disconnection is usually detected by read() sys call returning 0 bytes read, having been selected for read.
        • Also, during an asynchronous connect attempt, an ECONNREFUSED errno is returned (see: TCP 3-way handshake)

      3. Use of select()

        • The select() sys-call is used to field interrupt-level events from various device drivers in the application program.
        • When a device becomes "Ready to Read", the driver performs an up-call to any processes which have selected that device for reading.
        • When a device achieves "Ready to Send", the driver performs an up-call to any processes which have selected that device for writing.
        • Ubs/Ubc do not make use of select()'ing for exceptions.
        • The last parameter of the select() sys-call specifies a timeout.
          • If this structure pointer is NULL, then no timeout occurs, i.e. the select() blocks.
            Null timeout is not used in UB Server or Client.
          • If the time value is zero, then the select() polls and returns immediately.
            This can be used to see if we are really stuffed with work to do.
          • Usually, the time value is the next delta-time, i.e. the next timer list event.

      4. Timer list
        • The timer routines maintain a linked list of timers.
        • Each Client stucture member .delta specifies the difference in time from the previous timer in the list.
        • When adding or removing timers, this relationship must be maintained.
        • When a select() sys-call is made, the first time in the list is used as the timeout value.
        • On return from a select(), the time remaining should replace the value in the first element of the timer list.


  • Forward Strategy



  • Glossary