Webin data streamer

The Webin data streamer (webin-data-streamer-UDT) is a UDP-based streaming data transmission protocol. It is based on the UDT-Java (Sourceforge) project. The webin-data-streamer-UDT project is available here: https://github.com/enasequence/webin-data-streamer-UDT. The webin-data-streamer-UDT has been developed with support from the BASIS grant from the European Commission.

The protocol provides UDT Sockets as connection endpoints between two machines, similar in usage to standard TCP sockets. Each socket contains an independent Sender and Receiver, allowing for full-duplex operation of a connection. High-speed data transfer, however, is limited to one direction: UDT Server Socket Sender --> UDT Client Socket Receiver.

The goal of this Java implementation, inspired by UDT, is to sustain a consistently high data transfer rate, regardless of how quickly or frequently control messages are exchanged. It is also adapted to work with very large files.

Application details

Upon connection setup, client and server exchange information about the size of the UDP DataGrams to be used, and about the initial packet sequence number. This number is used to assemble the complete file on the receiver's side.

Files to be transferred are written in sequence to the UDT Server Socket's output stream. Once all data has been written, the socket is closed. The sender reads data from the stream until there is enough to fill a UDP DataGram of the agreed-upon size, and then sends the packet. Every packet sent is also stored in a file-backed buffer structure, in case it needs to be retransmitted.
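The packetizing step described above can be sketched as follows. This is a minimal illustration, not the project's code: the class name Packetizer, the DataPacket type, and the packetize method are hypothetical, and the sketch collects packets in a list instead of sending them or writing them to a file-backed buffer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: names are illustrative, not taken from webin-data-streamer-UDT.
final class Packetizer {

    /** One data packet: a sequence number plus its payload bytes. */
    static final class DataPacket {
        final long sequenceNumber;
        final byte[] payload;

        DataPacket(long sequenceNumber, byte[] payload) {
            this.sequenceNumber = sequenceNumber;
            this.payload = payload;
        }
    }

    /**
     * Reads the stream until a full payload is available, emits a packet,
     * and repeats. The final packet may be shorter than payloadSize.
     */
    static List<DataPacket> packetize(InputStream in, int payloadSize, long initialSeq) {
        List<DataPacket> packets = new ArrayList<>();
        byte[] buffer = new byte[payloadSize];
        long seq = initialSeq;
        int filled = 0;
        try {
            int n;
            while ((n = in.read(buffer, filled, payloadSize - filled)) != -1) {
                filled += n;
                if (filled == payloadSize) {
                    // Enough data for one datagram: emit it and start over.
                    packets.add(new DataPacket(seq++, buffer.clone()));
                    filled = 0;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (filled > 0) {
            // End of stream: emit whatever is left as a short final packet.
            packets.add(new DataPacket(seq, Arrays.copyOf(buffer, filled)));
        }
        return packets;
    }
}
```

With a 10-byte input, a payload size of 4 and an initial sequence number of 100, this yields three packets numbered 100 to 102, the last one carrying 2 bytes.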

On the receiver side, packets are received as UDP DataGrams and converted into UDT packets. UDT Control packets are placed in a control queue; UDT Data packets are placed in a separate data priority queue. If a data packet has already been received, or if it is already stored in the data priority queue, it is discarded. The packet sequence number is recorded, and every 137 packets a 'Summary ACK' control packet is sent back to the Sender. The packet sequence number is also used to place the data at the correct location in a file-backed receiver buffer. The receiver buffer is broken into segments; as soon as one segment is complete, it is streamed out in sequence to the Receiver's input stream, from where it is read by the receiving program.
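The duplicate check and the 'Summary ACK' interval can be illustrated with a small sketch. The class SummaryAckCounter and its methods are hypothetical, and the sketch only counts how many Summary ACKs would be due; it keeps every seen sequence number in memory, which a real receiver handling very large files would presumably avoid.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: names are illustrative, not taken from webin-data-streamer-UDT.
final class SummaryAckCounter {
    /** Interval stated in the text: one Summary ACK per 137 new packets. */
    static final int SAK_INTERVAL = 137;

    private final Set<Long> seen = new HashSet<>();
    private int newSinceLastSak = 0;
    private int saksDue = 0;

    /**
     * Records a data packet by sequence number.
     * Returns false if the packet is a duplicate and should be discarded.
     */
    boolean onDataPacket(long sequenceNumber) {
        if (!seen.add(sequenceNumber)) {
            return false; // already received: discard
        }
        newSinceLastSak++;
        if (newSinceLastSak == SAK_INTERVAL) {
            saksDue++; // a real receiver would send the control packet here
            newSinceLastSak = 0;
        }
        return true;
    }

    int saksDue() {
        return saksDue;
    }
}
```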

The goal of this architecture is to enable very large files to be streamed with moderate memory requirements. The use case is files which are transformed (e.g. encrypted) on-the-fly as they are streamed; in other words, data that exists at the point it is sent, but not before or afterwards, which limits the sender to a single stream. The protocol is designed to allow for a very large number of 'in-flight' packets, which places special requirements on the file-backed buffers.

The Sender Buffer is organized in pages, which are initially held in memory and filled sequentially. Once a page has been filled, a separate thread pushes it to a temporary disk file, and its file reference is held instead. Once the sender knows that all packets held in that page have been successfully received, the file is deleted.

The Receiver Buffer operates similarly, except that it is not filled sequentially. The Buffer is organized in pages that are held in memory. Each page is initialized with a packet sequence number and knows which packets are to be stored in it, and at which location in the buffer. Once all expected packets have been saved in the page, a separate thread pushes the data to disk and inserts the page number along with the file reference into a complete-pages priority queue. A separate thread monitors the complete-pages queue, and as soon as the next page in sequence is available, its content is streamed out directly from the file to the input stream of the receiver.
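A receiver page that accepts packets out of order might look like the sketch below. The class ReceiverPage and its fields are hypothetical. For simplicity the sketch assumes every packet in a page carries at most payloadSize bytes and leaves out the disk write and the complete-pages queue.

```java
import java.util.BitSet;

// Hypothetical sketch: names are illustrative, not taken from webin-data-streamer-UDT.
final class ReceiverPage {
    private final long firstSeq;       // sequence number of the page's first packet
    private final int packetsPerPage;
    private final int payloadSize;
    private final byte[] data;
    private final BitSet received;     // which slots have been filled

    ReceiverPage(long firstSeq, int packetsPerPage, int payloadSize) {
        this.firstSeq = firstSeq;
        this.packetsPerPage = packetsPerPage;
        this.payloadSize = payloadSize;
        this.data = new byte[packetsPerPage * payloadSize];
        this.received = new BitSet(packetsPerPage);
    }

    /** True if this page is responsible for the given sequence number. */
    boolean accepts(long seq) {
        return seq >= firstSeq && seq < firstSeq + packetsPerPage;
    }

    /** Stores a payload at the offset derived from its sequence number. */
    boolean store(long seq, byte[] payload) {
        if (!accepts(seq) || payload.length > payloadSize) {
            throw new IllegalArgumentException("packet does not fit this page: " + seq);
        }
        int slot = (int) (seq - firstSeq);
        if (received.get(slot)) {
            return false; // duplicate
        }
        System.arraycopy(payload, 0, data, slot * payloadSize, payload.length);
        received.set(slot);
        return true;
    }

    /** True once every expected packet has been stored. */
    boolean isComplete() {
        return received.cardinality() == packetsPerPage;
    }

    byte[] data() {
        return data;
    }
}
```

Because the offset is computed from the sequence number alone, arrival order does not matter; a completed page could then be handed to the thread that writes it to disk.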

The Sender keeps track of all packets sent, and of all packets acknowledged as received. The Receiver keeps track of all packets received, and of packets thought to be missing. Missing packets are determined by gaps in the packet sequence numbers. Missing packets are not immediately reported to the Sender, the idea being that many packets recorded as missing will still arrive over time.
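Gap-based detection of missing packets can be sketched as below. The class GapTracker is hypothetical; it assumes sequence numbers only increase and ignores wrap-around, which a real implementation would have to handle.

```java
import java.util.Collections;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch: names are illustrative, not taken from webin-data-streamer-UDT.
final class GapTracker {
    private final TreeSet<Long> missing = new TreeSet<>();
    private long highestSeen;

    GapTracker(long initialSeq) {
        this.highestSeen = initialSeq - 1;
    }

    /** Records an arriving packet and updates the set of missing numbers. */
    void onPacket(long seq) {
        if (seq > highestSeen) {
            // Everything skipped between the previous maximum and this packet is a gap.
            for (long s = highestSeen + 1; s < seq; s++) {
                missing.add(s);
            }
            highestSeen = seq;
        } else {
            // A late packet fills a gap that was recorded earlier.
            missing.remove(seq);
        }
    }

    /** Sequence numbers currently thought to be missing, in ascending order. */
    SortedSet<Long> missing() {
        return Collections.unmodifiableSortedSet(missing);
    }
}
```

Receiving 0, 1 and 4 records 2 and 3 as missing; if 3 then arrives late, only 2 remains, which is why delaying the report to the Sender can save retransmissions.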

The Receiver sends 'SAK' (Summary Acknowledgement) packets every 137 new packets. Based on a timer, the Receiver sends out 'NAK' packets (each containing up to 280 packet sequence numbers). Also based on a schedule, the Receiver sends out the largest packet sequence number prior to which all sequential packets have been received. Overall this provides a level of redundancy in case control packets are lost.
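The 280-entry limit per NAK packet suggests that a longer list of missing sequence numbers has to be spread over several NAK packets. The sketch below shows one way to do that; the class NakBatcher is hypothetical, and whether the actual Receiver reports the whole list on each timer tick is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names are illustrative, not taken from webin-data-streamer-UDT.
final class NakBatcher {
    /** Limit stated in the text: up to 280 sequence numbers per NAK packet. */
    static final int MAX_PER_NAK = 280;

    /** Splits the missing list into consecutive batches of at most MAX_PER_NAK entries. */
    static List<List<Long>> batches(List<Long> missing) {
        List<List<Long>> result = new ArrayList<>();
        for (int start = 0; start < missing.size(); start += MAX_PER_NAK) {
            int end = Math.min(start + MAX_PER_NAK, missing.size());
            // Copy the slice so each batch is independent of the input list.
            result.add(new ArrayList<>(missing.subList(start, end)));
        }
        return result;
    }
}
```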

The Sender keeps separate lists of sent packets and of packets thought to be missing on the Receiver (based on received NAK packets). As long as there are new packets available to be sent, the missing-packets list is not acted upon. If the sender has called .flush() on the output stream, or the end of the file is reached, the Sender begins re-sending the packets on the missing list. If no new packets are ready to be sent and no missing packets are recorded, then all packets that have been listed as sent but not yet acknowledged as received are inserted into the missing-packets list. This provides a level of redundancy in case control packets are lost.
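The sending rules above can be condensed into a small decision routine. This sketch is one reading of the text, not the project's implementation: the class SendScheduler and its methods are hypothetical, sequence numbers are assumed to be non-negative so that -1 can mean "nothing to send", and timing (how often retransmission is attempted) is left out entirely.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeSet;

// Hypothetical sketch: names are illustrative, not taken from webin-data-streamer-UDT.
final class SendScheduler {
    private final Deque<Long> fresh = new ArrayDeque<>();       // new, never-sent packets
    private final TreeSet<Long> unacknowledged = new TreeSet<>(); // sent, not yet acknowledged
    private final TreeSet<Long> missing = new TreeSet<>();      // reported missing via NAK
    private boolean drainRequested = false;                     // flush() called or end of file

    void enqueueNew(long seq) { fresh.add(seq); }

    void onNak(long seq) {
        if (unacknowledged.contains(seq)) {
            missing.add(seq);
        }
    }

    void onAck(long seq) {
        unacknowledged.remove(seq);
        missing.remove(seq);
    }

    void flush() { drainRequested = true; }

    /** Returns the next sequence number to transmit, or -1 if there is nothing to send. */
    long next() {
        // Rule 1: new packets always take priority over the missing list.
        if (!fresh.isEmpty()) {
            long seq = fresh.poll();
            unacknowledged.add(seq);
            return seq;
        }
        // Rule 3: nothing new and nothing reported missing, so treat all
        // unacknowledged packets as missing (covers lost control packets).
        if (missing.isEmpty()) {
            missing.addAll(unacknowledged);
        }
        // Rule 2: retransmit only after flush() or end of file.
        if (drainRequested && !missing.isEmpty()) {
            return missing.pollFirst();
        }
        return -1;
    }
}
```

A retransmitted packet stays in the unacknowledged set until it is acknowledged, so it becomes eligible again if that acknowledgement never arrives.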
