PH.D DEFENCE - PUBLIC SEMINAR

Improving Network Diagnostics using Programmable Networks

Speaker
Mr Pravein Govindan Kannan
Advisor
Dr Chan Mun Choon, Professor, School of Computing


23 Mar 2020 Monday, 03:00 PM to 04:30 PM

Executive Classroom, COM2-04-02

Abstract:

Network diagnostics (monitoring, debugging and testing) in data centers has always been difficult. The problem is only getting more challenging with link speeds reaching 400 Gbps, the number of end-points crossing 100K and data center topologies getting more complex. Furthermore, with an increase in virtualized and diverse applications, network interactions have become more complicated. Recent studies have noted that network faults are extremely hard to diagnose due to their transient nature. Debugging such hard-to-diagnose issues requires a global snapshot of the network to understand and rectify the problems accurately. However, obtaining a consistent global state of the network is extremely difficult, with network metrics changing on the order of a few nanoseconds.

Recent advances in programmable networking have led to better control, management and programmability of networks in the control-plane as well as the data-plane. In this thesis, we study how network diagnostics of data-center networks can be enhanced by leveraging programmable networks. To enable consistent and fine-grained monitoring of network-wide events, precise time-synchronization in the network data-plane is essential. We design and implement a time-synchronization protocol, DPTP, which leverages the high-resolution clocks, stateful memory and flexible header parsing available in programmable switches to maintain the clock in the network data-plane of each data-center switch. The network acts as the master clock for the end-hosts, thus enabling time-synchronization within sub-RTT timescales. Our evaluation on a multi-switch testbed shows that DPTP can achieve median and 99th percentile synchronization errors of 19 ns and 47 ns between two switches four hops apart, in the presence of clock drifts and under heavy network load.

Leveraging the synchronized network data-plane clocks, we design and implement DejaVu, a framework to consistently record network-wide events at packet-level granularity to debug ephemeral events. We leverage the programmable switches' SRAM to provide temporal storage of packet recordings and enable network-wide offline debugging using SQL queries. Additionally, DejaVu provides a programming abstraction for network operators to change the configuration of the metrics to be collected along with the packet recordings. We evaluate DejaVu using a realistic topology with programmable switches, and achieve consistent ordering of packet records, allowing us to correlate events and find the root cause of various network faults without affecting line-rate traffic.

Finally, it is important to test and validate fixes, protocols and research ideas in a network environment with high fidelity and scale to prevent unexpected failures in the production network. However, the available emulators fail to address the fundamental need of test scenarios that require a diverse and scalable set of network topologies. To facilitate this, we propose a novel approach to embed arbitrary data center topologies on a substrate network of programmable switches using our network virtualization technique, called Bare-metal Network Virtualization (BNV). BNV leverages the control-plane APIs of programmable switches to build a network hypervisor. Our evaluations show that BNV can support various data center topologies with fewer switches and can facilitate building a high-fidelity, repeatable and isolated experimentation platform for data center network operators and networking research.

These systems demonstrate that it is possible to develop precise monitoring, debugging and testing frameworks to enhance network diagnostics using programmable networks.