Online Data Processing at Scale
COM2 Level 4
Executive Classroom, COM2-04-02
closeAbstract
The ubiquitous online data bring opportunities of insightful analytics, as well as the challenges of system processing at large scale. Online data are generally processed in two fundamental fashions: stream processing to meet the needs of real-time application, and transaction processing to address the coordination of short atomic computations. In this thesis, we look into the problems of scalable online data processing in distributed database systems, and specifically focus on the design and optimization of the key operators and protocols in real-time OLAP (online analytical processing) and OLTP (online transaction processing). In particular, we investigate the techniques for scalable distributed stream join processing, efficient distributed transaction management, and adaptive and speculative concurrency control protocol.
First, efficient and scalable stream joins play an important role in performing realtime analytics for many cloud applications. However, like in conventional database processing, online theta-joins over data streams are computationally expensive and moreover, being memory-based processing, they impose high memory requirement on the system. To improve the memory efficiency and scalability of distributed stream join processing, we propose a novel stream join model, called join-biclique, which organizes a large cluster as a complete bipartite graph. Join-biclique benefits parallel stream join processing in a distributed setting, and meanwhile suppresses memory consumption for the maintenance of join state.
Second, shared-nothing architecture has been widely used in distributed databases to achieve good scalability. While it offers superior performance for local transactions, the overhead of processing distributed transactions can degrade the system performance significantly. The key contributor to the degradation is the expensive two-phase commit (2PC) protocol used to ensure atomic commitment of distributed transactions. To address this issue , we propose a transaction management scheme called LEAP to avoid the 2PC protocol within distributed transaction processing. Instead of processing a distributed transaction across multiple nodes, LEAP converts the distributed transaction into a local transaction. This benefits the processing locality and facilitates adaptive data repartitioning when there is a change in data access pattern.
Third, existing concurrency control protocols for OLTP perform remarkably well under specific workloads or access patterns that they have been designed for. However, they often do not scale well when the workload is dynamic. To tackle the challenge of dynamic workloads, we propose an Adaptive and Speculative Optimistic Concurrency Control protocol (ASOCC) for effective transaction processing. Based on the real-time monitoring of data access frequency, ASOCC adaptively embeds two-phase locking (2PL) into the OCC scheme to facilitate superior contention resolution with reduced transaction aborts. Further, ASOCC dynamically inspects the correlation of data accesses and exploits such information to perform speculative transaction restart to save the CPU cycles wasted on the processing of transactions that are destined to abort.