In-Memory Large Scale Data Management and Processing
COM2 Level 4
Executive Classroom, COM2-04-02
Abstract:
Growing main memory capacity has fueled the development of in-memory big data management and processing. By eliminating the disk I/O bottleneck, it is now possible to support ultra-low-latency data queries and interactive data analytics. However, in-memory systems are much more sensitive to other sources of overhead that do not matter in traditional I/O-bound, disk-based systems. Issues such as data volume, data consistency, and synchronization are also more challenging to handle in the in-memory environment. Hence, we are witnessing a revolution in the redesign of data management systems that exploit main memory as their primary data storage.
In this thesis, we first give a comprehensive overview of key memory-management technologies and a thorough study of a wide range of in-memory data management and processing systems, including both data storage systems and data processing frameworks, through a literature review and an experimental analysis. Based on our observations, we propose a unified in-memory big data management system -- MemepiC -- which integrates both online data query and data analytics functionality into a single system by efficiently utilizing the memory hierarchy and exploiting the emerging RDMA technique.
Furthermore, we analyze state-of-the-art approaches to extending the capacity of in-memory storage systems, conducting extensive experiments to study how each fine-grained component of the Caching/Anti-Caching process affects performance, prediction accuracy, and usability. Based on this analysis, we distill guidelines for designing an effective Caching/Anti-Caching approach, and design a general user-space virtual memory management mechanism (UVMM), which combines the efficiency of OS virtual memory management (VMM) in utilizing the hardware with the flexibility and rich application semantics available to a user-space design.
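To make the Anti-Caching idea concrete, the following is a minimal sketch of an LRU-style eviction loop over a toy in-memory store with a fixed resident capacity: when memory is full, the coldest tuple is marked as anti-cached (evicted to secondary storage) and faulted back in on the next access. All names (`store_access`, `evict_coldest`, the `Tuple` layout) are illustrative assumptions, not the thesis's actual design.

```c
/* Toy anti-caching sketch: a fixed-capacity in-memory store that
 * evicts the least-recently-used tuple when memory is full.
 * All names here are hypothetical, not from the thesis. */
#include <assert.h>
#include <stdio.h>

#define CAPACITY 3   /* resident tuples allowed in "memory" */

typedef struct {
    int  key;
    long last_access;   /* logical clock for LRU ordering */
    int  resident;      /* 1 = in memory, 0 = anti-cached to cold store */
} Tuple;

static Tuple store[16];
static int   nstore = 0;
static long  clock_ticks = 0;

/* Count tuples currently resident in memory. */
static int resident_count(void) {
    int n = 0;
    for (int i = 0; i < nstore; i++) n += store[i].resident;
    return n;
}

/* Evict the least-recently-used resident tuple (the anti-caching step). */
static void evict_coldest(void) {
    int victim = -1;
    for (int i = 0; i < nstore; i++)
        if (store[i].resident &&
            (victim < 0 || store[i].last_access < store[victim].last_access))
            victim = i;
    if (victim >= 0) store[victim].resident = 0; /* moved to cold store */
}

/* Access (or insert) a key, evicting cold data when memory is full. */
static void store_access(int key) {
    for (int i = 0; i < nstore; i++) {
        if (store[i].key == key) {
            if (!store[i].resident) {            /* fault cold tuple back in */
                if (resident_count() >= CAPACITY) evict_coldest();
                store[i].resident = 1;
            }
            store[i].last_access = ++clock_ticks;
            return;
        }
    }
    if (resident_count() >= CAPACITY) evict_coldest();
    store[nstore++] = (Tuple){ key, ++clock_ticks, 1 };
}
```

A real system would replace the `resident` flag with serialization to disk and, as studied in the thesis, the choice of eviction granularity and access tracking has a large effect on both performance and prediction accuracy.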
Finally, as emerging commodity network hardware (e.g., InfiniBand) equipped with high-performance communication mechanisms (e.g., RDMA) can now bridge the performance gap between network I/O and memory to a large extent, we introduce the Globally Addressable Memory (GAM), which abstracts the memory of a cluster of servers as a shared memory space, simplifying the programming model significantly without sacrificing much performance. We adopt the PGAS (partitioned global address space) addressing model and add another level of DRAM-resident cache to exploit data locality. To keep the data in GAM consistent, we design a distributed, directory-based cache coherence protocol built on RDMA. Our approaches are evaluated via extensive experiments, showing superior performance compared with existing systems.
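As a rough illustration of what a directory-based coherence protocol does, the sketch below simulates a single cache line shared by a few nodes: a read downgrades any modified copy (forcing a write-back) and shares the line, while a write invalidates all other copies before granting exclusive ownership. In GAM the invalidation and ownership transfers would travel over RDMA; here they are plain function calls, and all names are illustrative assumptions rather than the thesis's actual protocol.

```c
/* Toy directory-based cache-coherence simulation (MSI-style states)
 * for one cache line shared by NNODES servers. In a real RDMA-based
 * design the invalidations below would be network messages.
 * All names are hypothetical, not from the thesis. */
#include <assert.h>

#define NNODES 4

typedef enum { INVALID, SHARED, MODIFIED } State;

static State cache[NNODES];     /* per-node state of the cached copy */
static int   home_value = 0;    /* value at the line's home node */
static int   cached[NNODES];    /* per-node cached value */

/* Directory read: downgrade a modified copy (write-back), then share. */
static int dir_read(int node) {
    for (int i = 0; i < NNODES; i++) {
        if (cache[i] == MODIFIED) {
            home_value = cached[i];      /* write dirty data back home */
            cache[i] = SHARED;
        }
    }
    cache[node] = SHARED;
    cached[node] = home_value;
    return cached[node];
}

/* Directory write: invalidate all other copies, grant exclusive access. */
static void dir_write(int node, int value) {
    for (int i = 0; i < NNODES; i++)
        if (i != node) cache[i] = INVALID;   /* invalidation message */
    cache[node] = MODIFIED;
    cached[node] = value;
}
```

The point of the directory is that only the nodes recorded as holding a copy need to be contacted, which keeps coherence traffic proportional to actual sharing rather than to cluster size.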