PH.D. DEFENCE - PUBLIC SEMINAR

Optimizing System Performance by using Non-volatile Memory

Speaker
Mr Chen Cheng
Advisor
Dr Wong Weng Fai, Associate Professor, School of Computing


02 Mar 2022 Wednesday, 10:00 AM to 11:30 AM

Zoom presentation

Abstract:

Traditional computer architecture, also known as the von Neumann architecture, loads data from slow persistent storage to the CPU for computation. To bridge the tremendous speed gap between the fast CPU and slow persistent storage devices, computers place volatile DRAM in between to speed up data access. However, data stored in DRAM is lost when a system failure happens. To protect data from being corrupted in a crash, the system has to frequently flush the latest updates from DRAM to slow persistent storage. Such synchronous operations harm system performance significantly. New types of memory, such as the Intel Optane DC Persistent Memory Module (PMem) [91] and the Non-Volatile Dual In-line Memory Module (NVDIMM) [158], not only offer DRAM-like performance and byte-addressability but are also persistent, just like HDDs and flash memory (flash-based SSDs). This next generation of non-volatile memory (NVM) gives us an opportunity to rethink and redesign the storage hierarchy. In this thesis, we first explore how to improve the performance of file system journaling using NVDIMM. Rather than simply replacing a slow file system journaling device (such as an HDD or SSD) with NVDIMM, we propose NV-Journaling. NV-Journaling introduces fine-grained commits along with a cache-friendly NVM journal layout, which significantly reduces checkpoint frequency and achieves better space utilization. To maximize the sequentiality of checkpoint I/O, NV-Journaling further reshapes the checkpoint I/O pattern using a locality-aware checkpointing process. Our experimental results show that NV-Journaling is up to 4.3 times faster than traditional journaling.
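
To make the commit ordering concrete, the following minimal C++ sketch illustrates one way a fine-grained commit to an NVM-resident journal can be made crash-consistent with cache-line flushes. The persist helper, the JournalRecord layout, and all names are our illustrative assumptions, not NV-Journaling's actual code.

    #include <immintrin.h>  // _mm_clwb, _mm_sfence (compile with -mclwb)
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    constexpr uintptr_t kCacheLine = 64;

    // Flush every cache line covering [addr, addr+len), then fence.
    static void persist(const void *addr, size_t len) {
        uintptr_t p = reinterpret_cast<uintptr_t>(addr) & ~(kCacheLine - 1);
        for (; p < reinterpret_cast<uintptr_t>(addr) + len; p += kCacheLine)
            _mm_clwb(reinterpret_cast<void *>(p));
        _mm_sfence();
    }

    struct JournalRecord {      // hypothetical on-NVM record layout
        uint64_t txn_id;
        uint32_t len;
        uint8_t  payload[48];
        uint8_t  committed;     // persisted last: serves as the commit mark
    };

    // Commit one record placed in an NVDIMM-mapped region. The slot is
    // assumed to be zeroed (committed == 0) before reuse.
    void commit(JournalRecord *rec, uint64_t txn, const void *data, uint32_t n) {
        assert(n <= sizeof(rec->payload));
        rec->txn_id = txn;
        rec->len = n;
        std::memcpy(rec->payload, data, n);
        persist(rec, offsetof(JournalRecord, committed)); // body durable first
        rec->committed = 1;                               // then the flag
        persist(&rec->committed, 1);  // record is valid only once this lands
    }

Because the commit mark is persisted strictly after the record body, a crash at any point leaves either no record or a complete one, with no separate journal device write on the critical path.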

After the file system, we then consider the next level of the storage hierarchy: in-memory databases for AI-powered On-line Decision Augmentation (OLDA). OLDA has been broadly applied in many applications, such as real-time fraud detection and personalized recommendation. As one of the most time-consuming operations in an OLDA data pipeline, on-line feature extraction requires extracting features from multiple time windows in real time. We started by studying how existing in-memory databases can be leveraged to efficiently support such real-time feature extraction. However, we found that existing in-memory databases take hundreds or even thousands of milliseconds, which is unacceptable for OLDA applications with strict real-time constraints. We therefore propose FEDB (Feature Engineering Database), a distributed in-memory database system designed to efficiently support on-line feature extraction. To overcome the three pain points of FEDB, namely huge memory consumption, long tail latency, and long recovery time, we propose a PMem-optimized persistent skiplist to make FEDB more cost-effective. Compared with DRAM-based FEDB using DRAM+SSD, PMem-based FEDB shortens the tail latency by up to 19.7%, reduces the recovery time by up to 99.7%, and saves up to 58.4% of the total cost of a real OLDA pipeline.
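
The minimal sketch below illustrates the crash-consistent insert ordering a PMem-resident skiplist can use: persist the new node before publishing its link, with the upper index levels kept in DRAM and rebuilt on recovery. The Node layout and all names are illustrative assumptions, not FEDB's actual implementation.

    #include <immintrin.h>  // _mm_clwb, _mm_sfence (compile with -mclwb)
    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    constexpr uintptr_t kCacheLine = 64;

    // Flush every cache line covering [addr, addr+len), then fence.
    static void persist(const void *addr, size_t len) {
        uintptr_t p = reinterpret_cast<uintptr_t>(addr) & ~(kCacheLine - 1);
        for (; p < reinterpret_cast<uintptr_t>(addr) + len; p += kCacheLine)
            _mm_clwb(reinterpret_cast<void *>(p));
        _mm_sfence();
    }

    struct Node {                  // hypothetical PMem-resident node
        uint64_t key;
        uint64_t value;
        std::atomic<Node *> next;  // bottom-level link, kept in PMem
    };

    // Insert node n after pred (pred located via the DRAM index levels).
    void insert_after(Node *pred, Node *n) {
        n->next.store(pred->next.load(std::memory_order_acquire),
                      std::memory_order_relaxed);
        persist(n, sizeof(*n));           // 1) node contents durable first
        pred->next.store(n, std::memory_order_release);
        persist(&pred->next, sizeof(n));  // 2) then the link publishing it
        // A crash between 1) and 2) leaks the node but never corrupts the
        // list; recovery rebuilds the DRAM index by walking this level.
    }

Keeping only the bottom level persistent is one plausible way to cut both PMem writes and recovery time: a single linked scan of PMem restores the index instead of replaying a log.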

As one of the important use cases of AI databases, we then explore how to use NVM in the training process of deep learning recommendation models (DLRM) [104]. DLRM has been a popular approach in recommendation systems for many large-scale e-commerce and online applications in data centers. Recently, we have witnessed rapid increases in both the model size and the number of features in DLRM, leading to models of terabytes in size. Such huge models pose significant challenges to parameter access efficiency and training reliability, and these challenges grow even larger for long-running DLRM training. In this work, we propose OpenEmbedding to address these challenges by taking advantage of emerging persistent memory (PMem). Compared to DRAM, PMem offers much lower per-GB cost, higher density, and non-volatility, at slightly lower access performance. With PMem, we develop an efficient training pipeline that captures the best of both worlds (PMem and DRAM). For reliability, we develop a lightweight batch-aware checkpointing scheme that is specially optimized for DLRM batch-based training on PMem. We further integrate OpenEmbedding with the TensorFlow/Keras framework for ease of use and make it open source. Our evaluations with a real-world workload of billions of parameters demonstrate 1) the effectiveness of our PMem-aware optimizations, and 2) that our checkpointing mechanism adds little runtime overhead to training performance.
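
The sketch below illustrates the general idea of batch-aware checkpointing: advance a durable batch marker only after the updates of a checkpoint interval have been flushed, so recovery resumes from an exact batch boundary. This is our illustration under assumed names (CheckpointHeader, persist), not OpenEmbedding's implementation.

    #include <immintrin.h>  // _mm_clwb, _mm_sfence (compile with -mclwb)
    #include <cstddef>
    #include <cstdint>

    constexpr uintptr_t kCacheLine = 64;

    // Flush every cache line covering [addr, addr+len), then fence.
    static void persist(const void *addr, size_t len) {
        uintptr_t p = reinterpret_cast<uintptr_t>(addr) & ~(kCacheLine - 1);
        for (; p < reinterpret_cast<uintptr_t>(addr) + len; p += kCacheLine)
            _mm_clwb(reinterpret_cast<void *>(p));
        _mm_sfence();
    }

    struct CheckpointHeader {        // hypothetical PMem-resident metadata
        uint64_t last_durable_batch;
    };

    // One training run with periodic, batch-aligned checkpoints.
    void train(CheckpointHeader *hdr, uint64_t num_batches, uint64_t interval) {
        for (uint64_t b = hdr->last_durable_batch + 1; b <= num_batches; ++b) {
            // ... forward/backward pass; embedding rows updated on PMem ...
            if (b % interval == 0) {
                // Flush the embedding rows dirtied since the last checkpoint
                // (omitted here), then advance the durable-batch marker.
                hdr->last_durable_batch = b;
                persist(&hdr->last_durable_batch, sizeof(uint64_t));
            }
        }
        // After a crash, training resumes from hdr->last_durable_batch + 1,
        // so at most 'interval' batches are replayed.
    }

Aligning the checkpoint to batch boundaries bounds the amount of replayed work after a failure without stalling training to serialize the full, terabyte-scale embedding table.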