PH.D DEFENCE - PUBLIC SEMINAR

Synthetically Scaling An Empirical Dataset

Speaker

Mr Zhang Jiangwei

Advisor

Dr Y.C. Tay, Emeritus Professor, School of Computing

03 Jul 2018 Tuesday, 02:00 PM to 03:30 PM

Abstract:

Large-scale enterprises, like Amazon and Douban, have enormous datasets. For research and development, it is impractical to run experiments with such a large dataset. It is therefore often necessary to obtain a smaller version of the dataset for experiments. We call this the scaling down problem.

At the other extreme, a start-up company may have a small dataset but want to test the scalability of its system. It may therefore want a larger (and necessarily synthetic) version of its current empirical dataset. We call this the scaling up problem.

This motivates the Dataset Scaling Problem (DSP):

"Given an original dataset D and a scale factor s, generate a scaled dataset D' that is similar to D but s times its size."

This thesis studies DSP in the domains of graph and relational databases. We address the following four questions:

1. How to generate a scaled graph that is similar to a given graph?
2. How to generate a scaled relational database that is similar to a given relational database?
3. How to generate a scaled graph/relational database in a distributed manner?
4. How to facilitate flexibility in the choice and enforcement of similarity properties?

COM2 Level 2