검색 상세

Group-based Data Retrieval and Clustering on Spatial Datasets

Group-based Data Retrieval and Clustering on Spatial Datasets

초록/요약

This dissertation presents group-based data retrieval and clustering on spatial datasets. Spatial query processing methods are proposed in data retrieval domain and efficient spatial clustering algorithms are proposed in data analysis domain. Given two sets of data points R and S, a query point q, distance threshold δ, and cardinality threshold k, the NBN query retrieves a nearest point r (called the base-point) in R where more than k points in S are located within a distance δ. In this paper, we formally define a base-point and NBN problem. As the brute-force approach to this problem in massive datasets has large computational and I/O costs, we propose external memory processing techniques for the NBN queries. In particular, our proposed in-memory processing in memory buffer are used to minimize I/Os in the external memory algorithms. Furthermore, we devise an index, the neighborhood augmented grid (called NAG), to dramatically reduce the search space. A performance study is conducted on both synthetic and real datasets. Our experimental results show the efficiency of our proposed approach. The k-prototypes algorithm is one of the well-known algorithms for clustering both numeric data and categorical data. Despite this, however, clustering a large number of spatial objects with numeric and categorical attributes is still inefficient due to its high time cost. In this paper, we also propose a grid-based k-prototypes algorithm, call the GK-prototypes algorithm, which achieves high performance for clustering spatial objects. We adopt two pruning techniques in development of the GK-prototypes algorithm. The first technique uses both maximum and minimum distance between cluster centers and a cell, which can remove unnecessary distance calculation. The second technique utilizes spatial dependence that spatial data tend to be more similar as objects are closer. Each cell has a bitmap index that maintains categorical values of all objects in the same cell for each attribute. The bitmap index improves the performance in case that a categorical data has a skewed distribution. Our evaluation experiments show that our algorithm performs better than the existing k-prototypes algorithm.

more

목차

Chapter 1. Introduction 1
Chapter 2. Background 8
2.1. Spatial Query Processing 9
2.1.1. Nearest Neighbor Queries 9
2.1.2. kNN-Joins 10
2.1.3. Queries for Group Search 10
2.2. Spatial Data Mining 13
2.2.1. Clustering for Mixed Data 14
2.2.2. Grid-based Clustering 15
Chapter 3. Nearest Base-Neighbor Queries 18
3.1. Problem Definition 19
3.2. Incremental Algorithms on aR-tree 21
3.3. Neighborhood Augmented Grid 27
3.3.1. Design of the NAG 28
3.3.2. Determining the Base-Point in NAG 30
3.4. Two Phase Processing Algorithm 33
3.4.1. Minimal Candidate Distance and Minimal Candidate Regions 34
3.4.2. NBN Processing 37
3.5. Experiments 39
3.5.1. Experimental Environment 39
3.5.2. Experimental Results 41
Chapter 4. Grid-based K-Prototypes Algorithms 48
4.1. Preliminary 49
4.2. Cell Pruning based GK-Prototypes Algorithm 55
4.3. Bitmap Pruning based GK-Prototypes Algorithm 59
4.4. Experiments 63
4.4.1. Experimental Environment 63
4.4.2. Experimental Results 64
Chapter 5. Conclusions 73

more