Xiaowei Xu, Associate Professor
Information Science Department
University of Arkansas at Little Rock
July 1998: Ph.D. (Dr. rer. nat.) in computer science, University of Munich (LMU), Germany
Dec. 1987: M.Sc. in computer science, Shenyang Institute for Computing Technology, Chinese Academy of Sciences, P. R. China
July 1983: B.Sc. in computer science, Nankai University, Tianjin, P. R. China
2002 - present: Associate Professor, Information Science Department, University of Arkansas at Little Rock
1998 - 2002: Senior Research Scientist, Siemens AG, Corporate Technology, Information and Communications
1993 - 1998: Teaching and Research Assistant, University of Munich (LMU), Department of Computer Science, Database and Information Systems Unit (Prof. Dr. H.-P. Kriegel)
1992 - 1993: Visiting Scholar, University of Hildesheim, Institute of Operating Systems and Distributed Computing (Prof. Dr. G. Stiege): research project on load-balancing algorithms for distributed computer systems
1988 - 1992: Associate Research Scientist, Chinese Academy of Sciences, Shenyang Institute for Computing Technology: design and implementation of an operating system, awarded first prize in "The Progress of Science and Technology", one of the highest research prizes in P. R. China
1983 - 1985: Assistant Research Scientist, Chinese Academy of Sciences, Shenyang Institute for Computing Technology: design and implementation of a C compiler
With my students and colleagues, I developed a set of scalable data mining methods for the (semi-)automatic extraction and analysis of "patterns" from spatial databases as well as from web logs and customer databases. The specific topics to which I have been able to contribute can be broadly categorized into the following areas:
DBSCAN [22] is a density-based clustering method designed to detect clusters of arbitrary shape and to distinguish noise in spatial and multi-dimensional databases. Technically, the algorithm is based on region queries, which can be supported efficiently by spatial index structures such as R-trees (at least if the dimensionality of the data space is not too high). PDBSCAN [3] is a parallel version of DBSCAN for the "shared-nothing" architecture, in which multiple computers are interconnected through a network. PDBSCAN offers linear speedup and has excellent scaleup and sizeup behavior. For clustering in dynamic databases (i.e., when the database changes through insertions and deletions over time), an efficient incremental version of DBSCAN was developed [18]. Determining "natural" parameters for a density-based clustering of a data set may be difficult. This problem is solved by the clustering algorithm DBCLASD, which is based on the assumption that the points inside a cluster are uniformly distributed [19]. BRIDGE [13] efficiently merges K-means and DBSCAN, exploiting the advantages of each to counter the limitations of the other. One problem with DBSCAN is its tendency to merge slightly connected clusters. This problem is addressed by the RDBC algorithm [2].
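The density-based idea can be sketched as follows. This is a minimal, linear-scan version for illustration only: the published algorithm answers the region queries through a spatial index such as an R-tree, and the parameter names `eps` and `min_pts` follow common convention rather than the paper's notation.

```python
from math import dist

def region_query(points, i, eps):
    # Linear-scan region query; in practice an R-tree serves this efficiently.
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # not a core point; may be relabeled later
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                   # grow the cluster from core points
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)
        cluster += 1
    return labels
```

Run on two dense groups of points plus one far-away outlier, the outlier keeps the noise label `-1` while each group gets its own cluster id, illustrating the arbitrary-shape and noise-handling properties described above.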
Collaborative filtering uses a database of consumers' preferences to make personalized product recommendations and is achieving widespread success in e-commerce. However, traditional collaborative filtering algorithms do not scale well to the ever-growing number of consumers, and the quality of the recommendations also needs to improve in order to gain more trust from consumers. To improve both efficiency and accuracy, feature weighting and instance selection were studied from a unified information-theoretic perspective [1]. Two feature-weighting methods for improving the accuracy of collaborative filtering algorithms were proposed in [12]. Furthermore, we introduced an information-theoretic approach to measure the relevance of a consumer (instance) to a given product (target concept) and proposed to reduce the training data set by selecting only highly relevant instances [11]. A further significant performance improvement can be achieved by data reduction techniques [10].
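To illustrate the feature-weighting idea, the sketch below weights items by a simple IDF-style measure before computing user similarity. This is only a stand-in for the information-theoretic weights studied in the papers, and all names and ratings are invented:

```python
from math import log, sqrt

def inverse_user_frequency(ratings, n_users):
    # IDF-style weights: items rated by fewer users say more about taste.
    # (The +1 keeps weights positive even for items everyone has rated.)
    counts = {}
    for user_ratings in ratings.values():
        for item in user_ratings:
            counts[item] = counts.get(item, 0) + 1
    return {item: log(1 + n_users / c) for item, c in counts.items()}

def weighted_cosine(u, v, w):
    # Cosine similarity over the items two users rated in common,
    # with each item scaled by its feature weight.
    common = set(u) & set(v)
    num = sum(w[i] * u[i] * v[i] for i in common)
    du = sqrt(sum(w[i] * u[i] ** 2 for i in common))
    dv = sqrt(sum(w[i] * v[i] ** 2 for i in common))
    return num / (du * dv) if du and dv else 0.0

def predict(ratings, user, item, weights):
    # Similarity-weighted average of the ratings other users gave the item.
    num = den = 0.0
    for other, r in ratings.items():
        if other != user and item in r:
            s = weighted_cosine(ratings[user], r, weights)
            num += s * r[item]
            den += abs(s)
    return num / den if den else None
```

A user whose tastes match another user's closely then pulls the prediction toward that user's rating, which is the behavior the weighting is meant to sharpen.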
The effectiveness of spatial clustering algorithms is somewhat limited because they do not fully exploit the richness of the different types of data contained in a spatial database. We proposed the concept of density-connected sets and presented GDBSCAN, a significantly generalized version of DBSCAN ([4] and [20]). The major properties of this algorithm are as follows: (1) any symmetric predicate can be used to define the neighborhood of an object, allowing a natural definition in the case of spatially extended objects such as polygons, and (2) the cardinality function for a set of neighboring objects may take into account the non-spatial attributes of the objects as a means of assigning application-specific weights. Density-connected sets can be used as a basis for discovering trends in a spatial database [20]. We defined trends in spatial databases and showed how to apply the GDBSCAN algorithm to the task of discovering such knowledge. An application of this technique in the area of economic geography can be found in [20].
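The two generalizations can be sketched by parameterizing the core-point test with a neighborhood predicate and a weighted-cardinality function. The function and attribute names below are illustrative, not taken from the papers:

```python
from math import dist

def neighborhood(objects, i, npred):
    # Property (1): any symmetric predicate may define the neighborhood,
    # e.g. intersection of polygons for spatially extended objects.
    return [j for j, o in enumerate(objects) if npred(objects[i], o)]

def is_core(objects, i, npred, wcard, min_card):
    # Property (2): wcard generalizes plain counting; here it can weight
    # each neighbor by a non-spatial attribute instead of using |N|.
    return wcard(objects[j] for j in neighborhood(objects, i, npred)) >= min_card
```

With a distance predicate and a weight-summing cardinality function, a lightly populated object can fail the core condition even when a plain count would pass it, which is exactly the extra expressiveness GDBSCAN adds.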
A great challenge for web site designers is how to ensure users' efficient access to important web pages. We developed a clustering-based approach to address this problem [2]. Our approach is to perform efficient and effective correlation analysis based on web logs and to construct clusters of web pages that reflect the co-visit behavior of web site users. We presented a novel approach for adapting DBSCAN to the problem domain of web page clustering [14] and showed that our new methods can generate high-quality clusters for very large web logs where previous methods fail. Based on the high-quality clustering results, we then applied the mined clusters to the problem of adapting web interfaces to improve users' performance. We developed an automatic method for web-interface adaptation that introduces index pages minimizing overall user browsing costs [2]. The index pages provide shortcuts that bring users to their target web pages quickly, and we solved the previously open problem of determining an optimal number of index pages. Experiments on several realistic web-log files showed empirically that our approach outperforms many previous algorithms.
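A co-visit similarity of the kind such an analysis starts from might be computed as sketched below: a Jaccard-style measure over user sessions. The page names are invented, and the published method's actual similarity definition may differ:

```python
from collections import Counter
from itertools import combinations

def visit_counts(sessions):
    # Number of sessions in which each page appears.
    return Counter(p for s in sessions for p in set(s))

def covisit_counts(sessions):
    # Number of sessions in which each pair of pages appears together.
    counts = Counter()
    for s in sessions:
        for a, b in combinations(sorted(set(s)), 2):
            counts[(a, b)] += 1
    return counts

def covisit_similarity(counts, visits, a, b):
    # Jaccard-style similarity: sessions containing both pages divided by
    # sessions containing at least one of them.
    a, b = min(a, b), max(a, b)
    both = counts.get((a, b), 0)
    either = visits[a] + visits[b] - both
    return both / either if either else 0.0
```

Pages that users tend to visit in the same session score high and end up in the same cluster; a density-based algorithm can then run on top of such a similarity.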
Proteins play an important role in every living organism, acting in all fundamental processes of life such as digestion, metabolism, and the immune system. Proteins function by interacting with other molecules, a process called docking. An important heuristic for the prediction of molecular interaction is the "key-and-lock" principle: the docking sites of partner molecules exhibit strong complementarity, especially in their geometry. Many docking sites may be determined solely by this geometric complementarity; thus, the docking problem may be transformed into a search problem for complementary surface segments. In the BIOWEPRO project [26], funded by the German Ministry for Education, Science, Research, and Technology (BMBF), we developed new database techniques to effectively and efficiently support 1:n docking prediction for proteins [25]. Our approach includes new representation and storage methods for molecular surfaces as well as new methods for similarity query processing on 3D surface segments with respect to shape similarity. Selecting segments from the database that have a similar (or complementary) 3D shape yields a set of potential docking candidates for the query protein. Following our segmentation approach, we computed the molecular surface and extracted potential docking segments for all proteins in our database. For each segment, various shape representations are computed that are appropriate to support a complementarity search in the database.
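At its core, such a similarity query can be sketched as a nearest-neighbor search over shape descriptors. This is illustrative only: the real system uses specialized surface representations with index support, and a complementarity search would match against complementary rather than identical descriptors:

```python
from math import sqrt

def euclidean(u, v):
    # Distance between two shape-descriptor vectors.
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def docking_candidates(segments, query_shape, k):
    # Rank stored surface segments by descriptor distance to the query
    # segment; the k closest are returned as potential docking candidates.
    return sorted(segments, key=lambda s: euclidean(s["shape"], query_shape))[:k]
```

The quality of the candidate set then depends entirely on how well the descriptors capture 3D shape, which is why the project invested in multiple shape representations per segment.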
We developed a new technique for multidimensional query processing that can be widely applied in database systems [15]. The new technique, called tree striping, generalizes the well-known inverted-lists and multidimensional indexing approaches. A theoretical analysis of this generalized technique shows that both inverted lists and multidimensional indexes are far from optimal. A consequence of the analysis is that using a set of multidimensional indexes provides considerable improvements over one d-dimensional index (multidimensional indexing) or d one-dimensional indexes (inverted lists). The basic idea of tree striping is to use the optimal number k of lower-dimensional indexes, determined by the theoretical analysis, for efficient query processing. We confirmed our theoretical results with an experimental evaluation on large amounts of real and synthetic data. The results show a speed-up of up to 310% over the multidimensional indexing approach and a speed-up factor of up to 123 (12,300%) over the inverted-lists approach.
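The query-processing idea can be sketched as follows, with each stripe's index simulated by a scan. The choice of k and the real index structures follow the paper's analysis; this is only an illustration of why intersecting k lower-dimensional results yields the exact d-dimensional answer:

```python
def stripe(dims, k):
    # Partition the d dimensions into k stripes of (nearly) equal width,
    # e.g. d=4, k=2 -> [(0, 1), (2, 3)].
    d = len(dims)
    size = -(-d // k)  # ceiling division
    return [tuple(dims[i:i + size]) for i in range(0, d, size)]

def range_query(points, ranges, stripes):
    # Each stripe's index answers the sub-query over its own dimensions
    # (simulated here by a scan); intersecting the candidate sets gives
    # the exact answer because the stripes together cover all d dimensions.
    result = None
    for s in stripes:
        cand = {i for i, p in enumerate(points)
                if all(ranges[d][0] <= p[d] <= ranges[d][1] for d in s)}
        result = cand if result is None else result & cand
    return result
```

With k = 1 this degenerates to one d-dimensional index and with k = d to inverted lists, which is exactly the sense in which tree striping generalizes both.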
The areas in which I am particularly interested in making further progress can be roughly categorized as follows:
Unstructured text databases are common in many manufacturing and service business operations; automobile service reports, insurance claim descriptions, and medical records are some examples. Over time, such databases continue to grow and become a huge and unwieldy source of information. This information can be used to make business operations more efficient and to avoid unnecessary expenses. For example, an automobile manufacturer may have a database of the customer service records produced by its dealers; such information may be used to make decisions about future research and development directions based on the problems reported for a product, and it is also very valuable for marketing decisions. However, data mining in large text/hypertext databases is especially challenging because of the extremely high-dimensional data and distributed storage. I am currently working on a new hierarchical clustering algorithm to construct a concept hierarchy automatically from a large text corpus [9].
Spatial and temporal data mining is the non-trivial extraction of implicit, potentially useful, and novel knowledge with an implicit or explicit spatio-temporal content from large spatio-temporal databases. Spatio-temporal data mining is a very promising subfield of data mining because increasingly large volumes of spatio-temporal data are being collected and need to be analyzed. It is also a challenging research area because spatio-temporal data and knowledge are much more complex than non-spatial and non-temporal data. I am working on spatio-temporal data mining methods for personalized, location-dependent information filtering. An information filtering algorithm that exploits the content of the information, its usage, and its location and time will be developed for mobile business.
I am also very interested in the area of bioinformatics. Many data analysis tasks in biology can be approached from a data mining perspective. I have experience in the development of a database management system to support protein-protein docking prediction [25,26]. In the future, I want to work on the following problems: (1) clustering gene expression data of different tissues, which requires the development of clustering techniques for ultra-high-dimensional data (about 200,000 dimensions); (2) finding suppression relations between genes using the same gene expression database as in the first project; and (3) predicting the structure of proteins.