Xiaowei Xu, Associate Professor
Information Science Department
University of Arkansas at Little Rock
July 1998: Ph.D. (Dr. rer. nat.) in computer science, University of Munich (LMU), Germany
Dec. 1987: M.Sc. in computer science, Shenyang Institute for Computing Technology, Chinese Academy of Sciences, P. R. China
July 1983: B.Sc. in computer science, Nankai University, Tianjin, P. R. China
2002 - present: Associate Professor, Information Science Department, University of Arkansas at Little Rock
1998 - 2002: Senior Research Scientist, Siemens AG, Corporate Technology, Information and Communications
1993 - 1998: Teaching and Research Assistant, University of Munich (LMU), Department of Computer Science, Database and Information Systems Unit (Prof. Dr. H.-P. Kriegel)
1992 - 1993: Visiting Scholar, University of Hildesheim, Institute of Operating Systems and Distributed Computing (Prof. Dr. G. Stiege): research project on load-balancing algorithms for distributed computer systems
1988 - 1992: Associate Research Scientist, Chinese Academy of Sciences, Shenyang Institute for Computing Technology: design and implementation of an operating system, awarded first prize in "The Progress of Science and Technology", one of the highest research prizes in P. R. China
1983 - 1985: Assistant Research Scientist, Chinese Academy of Sciences, Shenyang Institute for Computing Technology: design and implementation of a C compiler
With my students and colleagues, I developed a set of scalable data mining methods for the (semi-)automatic extraction and analysis of "patterns" from spatial databases as well as from web logs and customer databases. The specific topics to which I have been able to contribute can be broadly categorized into the following areas:
DBSCAN [22] is a density-based clustering method designed to detect clusters of arbitrary shape and to distinguish noise in spatial and multi-dimensional databases. Technically, the algorithm is based on region queries, which can be supported efficiently by spatial index structures such as R-trees (at least if the dimensionality of the data space is not too high). PDBSCAN [3] is a parallel version of DBSCAN for the "shared-nothing" architecture, in which multiple computers are interconnected through a network. PDBSCAN offers linear speedup and has excellent scaleup and sizeup behavior. For clustering in dynamic databases (i.e., when the database changes through insertions and deletions over time), an efficient incremental version of DBSCAN was developed [18]. Determining "natural" parameters for a density-based clustering of a data set may be difficult. This problem is solved by the clustering algorithm DBCLASD, which is based on the assumption that the points inside a cluster are uniformly distributed [19]. BRIDGE [13] efficiently merges K-means and DBSCAN, exploiting the advantages of each to counter the limitations of the other. One problem with DBSCAN is its tendency to merge slightly connected clusters. This problem is addressed by the RDBC algorithm [2].
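The density-based idea can be sketched as follows. This is a minimal, linear-scan version for illustration only: the published algorithm answers the region queries through a spatial index such as an R-tree, and the parameter names `eps` and `min_pts` follow common convention rather than the paper's notation.

```python
from math import dist

def region_query(points, i, eps):
    # Linear-scan region query; in practice an R-tree serves this efficiently.
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # not a core point; may be relabeled later
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                   # grow the cluster from core points
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)
        cluster += 1
    return labels
```

Run on two dense groups of points plus one far-away outlier, the outlier keeps the noise label `-1` while each group gets its own cluster id, illustrating the arbitrary-shape and noise-handling properties described above.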
Collaborative filtering uses a database of consumers' preferences to make personalized product recommendations and is achieving widespread success in e-commerce. However, traditional collaborative filtering algorithms do not scale well to the ever-growing number of consumers, and the quality of the recommendations also needs to improve in order to gain more trust from consumers. To improve both efficiency and accuracy, feature weighting and instance selection were studied from a unified information-theoretic perspective [1]. Two feature-weighting methods for improving the accuracy of collaborative filtering algorithms were proposed in [12]. Furthermore, we introduced an information-theoretic approach to measure the relevance of a consumer (instance) to a given product (target concept) and proposed to reduce the training data set by selecting only highly relevant instances [11]. A further significant performance improvement can be achieved by data reduction techniques [10].
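To illustrate the feature-weighting idea, the sketch below weights items by a simple IDF-style measure before computing user similarity. This is only a stand-in for the information-theoretic weights studied in the papers, and all names and ratings are invented:

```python
from math import log, sqrt

def inverse_user_frequency(ratings, n_users):
    # IDF-style weights: items rated by fewer users say more about taste.
    # (The +1 keeps weights positive even for items everyone has rated.)
    counts = {}
    for user_ratings in ratings.values():
        for item in user_ratings:
            counts[item] = counts.get(item, 0) + 1
    return {item: log(1 + n_users / c) for item, c in counts.items()}

def weighted_cosine(u, v, w):
    # Cosine similarity over the items two users rated in common,
    # with each item scaled by its feature weight.
    common = set(u) & set(v)
    num = sum(w[i] * u[i] * v[i] for i in common)
    du = sqrt(sum(w[i] * u[i] ** 2 for i in common))
    dv = sqrt(sum(w[i] * v[i] ** 2 for i in common))
    return num / (du * dv) if du and dv else 0.0

def predict(ratings, user, item, weights):
    # Similarity-weighted average of the ratings other users gave the item.
    num = den = 0.0
    for other, r in ratings.items():
        if other != user and item in r:
            s = weighted_cosine(ratings[user], r, weights)
            num += s * r[item]
            den += abs(s)
    return num / den if den else None
```

A user whose tastes match another user's closely then pulls the prediction toward that user's rating, which is the behavior the weighting is meant to sharpen.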
The effectiveness of spatial clustering algorithms is somewhat limited because they do not fully exploit the richness of the different types of data contained in a spatial database. We proposed the concept of density-connected sets and presented GDBSCAN, a significantly generalized version of DBSCAN ([4] and [20]). The major properties of this algorithm are as follows: (1) any symmetric predicate can be used to define the neighborhood of an object, allowing a natural definition in the case of spatially extended objects such as polygons, and (2) the cardinality function for a set of neighboring objects may take into account the non-spatial attributes of the objects as a means of assigning application-specific weights. Density-connected sets can be used as a basis for discovering trends in a spatial database [20]. We defined trends in spatial databases and showed how to apply the GDBSCAN algorithm to the task of discovering such knowledge. An application of this technique in the area of economic geography can be found in [20].
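The two generalizations can be sketched by parameterizing the core-point test with a neighborhood predicate and a weighted-cardinality function. The function and attribute names below are illustrative, not taken from the papers:

```python
from math import dist

def neighborhood(objects, i, npred):
    # Property (1): any symmetric predicate may define the neighborhood,
    # e.g. intersection of polygons for spatially extended objects.
    return [j for j, o in enumerate(objects) if npred(objects[i], o)]

def is_core(objects, i, npred, wcard, min_card):
    # Property (2): wcard generalizes plain counting; here it can weight
    # each neighbor by a non-spatial attribute instead of using |N|.
    return wcard(objects[j] for j in neighborhood(objects, i, npred)) >= min_card
```

With a distance predicate and a weight-summing cardinality function, a lightly populated object can fail the core condition even when a plain count would pass it, which is exactly the extra expressiveness GDBSCAN adds.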
A great challenge for web site designers is how to ensure users' efficient access to important web pages. We developed a clustering-based approach to address this problem [2]. Our approach is to perform efficient and effective correlation analysis based on web logs and to construct clusters of web pages that reflect the co-visit behavior of web site users. We presented a novel approach for adapting DBSCAN to the problem domain of web page clustering [14] and showed that our new methods can generate high-quality clusters for very large web logs where previous methods fail. Based on the high-quality clustering results, we then applied the mined clusters to the problem of adapting web interfaces to improve users' performance. We developed an automatic method for web-interface adaptation that introduces index pages minimizing overall user browsing costs [2]. The index pages provide shortcuts that bring users to their target web pages quickly, and we solved the previously open problem of determining an optimal number of index pages. Experiments on several realistic web-log files showed empirically that our approach outperforms many previous algorithms.
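A co-visit similarity of the kind such an analysis starts from might be computed as sketched below: a Jaccard-style measure over user sessions. The page names are invented, and the published method's actual similarity definition may differ:

```python
from collections import Counter
from itertools import combinations

def visit_counts(sessions):
    # Number of sessions in which each page appears.
    return Counter(p for s in sessions for p in set(s))

def covisit_counts(sessions):
    # Number of sessions in which each pair of pages appears together.
    counts = Counter()
    for s in sessions:
        for a, b in combinations(sorted(set(s)), 2):
            counts[(a, b)] += 1
    return counts

def covisit_similarity(counts, visits, a, b):
    # Jaccard-style similarity: sessions containing both pages divided by
    # sessions containing at least one of them.
    a, b = min(a, b), max(a, b)
    both = counts.get((a, b), 0)
    either = visits[a] + visits[b] - both
    return both / either if either else 0.0
```

Pages that users tend to visit in the same session score high and end up in the same cluster; a density-based algorithm can then run on top of such a similarity.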
Proteins play an important role in every living organism, acting in all fundamental processes of life such as digestion, metabolism, and the immune system. Proteins function by interacting with other molecules, a process called docking. An important heuristic for the prediction of molecular interaction is the "key-and-lock" principle: the docking sites of partner molecules exhibit strong complementarity, especially in their geometry. Many docking sites may be determined solely by this geometric complementarity; thus, the docking problem may be transformed into a search problem for complementary surface segments. In the BIOWEPRO project [26], funded by the German Ministry for Education, Science, Research, and Technology (BMBF), we developed new database techniques to effectively and efficiently support 1:n docking prediction for proteins [25]. Our approach includes new representation and storage methods for molecular surfaces as well as new methods for similarity query processing on 3D surface segments with respect to shape similarity. Selecting segments from the database that have a similar (or complementary) 3D shape yields a set of potential docking candidates for the query protein. Following our segmentation approach, we computed the molecular surface and extracted potential docking segments for all proteins in our database. For each segment, various shape representations are computed that are appropriate to support a complementarity search in the database.
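At its core, such a similarity query can be sketched as a nearest-neighbor search over shape descriptors. This is illustrative only: the real system uses specialized surface representations with index support, and a complementarity search would match against complementary rather than identical descriptors:

```python
from math import sqrt

def euclidean(u, v):
    # Distance between two shape-descriptor vectors.
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def docking_candidates(segments, query_shape, k):
    # Rank stored surface segments by descriptor distance to the query
    # segment; the k closest are returned as potential docking candidates.
    return sorted(segments, key=lambda s: euclidean(s["shape"], query_shape))[:k]
```

The quality of the candidate set then depends entirely on how well the descriptors capture 3D shape, which is why the project invested in multiple shape representations per segment.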
We developed a new technique for multidimensional query processing that can be widely applied in database systems [15]. The new technique, called tree striping, generalizes the well-known inverted-lists and multidimensional indexing approaches. A theoretical analysis of this generalized technique shows that both inverted lists and multidimensional indexes are far from optimal. A consequence of the analysis is that using a set of multidimensional indexes provides considerable improvements over one d-dimensional index (multidimensional indexing) or d one-dimensional indexes (inverted lists). The basic idea of tree striping is to use the optimal number k of lower-dimensional indexes, determined by the theoretical analysis, for efficient query processing. We confirmed our theoretical results with an experimental evaluation on large amounts of real and synthetic data. The results show a speed-up of up to 310% over the multidimensional indexing approach and a speed-up factor of up to 123 (12,300%) over the inverted-lists approach.
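The query-processing idea can be sketched as follows, with each stripe's index simulated by a scan. The choice of k and the real index structures follow the paper's analysis; this is only an illustration of why intersecting k lower-dimensional results yields the exact d-dimensional answer:

```python
def stripe(dims, k):
    # Partition the d dimensions into k stripes of (nearly) equal width,
    # e.g. d=4, k=2 -> [(0, 1), (2, 3)].
    d = len(dims)
    size = -(-d // k)  # ceiling division
    return [tuple(dims[i:i + size]) for i in range(0, d, size)]

def range_query(points, ranges, stripes):
    # Each stripe's index answers the sub-query over its own dimensions
    # (simulated here by a scan); intersecting the candidate sets gives
    # the exact answer because the stripes together cover all d dimensions.
    result = None
    for s in stripes:
        cand = {i for i, p in enumerate(points)
                if all(ranges[d][0] <= p[d] <= ranges[d][1] for d in s)}
        result = cand if result is None else result & cand
    return result
```

With k = 1 this degenerates to one d-dimensional index and with k = d to inverted lists, which is exactly the sense in which tree striping generalizes both.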
The areas in which I am particularly interested in making further progress can be roughly categorized as follows:
Unstructured text databases are common in many manufacturing and service business operations; automobile service reports, insurance claim descriptions, and medical records are some examples. Over time, such databases continue to grow and become a huge and unwieldy source of information. This information can be used to make business operations more efficient and to avoid unnecessary expenses. For example, an automobile manufacturer may have a database of the customer service records produced by its dealers; such information may be used to make decisions about future research and development directions based on the problems reported for a product, and it is also very valuable for marketing decisions. However, data mining in large text/hypertext databases is especially challenging because of the extremely high-dimensional data and distributed storage. I am currently working on a new hierarchical clustering algorithm to construct a concept hierarchy automatically from a large text corpus [9].
Spatial and temporal data mining is the non-trivial extraction of implicit, potentially useful, and novel knowledge with an implicit or explicit spatio-temporal content from large spatio-temporal databases. Spatio-temporal data mining is a very promising subfield of data mining because increasingly large volumes of spatio-temporal data are being collected and need to be analyzed. It is also a challenging research area because spatio-temporal data and knowledge are much more complex than non-spatial and non-temporal data. I am working on spatio-temporal data mining methods for personalized, location-dependent information filtering. An information filtering algorithm that exploits the content of the information, its usage, and its location and time will be developed for mobile business.
I am also very interested in the area of bioinformatics. Many data analysis tasks in biology can be approached from a data mining perspective. I have experience in the development of a database management system to support protein-protein docking prediction [25,26]. In the future, I want to work on the following problems: (1) clustering gene expression data of different tissues, which requires the development of clustering techniques for ultra-high-dimensional data (about 200,000 dimensions); (2) finding suppression relations between genes using the same gene expression database as in the first project; and (3) predicting the structure of proteins.