URI Machine Learning and Data Mining Group


Selected Past and Current Projects

Visualization of Support Vector Machines using Machine Learning

We are interested in developing an effective visualization of SVMs based on unsupervised learning with emergent self-organizing maps. The key insight is that SVM models consist of points in a high-dimensional space.

Protein Folding, Molecular Dynamics And Machine Learning

Energy minimization algorithms for biomolecular systems are essential to applications such as the prediction of protein folding. Conventional energy minimization methods (the steepest descent and conjugate gradient methods) are limited in the energy minima they can detect. Our research group at URI developed an energy minimization algorithm for biomolecular systems based on a genetic algorithm (GA). The GA-based energy minimization largely overcomes the drawback of conventional methods and significantly increases the probability of reaching deeper energy minima. Our approach differs from other GA-based approaches in that we do not use the genetic algorithm to compute molecular conformations directly; instead, it computes a set of parameters to be used in conjunction with the molecular dynamics simulation package GROMOS96. We have parallelized this algorithm with the Message Passing Interface (MPI). Tests have shown that the algorithm is very effective for energy minimization and also demonstrate how the conditions of a simulation are correlated with the discovery of an energy minimum.
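
The idea of evolving simulation parameters rather than conformations can be sketched as follows. This is a minimal illustration, not the group's implementation: `toy_final_energy` is a hypothetical stand-in for "run the molecular dynamics package with these parameters and report the lowest energy reached", and the population size, mutation rate, and operators are assumptions chosen only for the example.

```python
import math
import random


def toy_final_energy(params):
    """Hypothetical stand-in for an MD run scored by its final energy.

    A rugged function with many local minima; it is not a force field.
    """
    return sum(x * x - 10.0 * math.cos(2.0 * math.pi * x) + 10.0 for x in params)


def tournament(pop, rng):
    """Pick two distinct individuals at random and return the lower-energy one."""
    a, b = rng.sample(pop, 2)
    return a if toy_final_energy(a) < toy_final_energy(b) else b


def evolve(n_params=2, pop_size=30, generations=60, seed=0):
    """Evolve parameter vectors; returns the best vector and the best energy per generation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-5.0, 5.0) for _ in range(n_params)] for _ in range(pop_size)]
    best = min(pop, key=toy_final_energy)
    history = [toy_final_energy(best)]
    for _ in range(generations):
        children = [best[:]]  # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(pop, rng), tournament(pop, rng)
            # Uniform crossover: each gene comes from either parent.
            child = [g1 if rng.random() < 0.5 else g2 for g1, g2 in zip(p1, p2)]
            # Gaussian mutation applied to each gene with probability 0.2.
            child = [g + rng.gauss(0.0, 0.3) if rng.random() < 0.2 else g for g in child]
            children.append(child)
        pop = children
        best = min(pop, key=toy_final_energy)
        history.append(toy_final_energy(best))
    return best, history
```

Because each fitness evaluation is independent of the others, the evaluations within a generation are the natural unit to distribute across processes, which is presumably where a message-passing parallelization would apply.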

Text Mining MEDLINE For Automatic Document Assignment With Support Vector Machines And Non-Negative Matrix Factorization Algorithms

The world is fast becoming information intensive, with specialized information being collected into very large data sets. Specialized collections such as MEDLINE contain vast amounts of online text documents and grow and change rapidly. It is nearly impossible to organize such vast and rapidly evolving data manually. The need to extract useful and relevant information from such large data sets has created a demand for computationally efficient text mining algorithms. A prototypical problem is to automatically assign natural language text documents to predefined categories based on their content. Here we develop efficient text classification models within the Oracle Data Mining software that analyze the content of the title and abstract fields of MEDLINE database documents and automatically assign documents to different user-defined categories using support vector machines (SVM). The text pre-processing stage will use the “feature extractor” of the Oracle Data Mining software, which is based on the non-negative matrix factorization (NMF) algorithm, to reduce the huge dimensionality of text documents. To perform multi-label classification, a predictive binary SVM model will be constructed for each category.
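
The multi-label scheme, one binary SVM per category, can be sketched in plain Python. This is an illustration under stated assumptions, not the Oracle Data Mining workflow: the NMF feature-extraction step is skipped, documents are encoded as binary bag-of-words vectors, and `train_linear_svm` is a hypothetical helper that fits a linear SVM by batch subgradient descent on the regularized hinge loss.

```python
def tokenize(text):
    return text.lower().split()


def vectorize(text, vocab):
    """Binary bag-of-words vector over a fixed vocabulary."""
    tokens = set(tokenize(text))
    return [1.0 if term in tokens else 0.0 for term in vocab]


def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=1000):
    """Batch subgradient descent on the L2-regularized hinge loss.

    Returns the (weights, bias) pair with the lowest objective seen.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    best = (float("inf"), w[:], b)
    for _ in range(epochs):
        margins = [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
                   for xi, yi in zip(X, y)]
        objective = (0.5 * lam * sum(wj * wj for wj in w)
                     + sum(max(0.0, 1.0 - m) for m in margins) / n)
        if objective < best[0]:
            best = (objective, w[:], b)
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi, m in zip(X, y, margins):
            if m < 1.0:  # margin violated, so the hinge term is active
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return best[1], best[2]


def train_one_vs_rest(docs, label_sets, categories):
    """Fit one binary SVM per category: +1 if the document carries it, else -1."""
    vocab = sorted({t for doc in docs for t in tokenize(doc)})
    X = [vectorize(doc, vocab) for doc in docs]
    models = {}
    for cat in categories:
        y = [1.0 if cat in labels else -1.0 for labels in label_sets]
        models[cat] = train_linear_svm(X, y)
    return vocab, models


def predict_labels(text, vocab, models):
    """A document receives every category whose SVM scores it positively."""
    x = vectorize(text, vocab)
    return {cat for cat, (w, b) in models.items()
            if sum(wj * xj for wj, xj in zip(w, x)) + b > 0.0}
```

Since each category has its own independent classifier, a document can receive several labels, exactly one, or none at all.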

Protein Structure Analysis With Self-Organizing Maps

Establishing structure-function relationships on the proteomic scale is a unique challenge faced by bioinformatics and molecular biosciences. Large protein families present natural libraries of analogues of a given catalytic or protein function, making them ideal targets for the investigation of structure-function relationships in proteins by computer-aided techniques. To this end, we have developed a new tool for analyzing structure information in order to elucidate the structure-function relationship of proteins within protein families using unsupervised machine learning, particularly self-organizing maps. For local structure analysis, we start by extracting the essential local structures of proteins from the corresponding PDB files using a PERL editing filter. Subsequently, the local structures extracted from the PDB files are aligned with DS ViewerPro at tether points around a given functional center F, and the resulting alignments are encoded into normalized protein models. Finally, we map each normalized protein model into a high-dimensional feature space where each normalized model is represented by a bit vector. The structure of the feature space can be visualized with a 2-D self-organizing map that highlights structural similarities and differences between molecules using reference models on the map. The information captured by a self-organizing map and stored in its reference models highlights the essential structures of the mapped proteins and can be used effectively to study detailed structural differences and similarities among proteins. Preliminary results demonstrate that we can classify proteins and identify common and unique structures within a family, as well as identify common and unique structural features of different conformations of the same protein. Similar analysis has also been extended to complete protein structures.
In particular, non-existent features and common features are deleted by a newly created filter in order to reduce the dimension of the feature vectors for the SOM. Importantly, filtering out the non-existent and common features is a projection from a higher-dimensional subspace to a lower-dimensional subspace of the vector space that maintains the important structural features and makes the computation much more efficient. In conclusion, such structural pattern analysis may provide clues to interesting and difficult biochemical questions. The automatic nature of our approach, due to machine learning, enables us to scale it to study large subsets, if not whole families, of proteins.
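
The filtering step described above can be illustrated with a short sketch. It assumes that a "non-existent" feature is a bit that is 0 in every model and a "common" feature is a bit that is 1 in every model; the function name and return shape are hypothetical.

```python
def filter_features(bit_vectors):
    """Drop bit positions that are 0 in every model or 1 in every model.

    Returns the reduced vectors and the indices of the positions kept.
    """
    n_models = len(bit_vectors)
    n_features = len(bit_vectors[0])
    keep = [j for j in range(n_features)
            if 0 < sum(v[j] for v in bit_vectors) < n_models]
    reduced = [[v[j] for j in keep] for v in bit_vectors]
    return reduced, keep


def hamming(a, b):
    """Number of positions at which two bit vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)
```

A dropped position has the same value in every model, so it contributes nothing to the Hamming distance between any two models. Pairwise Hamming distances are therefore unchanged by the filter, which is one way to see why the projection keeps the information that distinguishes the models.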

Evolutionary Concept Learning in Equational Logic

The purpose of this project is to study the effectiveness of genetic algorithms for concept learning where the representation language is equational logic. Concept learning is a branch of machine learning concerned with discriminating and categorizing things based on positive and negative examples. Equational logic is a system of logic that emphasizes rewriting as its primary tool in proofs. Genetic algorithms are a popular general search mechanism for large and complex search spaces.

IVF Data Study

Infertility is a serious problem that affects 6.1 million people in the United States. The In Vitro Fertilization (IVF) procedure has increased the chances for childless couples to become parents. The goal of this study was to investigate the applicability of support vector machines to an IVF data set. Predictive models were constructed and their performance was compared with that of predictive models built with decision tree and neural network algorithms. In building predictive models, the algorithms exhibit a strong dependency on how the IVF data set is divided into training and test sets. To reduce this dependency, we used the bootstrap method to make the final comparisons between models built with different algorithms. The 96% confidence intervals constructed for the accuracies of 100 replicas allowed us to measure overall performance with less influence from the observations that were crucial for the predictive models. Each algorithm used its own capabilities to find the best parameters for constructing models on the bootstrap samples. Non-overlapping intervals indicate the statistical significance of the results. Taking into account the usability of the built models, we conclude that it is appropriate to use the decision tree model in the Women and Infant clinic to predict the pregnancy rate in patients. For this data, the advantage of higher accuracy does not outweigh the advantage of a transparent decision.
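
The bootstrap comparison can be sketched as follows. This is an illustration under assumptions rather than the study's procedure: each replica is scored on the observations left out of its bootstrap sample, the interval is a simple percentile interval, and `majority_baseline` is a hypothetical placeholder for a real learner such as an SVM or a decision tree.

```python
import math
import random


def bootstrap_accuracies(data, fit_and_score, n_replicas=100, seed=0):
    """Train on each bootstrap sample and score on the observations left out of it."""
    rng = random.Random(seed)
    n = len(data)
    accuracies = []
    for _ in range(n_replicas):
        picked = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        chosen = set(picked)
        train = [data[i] for i in picked]
        held_out = [data[i] for i in range(n) if i not in chosen]
        if held_out:  # skip the rare replica with nothing left to test on
            accuracies.append(fit_and_score(train, held_out))
    return accuracies


def percentile_interval(values, level=0.96):
    """Percentile interval covering roughly `level` of the replica values."""
    ordered = sorted(values)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(math.floor(tail * (n - 1)))]
    upper = ordered[int(math.ceil((1.0 - tail) * (n - 1)))]
    return lower, upper


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def majority_baseline(train, held_out):
    """Placeholder learner: predict the most frequent training label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, label in held_out if label == majority) / len(held_out)
```

Running `bootstrap_accuracies` once per algorithm and passing the two resulting intervals to `intervals_overlap` gives the non-overlap check described in the text.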

Web crawler to search topic-specific web resources for Automated Narrative Evolution

Automated Narrative Evolution (ANE) is a tool to enable complex, unfixed temporal structures in digital narrative and facilitate the creation of works that are a hybrid of human authorship, structural design, and machine writing. Digital narrative is created by assembling a set of narrative nodes. Each narrative node is a fragment of text that is denoted semantically by a set of keywords, and also contains a set of interactive points to drive the digital narrative through user interactions. When the user clicks on an interaction point in the digital narrative, the system interprets it as a constraint and adapts the digital narrative. The general purpose of this research is to construct a web crawler to collect data from the World Wide Web, and create non-linear and interactive narratives from analyzed web contents.

Exploration of Novel Methods to Visualize Genome Evolution

Early life on Earth has left a variety of traces that can be used to reconstruct the history of life, e.g., the fossil and geological records, and information retained in living organisms. This project focuses on how information can be gained from the molecular record, i.e., information about the history of life that is retained in the structure and sequence of macromolecules found in extant organisms. The interpretation of the molecular record necessitates its calibration with respect to the geochemical and fossil records, and needs to consider and incorporate information about biochemical pathways and evolutionary theory. Comparative genome analyses have revealed genomes as mosaics in which different parts have different histories. This is caused by the exchange of genes between organisms. However, gene transfer is not so rampant as to turn genomes into assemblies of randomly selected pieces. Rather, genes are usually exchanged between closely related organisms, and exchange between distantly related organisms is rare. While the concept of a TREE of life might have to be abandoned in favor of a WEB of life, there is hope that the different parts of genomes, in particular metabolic pathways of geobiological interest, can be traced through this web and can aid the reconstruction of Earth’s early history. The research proposed here aims to develop algorithms and tools that will allow dissecting the mosaic nature of genomes in order to reconstruct the evolutionary history of individual traits in relation to other traits and to the plurality or majority consensus of genes. This will allow detection of co-evolving traits and correlation of the molecular record with the fossil and geological records. The proposed approach to comparative genome analysis will be especially useful with respect to the early evolution of life and the evolution of metabolic pathways.

Improved Visualization of the Unified Distance Matrix

We are interested in improving the visualization of the unified distance matrix (UMAT) for self-organizing maps. We are currently exploring the connected component paradigm for visualization and certain graphical distortion techniques in order to improve the interpretability of the self-organizing map.
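
For readers unfamiliar with the unified distance matrix, the sketch below computes one common form of it: for each map node, the average distance between its reference vector and those of its grid neighbours. The rectangular grid with four neighbours per node is an assumption, and the connected component and distortion techniques mentioned above are not shown.

```python
import math


def u_matrix(weights):
    """Average distance from each node's reference vector to its grid neighbours.

    weights[r][c] is the reference vector of the map node at row r, column c.
    """
    rows, cols = len(weights), len(weights[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            heights[r][c] = sum(dists) / len(dists) if dists else 0.0
    return heights
```

Large values mark nodes whose neighbours are far away in feature space, which is how cluster boundaries show up as ridges when the matrix is rendered as a height map or heat map.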

Convergence Criteria for Self-Organizing Maps

The standard approach to measuring the convergence of a self-organizing map is to measure the changes to the neural elements on the map. If the changes have "stabilized", the map is said to have converged. Various statistical criteria have been worked out to define what "stabilization" means. We approach this slightly differently in that we take a population-based approach rather than a map-based approach.
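
The standard map-based criterion can be sketched as follows; the population-based approach is not shown. The tolerance, the window length, and both function names are assumptions made for illustration.

```python
import math


def max_node_change(previous, current):
    """Largest movement of any reference vector between two training snapshots."""
    return max(math.dist(p, c)
               for prev_row, curr_row in zip(previous, current)
               for p, c in zip(prev_row, curr_row))


def has_stabilized(changes, tolerance=1e-3, window=3):
    """True once the last `window` recorded changes are all below `tolerance`."""
    return len(changes) >= window and all(ch < tolerance for ch in changes[-window:])
```

In use, `max_node_change` would be appended to a list after each training epoch and `has_stabilized` checked against that list to decide when to stop.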

Intelligent Tutoring Systems

We investigate the application of machine learning techniques to develop intelligent/adaptive tutoring systems.