DMS Home

DM Methodology

ILLM Applications and Results

ILLM (Inductive Learning by Logic Minimization) is a set of codes designed to solve classification problems. The online rule induction system available through DMS is based on algorithms from the ILLM system. To illustrate the power of the ILLM methodology, we list below some real-world data mining problems on which these methods were successfully tested.


KDD (Knowledge Discovery in Databases) Cup 1999

Description

The task for the classifier learning contest organized in conjunction with the KDD'99 conference was to learn a predictive model (i.e. a classifier) capable of distinguishing between legitimate and illegitimate connections in a computer network. The training and test data were generously made available by Prof. Sal Stolfo of Columbia University and Prof. Wenke Lee of North Carolina State University. (Update: The training and test datasets are now available in the UC Irvine KDD archive.) The classification problem involved 5 distinct classes and a cost-sensitive confusion matrix. The volume of data available for training was large: roughly 4 GB, or about 5 million samples in the training set.

A detailed description of the problem and results can be found here.

Solution

ILLM was used to produce rule sets on random excerpts of a few thousand samples. These were tested on a subset amounting to 10% of the total training data, and those scoring highest in terms of accuracy were put into a pool for classifying the test set. The final classification was performed by voting among the different models, based on the cost-sensitive confusion matrix.
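The voting step above can be sketched as follows. This is a minimal illustration (not the actual ILLM code), assuming a hypothetical cost matrix indexed as cost[true class][predicted class]: the combined prediction is the class that minimizes expected cost when the models' votes are treated as a crude estimate of the class distribution.

```python
from collections import Counter

# Hypothetical 3-class cost matrix: COST[true][pred] is the penalty
# for predicting `pred` when the true class is `true`.
COST = [
    [0, 2, 2],
    [5, 0, 2],
    [5, 2, 0],
]

def cost_sensitive_vote(predictions):
    """Combine per-model class predictions by choosing the class that
    minimizes expected misclassification cost under the vote shares."""
    counts = Counter(predictions)
    total = len(predictions)
    best_class, best_cost = None, float("inf")
    for pred in range(len(COST)):
        # Expected cost of predicting `pred` if the true class is
        # distributed like the votes.
        expected = sum(COST[true][pred] * counts.get(true, 0) / total
                       for true in range(len(COST)))
        if expected < best_cost:
            best_class, best_cost = pred, expected
    return best_class

# Example: a 2-vs-3 split between classes 0 and 1 resolves to class 1,
# because misclassifying a true class-1 case costs more (5) than the
# reverse error (2).
print(cost_sensitive_vote([0, 0, 1, 1, 1]))  # → 1
```

With a symmetric cost matrix this reduces to plain majority voting; the asymmetric costs are what let the ensemble prefer expensive-to-miss classes.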

Score

There were 24 solutions in the contest in total; the ILLM solution placed 17th. KDD Cup 1999 stimulated us to work on new ways of combining classifiers and on changes to the objective function of the ILLM rule-search algorithm.

CoIL (Computational Intelligence and Learning) Challenge 2000

Description

The CoIL Challenge was a data mining competition organized by the Computational Intelligence and Learning Cluster, a network of excellence sponsored by the EU. It was held in the period March-May 2000; in total, 43 solutions were submitted (147 participants registered for the challenge). The goal of the challenge was to predict and explain policy ownership, a direct marketing case from the insurance sector. You can read more about the problem tasks and results at the CoIL Challenge site.

Solution

The challenge had two tasks, which were evaluated separately. Since the ILLM methodology can produce efficient and informative classifiers in the form of conjunctions of literals, the prediction and description tasks were tackled simultaneously. In the experimental phase we used 5-fold cross-validation on the training set to find an "optimal" set of parameters for the ILLM rule-set search algorithm, one that would produce robust rule sets with high lift. The final rule set achieved high lift and also gave an informative description of about half a dozen distinct customer subgroups. A detailed description of the results and submissions can be obtained from the report prepared by P. van der Putten and M. van Someren.

Score

  1. Prediction task: 9th of 43
  2. Description task: 2nd of 43

NIPS (Neural Information Processing Systems) Unlabeled Data Competition 2000

Description

This competition, whose full title was "Unlabeled Data Supervised Learning Competition", presented a set of 11 supervised learning problems (classification and regression). Its main aim was to test algorithms on a novel approach: using unlabeled data to improve learning. The organizers were the GNCG (Guelph Natural Computation Group), an interdisciplinary group of faculty, graduate, and undergraduate students at the University of Guelph. A detailed description of the competition can be found at this site.

Solution

Different solutions were applied to different problems. In general, an iterative procedure was used in which some of the unlabeled samples gradually enlarged the training set, relying on the noise-elimination technique available in the ILLM system.
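The iterative enlargement of the training set can be sketched as a generic self-training loop. This is an assumed simplification, not the ILLM procedure: confidently predicted unlabeled samples are moved into the training set each round (a noise-elimination pass would then remove suspect samples, which is omitted here). The `fit`/`predict_proba` callables and the toy 1-D model are hypothetical stand-ins.

```python
import math

def self_train(fit, predict_proba, X_train, y_train, X_unlabeled,
               threshold=0.95, rounds=5):
    """Repeatedly move high-confidence unlabeled samples into the
    training set, then refit on the enlarged set."""
    X_train, y_train = list(X_train), list(y_train)
    pool = list(X_unlabeled)
    for _ in range(rounds):
        model = fit(X_train, y_train)
        keep = []
        for x in pool:
            probs = predict_proba(model, x)
            label = max(range(len(probs)), key=probs.__getitem__)
            if probs[label] >= threshold:
                X_train.append(x)
                y_train.append(label)
            else:
                keep.append(x)
        if len(keep) == len(pool):  # no new samples were labeled
            break
        pool = keep
    return fit(X_train, y_train)

# Toy 1-D classifier: the "model" is a threshold at the midpoint of
# the two class means; confidence grows with distance from it.
def fit(X, y):
    m0 = sum(x for x, c in zip(X, y) if c == 0) / max(1, y.count(0))
    m1 = sum(x for x, c in zip(X, y) if c == 1) / max(1, y.count(1))
    return (m0 + m1) / 2

def predict_proba(model, x):
    p1 = 1 / (1 + math.exp(-(x - model)))
    return [1 - p1, p1]

# The far-out points (5.0 and -4.0) get confidently self-labeled and
# enlarge the training set; the ambiguous ones (0.1, 0.9) stay unlabeled.
model = self_train(fit, predict_proba, [0.0, 1.0], [0, 1],
                   [0.1, 0.9, 5.0, -4.0], threshold=0.9)
print(model)  # → 0.5
```

The confidence threshold plays the role the text assigns to noise elimination: it keeps samples likely to be mislabeled out of the growing training set.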

Score

The ILLM system was used on 3 of the 11 problems. The overall score was 7th (p1: 1st; p5: 6th and 11th; p7: 14th).

CINC (Computers in Cardiology) 2001

Description

The challenge required the development of a fully automated method to predict the onset of paroxysmal atrial fibrillation/flutter (PAF), based on the ECG prior to the event. Atrial fibrillation is associated with increased risk of stroke and cardiac disease, and is the most common major cardiac arrhythmia, affecting an estimated 2.2 million people in the United States alone. Currently, no reliable validated methods exist to predict the onset of PAF. Given recent advances in clinical electrophysiology, a prediction tool that would allow detection of imminent atrial fibrillation is an important step toward the application of targeted therapies that may increase longevity and improve the quality of life for many people. The challenge was organized by PhysioNet and the NIH/NCRR, who provided large, well-characterized, and freely available data sets for the competitors. For details of the contest rules, background information, data, and software, please visit PhysioNet. Two different tasks were posed to contestants: Task 1 was screening for patients that have atrial fibrillation; Task 2 was to devise a procedure that would indicate the possible imminent onset of a PAF episode for patients suffering from the illness.

Solution

Since the data was in a special format, the organizers provided software capable of reading and analyzing the signals in the datasets. This software was used to extract features for the learning process. The quality of the extracted features was evaluated on the training set using the ILLM code system. The best set of features was used to construct a rule set which, in conjunction with a special testing algorithm, can signal the possible onset of atrial fibrillation (PAF). A similar procedure was used to construct the solution for classifying patients into the group prone to PAF.
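The feature-quality evaluation step can be sketched generically: score each candidate feature subset on the training set and keep the best one. This is a minimal illustration, not the ILLM evaluation; the feature names and scores below are entirely hypothetical stand-ins for ECG-derived features and their training-set accuracies.

```python
# A minimal sketch of ranking candidate feature sets: train/evaluate a
# classifier on each subset and keep the one with the highest score.
def best_feature_set(candidates, evaluate):
    """candidates: iterable of feature-name tuples;
    evaluate: maps a feature set to a quality score (e.g. accuracy)."""
    return max(candidates, key=evaluate)

# Hypothetical scores standing in for the training-set accuracy of a
# rule set learned from each feature subset.
scores = {
    ("hrv_mean", "pac_count"): 0.71,
    ("hrv_mean", "pac_count", "rr_std"): 0.78,
    ("rr_std",): 0.64,
}
print(best_feature_set(scores.keys(), scores.get))
```

In practice `evaluate` would be cross-validated rather than a raw training-set score, to avoid selecting features that merely overfit.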

Score

7th place for prediction of the PAF onset (Task 2).

PTC (Predictive Toxicology Challenge) 2001

Description

Prevention of environmentally induced cancers is a health issue of unquestionable importance. Almost every sphere of human activity in an industrialized society faces potential chemical hazards of some form. It is estimated that nearly 100,000 chemicals are in use in large amounts every day, with a further 500-1000 added every year. Only a small fraction of these chemicals have been evaluated for toxic effects such as carcinogenicity. The US National Toxicology Program (NTP) contributes to this enterprise by conducting standardized chemical bioassays (exposure of rodents, mice and rats, to a range of chemicals) to help identify substances that may have carcinogenic effects on humans. However, obtaining empirical evidence from such bioassays is expensive and usually too slow to cope with the number of chemicals that can adversely affect humans on exposure. This has resulted in an urgent need for carcinogenicity models based on chemical structures and properties; it is envisaged that such models would help address this need. The Predictive Toxicology Challenge was devised to give machine learning programs the opportunity to participate in an enterprise of immense humanitarian and scientific value. The challenge for the year 2001 was to obtain models that predict the outcome of biological tests for the carcinogenicity of chemicals using information related to chemical structure only. Competitors had to submit results for 4 models for each test chemical. A complete description of the challenge is given here.

Solution

We used two sets of descriptors to train classifiers with the ILLM system. Several different models were produced using different subsets of descriptors and optimized for their predictive power. For each predictive task we produced 3 models, differing with respect to their sensitivity threshold, because the models were evaluated using ROC curve methodology.
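The three sensitivity variants per task correspond to three operating points on a ROC curve. The sketch below (illustrative only; the scores and thresholds are hypothetical) shows how varying the decision threshold on one model's scores traces out (false positive rate, true positive rate) points.

```python
def roc_point(y_true, scores, threshold):
    """(FPR, TPR) obtained by classifying score >= threshold as positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    pos = y_true.count(1)
    neg = y_true.count(0)
    return fp / neg, tp / pos

# Hypothetical 0/1 labels and model scores; three thresholds mimic
# the three sensitivity variants submitted per task.
y = [1, 0, 1, 1, 0, 0, 1, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
for t in (0.75, 0.5, 0.25):
    print(t, roc_point(y, s, t))
```

Lowering the threshold raises sensitivity (TPR) at the cost of a higher false positive rate, which is exactly the trade-off the ROC evaluation rewards or penalizes.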

Score

Only about a dozen research groups participated in the final model-evaluation phase. The ILLM results were, on average, in the better half of all the models evaluated.



© 2001 LIS - Rudjer Boskovic Institute
Last modified: January 25 2006 15:41:09.