Data Preparation

Applying DM modelling tools to data and judging the resulting models by their predictive or descriptive value is fairly straightforward. That does not diminish the importance of careful data preparation. The data preparation process is roughly divided into four steps: data selection, data cleaning, construction of new data and data formatting.

Select data

A subset of the data acquired in previous stages is selected according to the criteria emphasized in those stages.

Data cleaning

This step complements the previous one. It is also the most time-consuming step, because many techniques can be applied to optimize data quality for the subsequent modelling stage.
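As a minimal sketch of what such cleaning might look like, the Python below assumes two common techniques that are chosen here purely for illustration: removing exact duplicate records and replacing missing numeric values with the mean of the observed values. The record layout, the field names and the function name `clean_records` are hypothetical.

```python
from statistics import mean

def clean_records(records, numeric_field):
    """Drop exact duplicate records, then fill missing values of one
    numeric field with the mean of the observed values.
    (Illustrative assumption: missing values are represented as None.)"""
    # Remove exact duplicates while preserving the original order.
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))  # copy, so the input is left untouched

    # Compute the mean over the values that are actually present.
    observed = [r[numeric_field] for r in unique if r[numeric_field] is not None]
    fill = mean(observed) if observed else None

    # Replace missing values with that mean.
    for rec in unique:
        if rec[numeric_field] is None:
            rec[numeric_field] = fill
    return unique

raw = [
    {"id": 1, "age": 30},
    {"id": 1, "age": 30},    # exact duplicate
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": 50},
]
cleaned = clean_records(raw, "age")
```

Mean imputation is only one possible choice; whether it is appropriate depends on the data and on the modelling tool used later.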

New data construction

This step consists of constructive operations on the selected data.
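One operation that is often counted as constructive is deriving a new attribute from existing ones. The sketch below illustrates that idea under stated assumptions: the customer records, the attribute names and the helper `add_derived_attribute` are all hypothetical and only serve as an example.

```python
def add_derived_attribute(records, name, func):
    """Return new records, each extended with attribute `name`
    computed by applying `func` to the original record."""
    return [{**rec, name: func(rec)} for rec in records]

customers = [
    {"id": 1, "total_spent": 200.0, "n_orders": 4},
    {"id": 2, "total_spent": 90.0, "n_orders": 3},
]

# Derive the average order value from two existing attributes.
extended = add_derived_attribute(
    customers, "avg_order", lambda r: r["total_spent"] / r["n_orders"]
)
```

The original records are left unchanged, so the derived attribute can be dropped or recomputed without touching the selected data.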

Data formatting

The final data preparation step consists of syntactic modifications to the data that do not change its meaning but are required by the particular modelling tool chosen for the DM task.

DM practitioners also refer to a standard form of data, although no standard data format exists that can be readily read by all modelling tools. Standard form refers primarily to readable data types. In standard form, a categorical variable is transformed into m binary variables, where m is the number of possible values of that variable. Since distinct DM modelling tools usually prefer either categorical or ordered attributes, the standard form is a data presentation that is uniform and effective across a wide spectrum of DM modelling tools and other exploratory tools.
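The transformation of a categorical variable into m binary variables can be sketched as follows. This is a minimal illustration rather than a prescribed implementation: the function name `to_binary_variables` and the sample column are hypothetical, and the sketch assumes that the set of possible values is exactly the set of values observed in the column.

```python
def to_binary_variables(values):
    """Encode a categorical column as m binary columns, one per
    distinct value, where m is the number of distinct values.
    (Assumption: every possible value occurs at least once in `values`.)"""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

colors = ["red", "green", "blue", "green"]
categories, rows = to_binary_variables(colors)
# categories is ['blue', 'green', 'red'], so m = 3
# rows is [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each row contains exactly one 1, in the column of the category that the original value belonged to.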

© 2001 LIS - Rudjer Boskovic Institute
Last modified: February 01 2002 13:31:56.