DMS Tutorial - Data Preparation

Data Preparation

It is rather straightforward to apply DM modelling tools to data and to judge the resulting models by their predictive or descriptive value. This does not diminish the importance of careful data preparation. The data preparation process is roughly divided into data selection, data cleaning, construction of new data and data formatting.

Select data

A subset of the data acquired in previous stages is selected according to criteria established in those stages:

data quality properties: completeness and correctness
technical constraints such as limits on data volume or data type: these mostly stem from the data mining tools chosen earlier for modelling

Data cleaning

This step complements the previous one. It is also the most time-consuming step, because many techniques can be applied to optimize data quality for the subsequent modelling stage. Possible techniques for data cleaning include:

Data normalization. Examples are decimal scaling, which brings values into the range (-1,1), or (0,1) for positive data, and standard deviation normalization.
Data smoothing. Discretization of numeric attributes is one example; it is helpful or even necessary for logic-based methods.
Treatment of missing values. There is no simple and safe solution for cases where some attributes have a significant number of missing values. Generally, it is good to experiment with and without these attributes in the modelling phase, in order to find out how important the missing values are. Simple solutions are: a) replace all missing values with a single global constant, b) replace a missing value with its feature mean, c) replace a missing value with the mean of its feature within the example's class. The main flaw of the simple solutions is that the substituted value is not the correct value, so the data will be biased. If the missing values are confined to only a few features, we can delete the examples containing missing values, or delete the attributes containing most of the missing values. A more sophisticated solution is to predict the missing values with a data mining tool; predicting missing values then becomes a special data mining prediction problem.
Data reduction. There are usually two reasons for data reduction: either the data may be too big for the program, or the expected time to obtain a solution may be too long. The techniques for data reduction are usually effective but imperfect. The most common step in reducing data dimension is to examine the attributes and consider their predictive potential. Some attributes can usually be discarded, either because they are poor predictors or because they are redundant relative to some other good attribute. Methods for data reduction through attribute removal include: a) attribute selection from means and variances, b) principal component analysis, c) merging features using a linear transform.
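
The normalization and simple missing-value treatments listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, missing values are assumed to be represented as None, and the population standard deviation is used for the standard deviation normalization. Decimal scaling divides by a power of ten, so signed data ends up in (-1,1) and non-negative data in [0,1).

```python
from statistics import mean, pstdev


def decimal_scale(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    power = 0
    while largest / (10 ** power) >= 1:
        power += 1
    return [v / (10 ** power) for v in values]


def zscore(values):
    """Standard deviation normalization: subtract the mean, divide by the std."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(v - mu) / sigma for v in values]


def impute_feature_mean(values):
    """Solution b): replace None with the mean of the observed feature values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_class_mean(values, labels):
    """Solution c): replace None with the feature mean within the same class."""
    observed = [v for v in values if v is not None]
    overall = mean(observed)  # fallback for a class with no observed values
    by_class = {}
    for v, y in zip(values, labels):
        if v is not None:
            by_class.setdefault(y, []).append(v)
    fills = {y: mean(vs) for y, vs in by_class.items()}
    return [fills.get(y, overall) if v is None else v
            for v, y in zip(values, labels)]
```

For instance, impute_class_mean([1.0, None, 5.0, None, 7.0], ["a", "a", "b", "b", "b"]) fills the first gap with the class "a" mean and the second with the class "b" mean. Both imputations still substitute a value that is not the true one, so the bias discussed above remains.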

New data construction

This step comprises constructive operations on the selected data, which include:

derivation of new attributes from two or more existing attributes
generation of new records (samples)
data transformation: data normalization (numerical attributes), data smoothing
merging tables: joining two or more tables that have different attributes for the same objects
aggregations: operations in which new attributes are produced by summarizing information from multiple records and/or tables into new tables with "summary" attributes
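
As a rough illustration of attribute derivation, aggregation and table merging, the sketch below uses plain Python dictionaries. The two tables, their attribute names (income, household_size, amount) and the helper functions are all hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical tables: one record per customer and one record per purchase.
customers = [
    {"customer_id": 1, "income": 3000.0, "household_size": 3},
    {"customer_id": 2, "income": 2000.0, "household_size": 2},
    {"customer_id": 3, "income": 1500.0, "household_size": 1},
]
purchases = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]


def derive(rows):
    """Derive a new attribute from two existing attributes."""
    return [dict(r, income_per_person=r["income"] / r["household_size"])
            for r in rows]


def aggregate(rows):
    """Summarize multiple purchase records into one summary record per customer."""
    summary = defaultdict(lambda: {"n_purchases": 0, "total_amount": 0.0})
    for p in rows:
        s = summary[p["customer_id"]]
        s["n_purchases"] += 1
        s["total_amount"] += p["amount"]
    return dict(summary)


def merge(rows, summary):
    """Join the summary attributes onto the customer records by customer_id."""
    empty = {"n_purchases": 0, "total_amount": 0.0}  # customers without purchases
    return [dict(r, **summary.get(r["customer_id"], empty)) for r in rows]


prepared = merge(derive(customers), aggregate(purchases))
```

Each record in prepared now carries the original attributes, the derived income_per_person attribute and the two "summary" attributes. Keeping customers that have no purchases (a left join) is a design choice for this sketch, not a requirement of the method.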

Data formatting

This final data preparation step consists of syntactic modifications to the data that do not change its meaning but are required by the particular modelling tool chosen for the DM task. These include:

reordering of attributes or records: some modelling tools require a particular order of attributes (or records) in the dataset, such as putting the target attribute at the beginning or at the end, or randomizing the order of records (required by neural networks, for example)
changes related to the constraints of modelling tools: removing commas, tabs or special characters, trimming strings to the maximum allowed number of characters, replacing special characters with characters from an allowed set
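
The formatting operations above might look as follows in Python. The allowed character set, the 20-character limit and the function names are assumptions made for the example; a real modelling tool defines its own constraints.

```python
import random
import re


def reorder_columns(header, rows, target):
    """Move the target attribute to the last column of the header and every row."""
    idx = header.index(target)
    order = [i for i in range(len(header)) if i != idx] + [idx]
    new_header = [header[i] for i in order]
    new_rows = [[row[i] for i in order] for row in rows]
    return new_header, new_rows


def shuffle_records(rows, seed=0):
    """Return the records in random order; the seed makes the order repeatable."""
    shuffled = list(rows)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled


def clean_string(value, max_len=20):
    """Replace commas, tabs and other disallowed characters with '_', then trim."""
    cleaned = re.sub(r"[^A-Za-z0-9 _-]", "_", value)
    return cleaned[:max_len]
```

None of these functions changes the meaning of the data: the same attributes and records are present afterwards, only their order and spelling differ.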

There is also what DM practitioners call the standard form of data (although there is no standard data format that can be readily read by all modelling tools). Standard form refers primarily to readable data types:

binary variables (1 for true, 0 for false)
ordered variables (numerics)

In the standard form of data, categorical variables are transformed into m binary variables, where m is the number of possible values of the particular variable. Since individual DM modelling tools usually prefer either categorical or ordered attributes, the standard form is a data representation that is uniform and effective across a wide spectrum of DM modelling tools and other exploratory tools.
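
A minimal sketch of this transformation is shown below: a categorical variable with m distinct values becomes m binary variables. The function name and the "is_" prefix for the new variable names are hypothetical, and the set of possible values is assumed to be the set of values observed in the data.

```python
def to_standard_form(values):
    """Turn one categorical variable into m binary (0/1) variables."""
    categories = sorted(set(values))  # the m possible values, in a fixed order
    names = ["is_" + str(c) for c in categories]
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return names, encoded


names, encoded = to_standard_form(["red", "green", "red", "blue"])
# names   -> ['is_blue', 'is_green', 'is_red']
# encoded -> [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Exactly one of the m binary variables is 1 in each record, so the original category can always be recovered from the encoded row.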