Data Preparation
It is rather straightforward to apply DM modelling tools to data and to judge the resulting models by their predictive or descriptive value. This does not diminish the need for careful attention to data preparation. The data preparation process is roughly divided into data selection, data cleaning, formation of new data and data formatting. A subset of the data acquired in previous stages is selected based on criteria stressed in those stages:
- data quality properties: completeness and correctness
- technical constraints such as limits on data volume or data type: these mostly follow from the data mining tools chosen earlier for modelling
- Data normalization. Examples are decimal scaling into the range (0,1) and standard deviation normalization.
- Data smoothing. Discretization of numeric attributes is one example; it is helpful or even necessary for logic-based methods.
- Treatment of missing values. There is no simple and safe solution for cases where some attributes have a significant number of missing values. Generally, it is good to experiment with and without these attributes in the modelling phase, in order to find out how important the missing values are. Simple solutions are: a) replace all missing values with a single global constant, b) replace a missing value with its feature mean, c) replace a missing value with its feature and class mean. The main flaw of the simple solutions is that the substituted value is not the correct value, so the data will be biased. If the missing values are confined to only a few features, we can delete the examples containing missing values, or delete the attributes containing most of the missing values. A more sophisticated solution is to predict the missing values with a data mining tool; predicting missing values then becomes a special data mining prediction problem.
- Data reduction. The reasons for data reduction are in most cases twofold: either the data may be too big for the program, or the expected time for obtaining the solution might be too long. The techniques for data reduction are usually effective but imperfect. The most usual step in reducing data dimension is to examine the attributes and consider their predictive potential. Some of the attributes can usually be discarded, either because they are poor predictors or because they are redundant relative to some other good attribute. Some of the methods for data reduction through attribute removal are: a) attribute selection from means and variances, b) principal component analysis, c) merging features using a linear transform.
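The normalization, discretization and missing-value options listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, missing values are represented by None, equal-width binning is just one possible discretization scheme, and the population standard deviation is used for the standard deviation normalization.

```python
from statistics import mean, pstdev


def decimal_scale(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    k = 0
    while largest / (10 ** k) >= 1:
        k += 1
    return [v / (10 ** k) for v in values]


def zscore(values):
    """Standard deviation normalization: subtract the mean, divide by the std."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        # A constant attribute carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


def equal_width_bins(values, n_bins):
    """Discretize numeric values into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def impute_feature_mean(values):
    """Option (b): replace None with the mean of the observed values.

    Assumes at least one value is observed.
    """
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_class_mean(values, labels):
    """Option (c): replace None with the mean of the observed values of the same class."""
    by_class = {}
    for v, y in zip(values, labels):
        if v is not None:
            by_class.setdefault(y, []).append(v)
    class_means = {y: mean(vs) for y, vs in by_class.items()}
    # Assumption: a class with no observed value falls back to the overall mean.
    overall = mean(v for v in values if v is not None)
    return [class_means.get(y, overall) if v is None else v
            for v, y in zip(values, labels)]
```

Both imputation functions replace a missing value with an estimate rather than the true value, so comparing models trained with and without the imputed attribute remains a sensible check.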
- derivation of new attributes from two or more existing attributes
- generation of new records (samples)
- data transformation: data normalization (numerical attributes), data smoothing
- merging tables: joining together two or more tables having different attributes for same objects
- aggregations: operations in which new attributes are produced by summarizing information from multiple records and/or tables into new tables with "summary" attributes
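Merging and aggregation can be illustrated with plain Python dictionaries. The tables, attribute names and values below are invented for the example, and the helper functions are hypothetical; a real project would more likely rely on a database or a data-frame library for the same operations.

```python
from collections import defaultdict

# Hypothetical example tables: one row per customer, or several per customer.
customers = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 51},
]
regions = [
    {"customer_id": 1, "region": "north"},
    {"customer_id": 2, "region": "south"},
]
purchases = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 10.0},
]


def merge_tables(left, right, key):
    """Inner join: combine rows of two tables that describe the same object.

    Assumes the key is unique within the right-hand table.
    """
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


def aggregate(rows, key, field):
    """Summarize many records per object into one row of summary attributes."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[field])
    return [
        {
            key: k,
            f"{field}_count": len(v),
            f"{field}_sum": sum(v),
            f"{field}_mean": sum(v) / len(v),
        }
        for k, v in groups.items()
    ]


# Join the two attribute tables, then attach the per-customer purchase summary.
merged = merge_tables(customers, regions, "customer_id")
summary = aggregate(purchases, "customer_id", "amount")
dataset = merge_tables(merged, summary, "customer_id")
```

The resulting dataset has one record per customer, with the original attributes next to the new summary attributes.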
- reordering of the attributes or records: some modelling tools require a particular order of attributes or records in the dataset, for example putting the target attribute at the beginning or at the end, or randomizing the order of records (required by neural networks, for example)
- changes related to the constraints of modelling tools: removing commas or tabs and other special characters, trimming strings to the maximum allowed number of characters, replacing special characters with characters from the allowed set
- binary variables (1 for true; 0 for false)
- ordered variables (numerics)
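The formatting steps mentioned above (putting the target attribute last, randomizing the order of records, coding binary variables as 1 and 0) could look roughly as follows. The function names and the list-of-rows table layout are assumptions made for this sketch, not requirements of any particular modelling tool.

```python
import random


def move_target_last(header, rows, target):
    """Reorder the columns so that the target attribute comes last."""
    idx = header.index(target)
    order = [i for i in range(len(header)) if i != idx] + [idx]
    new_header = [header[i] for i in order]
    new_rows = [[row[i] for i in order] for row in rows]
    return new_header, new_rows


def shuffle_records(rows, seed=0):
    """Return the records in random order; a fixed seed keeps runs reproducible."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def encode_binary(rows):
    """Replace True/False with 1/0; all other values are left unchanged."""
    return [[int(v) if isinstance(v, bool) else v for v in row] for row in rows]
```

A possible call order is move_target_last first, then encode_binary, then shuffle_records, so that the shuffled file is already in the layout the modelling tool expects.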
© 2001 LIS - Rudjer Boskovic Institute
Last modified: February 01 2002 13:31:56.