DMS Tutorial - Data Understanding

Data Understanding

After setting up the problem and a rough plan for its solution, we can proceed to the central item in the DM process: the data. Several things must be learned about the data before data mining techniques are actually applied.

Collect initial data

The first step is the preliminary acquisition of the data and any preparation necessary for further processing. The data acquisition process should produce the following outputs:

list of data acquired
location of data and methods used for acquiring
problems and solutions in preliminary acquisition

Data description

After acquiring the data, we have to describe it. This primarily means stating the volume of the data (number of examples and attributes), the identities and meanings of the individual attributes, and the initial format of the data.
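As an illustration, a data description can be drafted automatically from the acquired file. The following is a minimal sketch, assuming the data arrives as CSV text with a header row. The function names, the sample data and the rule "an attribute is numeric if every non-empty value parses as a number" are assumptions made for this example, not part of the methodology itself. The meanings of the attributes still have to be written down by hand.

```python
import csv
import io


def _is_number(text):
    """Return True if the text parses as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def describe(rows):
    """Summarise a list of dict records: volume plus a rough type per attribute."""
    attributes = list(rows[0].keys()) if rows else []
    summary = {
        "n_examples": len(rows),
        "n_attributes": len(attributes),
        "attributes": {},
    }
    for name in attributes:
        # Empty strings are treated as missing and ignored for the type guess.
        values = [r[name] for r in rows if r[name] != ""]
        is_numeric = bool(values) and all(_is_number(v) for v in values)
        summary["attributes"][name] = "numeric" if is_numeric else "nominal"
    return summary


# Hypothetical sample data, invented purely for illustration.
SAMPLE_CSV = """age,colour,income
34,red,4100.5
51,blue,
29,red,3900
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = describe(rows)
print(report)
```

The resulting dictionary covers the volume and a first guess at attribute types. It is only a starting point for the written description.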

Explore data

The third step in data understanding is not obligatory, but it is useful in many respects. The main role of data exploration, or data surveying, at this stage is to find out from the general structure of the data whether the extracted data set(s) contain a useful amount of information. Exploration is not concerned with the answer to the problem; that is the task of DM modelling techniques. Basic exploration applies simple statistical techniques to reveal basic properties of the acquired data. For nominal attributes, this means examining multi-way frequency tables; for numeric attributes, it means examining the distributions of values of individual attributes and studying correlation matrices. Together these should identify the main patterns in the data. There are also more sophisticated methods (see Pyle, 1999) that can give more information about the potential of the data for solving the problem.
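The two basic techniques named above can be sketched in a few lines. The example below is a minimal illustration, assuming the data is held as a list of dictionaries. It builds a two-way frequency table for a pair of nominal attributes and a Pearson correlation matrix for the numeric ones. The records and attribute names are invented for the example, and Pearson's coefficient is only one possible choice of correlation measure.

```python
from collections import Counter
from math import sqrt


def frequency_table(records, attr_a, attr_b):
    """Two-way frequency table: count of each (attr_a, attr_b) value pair."""
    return Counter((r[attr_a], r[attr_b]) for r in records)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when an attribute is constant
    return cov / sqrt(var_x * var_y)


def correlation_matrix(records, numeric_attrs):
    """Nested dict holding the correlation of every pair of numeric attributes."""
    columns = {a: [r[a] for r in records] for a in numeric_attrs}
    return {
        a: {b: pearson(columns[a], columns[b]) for b in numeric_attrs}
        for a in numeric_attrs
    }


# Hypothetical records, invented purely for illustration.
records = [
    {"colour": "red", "size": "S", "height": 1.0, "weight": 2.0},
    {"colour": "red", "size": "L", "height": 2.0, "weight": 4.1},
    {"colour": "blue", "size": "S", "height": 3.0, "weight": 5.9},
    {"colour": "red", "size": "S", "height": 4.0, "weight": 8.0},
]

table = frequency_table(records, "colour", "size")
matrix = correlation_matrix(records, ["height", "weight"])
print(table)
print(matrix)
```

A frequency table dominated by a single cell, or a correlation matrix with entries close to 1 or -1, points to redundancy or strong structure that is worth noting before modelling.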

Verification of data quality

At this stage, checks can be made on the data that should improve the final modelling results. These may include checking the consistency of individual attribute values and types, the quantity and distribution of missing values, and the presence of outliers. Outliers fall into two distinct categories: they represent either errors or true but rare phenomena. For modelling tools that are not robust to outliers, it is advisable to exclude outlying examples before the actual modelling. Checking in this phase deals with the completeness and correctness of the data. Completeness concerns the proportion and regularity of missing values in the data. Correctness concerns the discovery of erroneous values in the data, their extent, and possible remedies. In both cases, data exploration results are crucial for dealing with these problems.
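A completeness check and a simple outlier screen might look like the sketch below. The text does not prescribe a particular outlier test. The 1.5 x IQR rule used here is a common convention chosen for illustration, and the function names, the set of values treated as missing, and the sample values are all assumptions.

```python
from statistics import quantiles


def missing_proportion(records, attr, missing=(None, "")):
    """Fraction of records whose value for attr counts as missing."""
    if not records:
        return 0.0
    n_missing = sum(1 for r in records if r.get(attr) in missing)
    return n_missing / len(records)


def iqr_outliers(values, k=1.5):
    """Values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical attribute values, invented purely for illustration.
incomes = [10, 12, 11, None, 13, 12, 11, 95]
records = [{"income": v} for v in incomes]

share_missing = missing_proportion(records, "income")
present = [v for v in incomes if v is not None]
suspects = iqr_outliers(present)
print(share_missing, suspects)
```

The flagged values are only candidates. Whether each one is an error or a true but rare phenomenon still has to be decided by inspecting it before it is excluded.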