Data preprocessing and data reduction

The set of techniques applied before data mining is known as data preprocessing, and it is regarded as one of the most important stages of the well-known knowledge discovery in databases (KDD) process.

Raw data are likely to be imperfect, containing inconsistencies and redundancies, so in their initial form they are not directly suitable for data mining. We must also mention the rapid growth in the rate and volume of data generation in business, industrial, academic and scientific applications.

The huge amounts of data collected nowadays require more sophisticated mechanisms to analyze them properly. Data preprocessing adapts the data to the requirements of each data mining algorithm, enabling processing that would otherwise be infeasible.

Although data preprocessing is a powerful tool that enables the user to treat and process complex data, it may consume large amounts of processing time. It covers a wide range of disciplines, grouped into data preparation and data reduction techniques.

The former includes data transformation, integration, cleaning and normalization, while the latter aims to reduce the complexity of the data through feature selection, instance selection or discretization. After a successful data preprocessing stage, the final data set can be regarded as a reliable and suitable source for any data mining algorithm.
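
As a rough illustration of one technique from each group, the sketch below applies min-max normalization (data preparation) and equal-width discretization (data reduction) to a single numeric attribute. The function names and the sample `ages` values are hypothetical, chosen only for this example; equal-width binning is one of the simplest discretizers and is not necessarily the one studied in this work.

```python
def min_max_normalize(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_discretize(values, n_bins):
    """Map each value to the index of its equal-width bin (0 .. n_bins - 1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical sample attribute.
ages = [18, 22, 35, 47, 60, 78]
normalized = min_max_normalize(ages)
bins = equal_width_discretize(ages, n_bins=3)
print(bins)  # [0, 0, 0, 1, 2, 2]
```

Normalization keeps every distinct value but changes its scale, whereas discretization collapses six distinct values into three interval labels, which is why it counts as a reduction technique.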

Among the long list of data preprocessing techniques, this thesis focuses on data reduction, specifically on discretization and instance selection (IS). The aim of data reduction is to provide a more manageable training set, in terms of complexity and size, in order to improve the accuracy, memory use and running time of the subsequent data mining (DM) phase. Several families of techniques belong to data reduction; here we highlight the most relevant:

  • Feature selection (FS): “the process of identifying and removing as much irrelevant and redundant information as possible”. The goal is to obtain a subset of features from the original problem that still describes it appropriately. This subset is commonly used to train a learner, with added benefits reported in the specialized literature. FS can remove irrelevant and redundant features that may induce accidental correlations in learning algorithms, diminishing their generalization ability. FS is also known to decrease the risk of over-fitting and to reduce the feature space, making the learning process faster and less memory-consuming.
  • Feature extraction (FE): the process of generating new features by transforming the training input space into a new space that better describes the problem. In FE, original attributes can be removed, kept, or used to create new artificial attributes. Linear and non-linear space transformations and statistical techniques such as principal component analysis or singular value decomposition are classical algorithms in this field.
  • Instance selection (IS): a series of techniques that select a subset of the data to replace the original data set while still fulfilling the learning goal defined at the start. We must distinguish between instance selection, which implies an informed categorization of instances, and data sampling, which is a more randomized approach.
  • Instance generation (IG): besides selecting data, IG methods may generate new artificial examples that replace the original data. IG allows us to fill regions of the input domain where no representative examples exist, or to condense large numbers of instances in crowded regions. IG methods are often called prototype generation methods, as each artificial example tends to act as a pivotal example for a region or a subset of the original instances.
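
To make the difference between the last two families concrete, the sketch below contrasts a simplified condensed nearest neighbour rule (an instance selection method that keeps only original instances) with class centroids (a very basic form of prototype generation that creates artificial instances). Both are swapped in here purely for illustration and are not claimed to be the methods developed in this work; the function names and the toy `data` set are hypothetical.

```python
import math


def nearest_label(point, prototypes):
    """Return the label of the prototype closest to `point` (1-NN rule)."""
    return min(prototypes, key=lambda p: math.dist(point, p[0]))[1]


def condensed_nn(data):
    """Instance selection: keep the instances the current subset misclassifies.

    Simplified condensed nearest neighbour rule. `data` is a list of
    (features, label) pairs; the result contains only original instances.
    """
    selected = [data[0]]
    changed = True
    while changed:
        changed = False
        for instance in data:
            if instance in selected:
                continue  # already kept, nothing to do
            features, label = instance
            if nearest_label(features, selected) != label:
                selected.append(instance)
                changed = True
    return selected


def class_centroids(data):
    """Instance generation: replace each class by one artificial mean prototype."""
    groups = {}
    for features, label in data:
        groups.setdefault(label, []).append(features)
    return [
        (tuple(sum(column) / len(rows) for column in zip(*rows)), label)
        for label, rows in groups.items()
    ]


# Hypothetical two-class toy data set.
data = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.8, 1.2), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.8), "b"), ((4.8, 5.2), "b"),
]
subset = condensed_nn(data)         # two original instances, one per class
prototypes = class_centroids(data)  # two artificial points near (1, 1) and (5, 5)
```

In this toy case both approaches shrink six instances to two, but the selected subset consists of points that already existed in the data, while the centroids are computed values that need not coincide with any original instance.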