The problem of dirty data
What is dirty data? Before we get into how to deal with dirty data, let us define exactly what it is. Dirty data is a term for information that is misleading, incorrect, or inconsistently formatted, typically collected by means of data capture forms. It includes spelling mistakes and poor punctuation, incomplete or outdated records, and data that is duplicated in the database.

But what is the cost of dirty data to a business? It is hard to put a figure on it, and that is a big part of the problem. We may know that bad data exists in our systems, but because we cannot quantify it we tend to ignore it, which is obviously the wrong approach. And the amount of dirty data in the system will keep increasing as we give more people the freedom to enter their own information.

Where is it coming from? System data can degrade very rapidly, starting with customer information such as misspelled names, wrong addresses, and missing fields. Errors like these accumulate within days, and in a few weeks the sales figures will no longer make sense. A database that offers any kind of unsecured access can become unreliable, and ultimately worthless, within two months. Even in a professionally designed and operated system, where the data is strictly controlled, errors exist.

But when does dirty data become a problem? Large companies have database administrators who train users on data entry, enforce rules for data validation, and run software to support data cleansing. Unfortunately, smaller businesses might not even know they have a problem.

So now we know what dirty data is and how it gets there; what can we do about it? The basic steps for dealing with bad data are the same whatever the source: detect it, set a toleration level, and remove it. The first two steps are the most difficult to implement and require a degree of sophistication.
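The detection step can be sketched in a few lines of Python. This is only an illustration of the idea of comparing each field against a control that says what should be there; the field names and regular expressions are hypothetical, not taken from any real system:

```python
import re

# Hypothetical "control" rules: a regex describing what each field should look like.
RULES = {
    "name": re.compile(r"^[A-Za-z][A-Za-z .'-]*$"),
    "postcode": re.compile(r"^[A-Z0-9 ]{3,10}$"),
    "amount": re.compile(r"^\d+(\.\d{1,2})?$"),
}

def find_dirty(record):
    """Return the fields whose value is missing or fails its control rule."""
    dirty = []
    for field, pattern in RULES.items():
        value = record.get(field, "")
        if not value or not pattern.match(value):
            dirty.append(field)
    return dirty

record = {"name": "J0hn Sm1th", "postcode": "AB1 2CD", "amount": "$1,000"}
print(find_dirty(record))  # ['name', 'amount']
```

Once every record has been scored like this, setting a toleration level becomes a matter of deciding what proportion of flagged records, and in which fields, you are prepared to live with.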
The techniques involved in data cleansing are similar whether you are dealing with spam, a database, or an XML document.

Solving the problem. First you have to identify the dirty data; logically, you cannot deal with a problem until you know it exists. The most straightforward approach is to scan the message, document, or database for characters that do not belong there or are missing. This involves parsing the content and comparing it against a control that indicates what should be there. Once this is complete you can evaluate the problem. Decisions then have to be made about the scale of the problem and the counter-measures to be actioned. Database administrators know there is some dirty data in every database, but they have to weigh the cost-effectiveness of detection and removal. All databases have critical and non-critical areas; bad data in a critical area is worth the expense of cleaning, but whose budget is going to pay to weed the bad data out of a non-critical area? A truly critical time to run a data cleansing exercise is prior to data mining. The first step then is to analyse the quality of the data, because large amounts of aggregated data can be significantly affected by relatively small amounts of dirty data.

Examples of dirty data. Common causes of dirty data are:
Most of the problems come when working with text or Excel files. Life is much easier when the data source is an ODBC-compliant database; however, there are still some potential problems. Imagine that you are loading orders from different countries into your Oracle data warehouse. Part of the data comes from text files, part from MS Excel files, and some of the data arrives over a direct ODBC connection to the source database. Some files are the result of manual consolidation of multiple files. The data warehouse table definition is:
Every country has a different format for the ORDER_DATE and AMOUNT fields. This situation is all too familiar to many ETL consultants.
In order to load the data, we need to make sure that the format of the AMOUNT and ORDER_DATE fields is consistent.
For the AMOUNT field we need to get rid of dollar signs, pound signs, and commas.
This can easily be done by using the replace function of Advanced ETL Processor.
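Outside the tool, the same cleanup can be sketched in plain Python. The function name and the list of characters to strip are assumptions for illustration; they mirror the dollars, pounds, and commas mentioned above:

```python
def clean_amount(raw):
    """Strip currency symbols, thousands separators, and whitespace from an amount."""
    cleaned = raw.strip()
    for junk in ("$", "£", ","):  # the symbols called out in the text
        cleaned = cleaned.replace(junk, "")
    return float(cleaned)

print(clean_amount("$1,234.50"))  # 1234.5
print(clean_amount("£99"))        # 99.0
```

A real loader would also need to handle decimal commas (e.g. "1.234,50" in parts of Europe), which is exactly why per-country formats cause so much trouble.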
For the ORDER_DATE field we will apply multiple date formats.
The result of the Date Format function is a string in 'YYYY-MM-DD HH:NN:SS.ZZZ' format.
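The same idea, trying a list of candidate formats in turn and emitting the target string, can be sketched in Python. The list of input formats is an assumption for illustration; a real country-by-country configuration would differ:

```python
from datetime import datetime

# Candidate input formats (illustrative; real per-country lists will differ)
FORMATS = ("%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d", "%d.%m.%Y")

def parse_order_date(raw):
    """Try each format in turn; return 'YYYY-MM-DD HH:MM:SS.mmm' or None."""
    for fmt in FORMATS:
        try:
            dt = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
        # %f gives microseconds; keep the first three digits for milliseconds
        return dt.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
    return None  # no format matched: flag the row as dirty

print(parse_order_date("31/12/2024"))  # 2024-12-31 00:00:00.000
```

Note that the order of the formats matters: a date like "01/02/2024" is ambiguous between day-first and month-first, so ambiguous sources should be assigned a single known format rather than guessed.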
Result of the transformation:
This is just a small example of how Advanced ETL Processor can help you validate and transform data.
Advanced ETL Processor is an ETL tool designed to automate extracting data from any database, then transforming and validating it and loading it into any database. A typical use would be to extract data from an Excel file, validate date formats, sort the data, deduplicate it, load it into an Oracle database, and run a stored procedure or SQL script once loading is complete. Unlike Oracle SQL*Loader, BCP, DTS, or SSIS, Advanced ETL Processor can also add new records and update old ones based on a primary key.
About the author: Mike has more than 15 years of experience designing and implementing data warehouses based on Oracle, MS SQL Server, MySQL, PostgreSQL and more. He is currently working for DB Software Laboratory.