Data cleansing, or data cleaning, is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleansing may be performed interactively with data wrangling tools, or as batch processing through scripting.
After cleansing, a data set should be consistent with other similar data sets in the system. The inconsistencies detected or removed may have originally been caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry, and is performed at the time of entry rather than on batches of data.
The actual data cleansing process may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records). Some data cleansing solutions clean data by cross-checking it against a validated data set. A common data cleansing practice is data enhancement, where data is made more complete by adding related information; for example, appending to an address the phone number associated with that address. Data cleansing can also involve activities such as data harmonization and data standardization. An example is the harmonization of short codes (st, rd, etc.) with the actual words (street, road, etc.). Data standardization is a means of changing a reference data set to a new standard, for example, use of standard codes.
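As an illustration of harmonization and standardization, here is a minimal sketch using the pandas library (mentioned under Tools below); the sample addresses and the abbreviation mapping are hypothetical and much simpler than a real standardization table would be.

```python
import pandas as pd

# Hypothetical records with inconsistently abbreviated street types.
addresses = pd.DataFrame({
    "street": ["12 Oak St", "99 Elm Rd.", "7 Maple Street", "3 Birch rd"],
})

# Illustrative mapping of short codes to standardized full words.
abbreviations = {
    r"\bst\b\.?": "Street",
    r"\brd\b\.?": "Road",
    r"\bave\b\.?": "Avenue",
}

standardized = addresses["street"]
for pattern, full_word in abbreviations.items():
    # Replace each abbreviation (with or without a trailing dot) by the full word.
    standardized = standardized.str.replace(pattern, full_word, case=False, regex=True)

addresses["street_standardized"] = standardized
print(addresses)
```

Applied to the sample column, the mapping leaves already-standardized values such as "Maple Street" untouched and rewrites only the abbreviated ones; a real-world mapping would need to guard against ambiguous abbreviations (for example "St" for "Saint").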
Motivation
Administratively, incorrect or inconsistent data can lead to false conclusions and misdirected investments on both public and private scales. For instance, a government may want to analyse population census figures to decide which regions require further spending and investment on infrastructure and services. In this case, it is important to have access to reliable data to avoid erroneous fiscal decisions. In the business world, incorrect data can be costly. Many companies use customer information databases that record data such as contact information, addresses, and preferences. For instance, if the addresses are inconsistent, the company will bear the cost of re-sending mail or even losing customers. The forensic accounting and fraud investigation professions use data cleansing in preparing their data, usually before the data is sent to a data warehouse for further investigation. There are packages available that can clean or wash address data as it is entered into a system; this is usually done through an application programming interface (API).
Data quality
The term integrity encompasses accuracy, consistency, and some aspects of validation (see also data integrity) but is rarely used by itself in data-cleansing contexts because it is insufficiently specific. (For example, "referential integrity" is a term used to refer to the enforcement of foreign-key constraints.)
Process
- Data auditing : The data is audited with the use of statistical and database methods to detect anomalies and contradictions; this eventually indicates the characteristics of the anomalies and their locations. Several commercial software packages will let you specify constraints of various kinds (using a grammar that conforms to that of a standard programming language, e.g., JavaScript or Visual Basic) and then generate code that checks the data for violation of these constraints. This process is referred to below in the bullets "workflow specification" and "workflow execution." For users who lack access to high-end cleansing software, microcomputer database packages such as Microsoft Access or FileMaker Pro will also let you perform such checks, on a constraint-by-constraint basis, interactively with little or no programming required in many cases; a minimal sketch of such a check appears after this list.
- Workflow specification : The detection and removal of anomalies is performed by a sequence of operations on the data known as the workflow. It is specified after the process of auditing the data and is crucial in achieving the end product of high-quality data. In order to achieve a proper workflow, the causes of the anomalies and errors in the data have to be closely considered.
- Workflow execution : In this stage, the workflow is executed after its specification is complete and its correctness is verified. The implementation of the workflow should be efficient, even on large sets of data, which inevitably poses a trade-off because the execution of a data-cleansing operation can be computationally expensive.
- Post-processing and controlling : After executing the cleansing workflow, the results are inspected to verify correctness. Data that could not be corrected during execution of the workflow is manually corrected, if possible. The result is a new cycle in the data-cleansing process where the data is audited again to allow the specification of an additional workflow to further cleanse the data by automatic processing.
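For illustration, the kind of constraint checks performed during data auditing can be approximated in a few lines of Python with pandas; the order table, the rule names, and the list of known country codes below are assumptions made for the example, not part of any particular tool.

```python
import pandas as pd

# Illustrative order records; in practice these would come from the source system.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "quantity": [5, -2, 0, 12],
    "country":  ["DE", "FR", None, "XX"],
})

KNOWN_COUNTRIES = {"DE", "FR", "US"}  # hypothetical reference list

# Each rule is a named boolean check; True marks a violating row.
rules = {
    "quantity_not_positive": orders["quantity"] <= 0,
    "country_missing": orders["country"].isna(),
    "country_unknown": ~orders["country"].isin(KNOWN_COUNTRIES) & orders["country"].notna(),
}

# Report which records violate each rule, giving an indication of the anomalies
# and their locations, as in the auditing step described above.
for rule_name, violations in rules.items():
    print(rule_name, orders.loc[violations, "order_id"].tolist())
```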
Good quality source data has to do with a "data quality culture" and must be initiated at the top of the organization. It is not just a matter of implementing strong validation checks on input screens, because almost no matter how strong these checks are, they can often be circumvented by users. There is a nine-step guide for organizations that wish to improve data quality:
- Declare a high-level commitment to a data quality culture
- Drive process re-engineering at the executive level
- Spend money to improve the data entry environment
- Spend money to improve application integration
- Spend money to change how processes work
- Promote end-to-end team awareness
- Promote interdepartmental cooperation
- Publicly celebrate data quality excellence
- Continuously measure and improve data quality
Others include:
- Parsing : for the detection of syntax errors. A parser decides whether a string of data is acceptable within the allowed data specification. This is similar to the way a parser works with grammars and languages.
- Data transformation : Data transformation allows the mapping of data from its given format into the format expected by the appropriate application. This includes value conversions or translation functions, as well as normalizing numeric values to conform to minimum and maximum values.
- Duplicate elimination : Duplicate detection requires an algorithm for determining whether data contains duplicate representations of the same entity. Usually, data is sorted by a key that brings duplicate entries closer together for faster identification.
- Statistical methods : By analysing the data using the values of mean, standard deviation, range, or clustering algorithms, it is possible for an expert to find values that are unexpected and thus erroneous. Although the correction of such data is difficult since the true value is not known, it can be resolved by setting the values to an average or other statistical value. Statistical methods can also be used to handle missing values, which can be replaced by one or more plausible values, usually obtained by extensive data augmentation algorithms. A short sketch illustrating duplicate elimination and a simple statistical check follows this list.
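The following minimal sketch (using pandas) illustrates duplicate elimination by sorting on a key, together with a simple mean/standard-deviation check; the customer table and the two-standard-deviation threshold are illustrative assumptions rather than recommended settings.

```python
import pandas as pd

# Hypothetical customer table containing one duplicate entry and one suspicious age.
customers = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com",
              "d@x.com", "e@x.com", "f@x.com", "g@x.com"],
    "age":   [34, 34, 29, 31, 27, 33, 30, 420],
})

# Duplicate elimination: sort by the key so duplicate entries end up next to each
# other, then keep only the first occurrence of each key value.
deduplicated = customers.sort_values("email").drop_duplicates(subset="email", keep="first")

# Statistical method: flag values far from the mean (two standard deviations is an
# illustrative cut-off, not a universal rule).
mean, std = deduplicated["age"].mean(), deduplicated["age"].std()
outliers = deduplicated[(deduplicated["age"] - mean).abs() > 2 * std]

print(deduplicated)
print(outliers)  # the age of 420 is flagged for expert review
```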
System
The essential job of this system is to find a suitable balance between fixing dirty data and maintaining the data as close as possible to the original data from the source production system. This is a challenge for the extract, transform, load (ETL) architect. The system should offer an architecture that can cleanse data, record quality events, and measure/control the quality of data in the data warehouse. A good start is to perform a thorough data profiling analysis that will help define the required complexity of the data-cleansing system and also give an idea of the current data quality in the source system(s).
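A lightweight data profiling pass of the kind suggested above might look like the following sketch (using pandas); the metrics shown, such as null counts, distinct counts, and numeric ranges, are common profiling measures, and the sample table is hypothetical.

```python
import pandas as pd

# Hypothetical extract from a source system, profiled before designing cleansing rules.
source = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5],
    "signup_date": ["2021-01-03", "2021-02-30", None, "2021-03-15", "2021-04-01"],
    "revenue":     [120.0, 95.5, None, -40.0, 300.2],
})

# Per-column profile: data type, missing values, distinct values, numeric range.
profile = pd.DataFrame({
    "dtype":    source.dtypes.astype(str),
    "nulls":    source.isna().sum(),
    "distinct": source.nunique(),
    "min":      source.min(numeric_only=True),
    "max":      source.max(numeric_only=True),
})
print(profile)

# Dates are profiled separately: coerce to datetime and count values that fail to parse.
parsed = pd.to_datetime(source["signup_date"], errors="coerce")
print("unparseable or missing dates:", parsed.isna().sum())
```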
Tools
There are many data-cleansing tools, such as Trifacta, OpenRefine, Paxata, Alteryx, and others. It is also common to use libraries such as Pandas for Python or dplyr for R.
One example of data cleansing for distributed systems under Apache Spark is called Optimus, an open-source framework for laptops or clusters that allows pre-processing, cleansing, and exploratory data analysis. It includes several data wrangling tools.
Quality screen
Part of the data-cleansing system is a set of diagnostic filters known as quality screens. Each of them implements a test in the data flow that, if it fails, records an error in the error event schema. Quality screens are divided into three categories:
- Column screens. These test the individual column, e.g. for unexpected values such as NULL values, non-numeric values that should be numeric, or out-of-range values.
- Structure screens. These are used to test the integrity of different relationships between columns (typically foreign/primary keys) in the same or different tables. They are also used to test that a group of columns is valid according to some structural definition to which it should adhere.
- Business rule screens. The most complex of the three tests. They test whether the data, perhaps across multiple tables, follows specific business rules. An example is that if a customer is marked as a certain type of customer, the business rules that define this kind of customer can be checked.
When a quality screen records an error, it can either stop the dataflow process, send the faulty data somewhere other than the target system, or tag the data. The last option is considered the best solution, because the first option requires that somebody has to manually deal with the issue each time it occurs, and the second implies that data are missing from the target system (integrity), and it is often unclear what should happen to these data.
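A column screen of the kind just described can be sketched in a few lines; in this hypothetical example (using pandas), failing rows are tagged rather than halted or diverted, and each failure is appended to a plain list standing in for the error event schema discussed below. The function, table, and column names are illustrative assumptions.

```python
import pandas as pd
from datetime import datetime, timezone

def column_screen(df, column, predicate, screen_name, error_events):
    """Apply a column screen: tag failing rows and append one error event per failure."""
    failed = ~df[column].apply(predicate).astype(bool)
    for idx in df[failed].index:
        error_events.append({
            "screen": screen_name,
            "column": column,
            "row": int(idx),
            "occurred_at": datetime.now(timezone.utc).isoformat(),
        })
    # Tag the data rather than stopping the flow or diverting the failing rows.
    df.loc[failed, "quality_flags"] = df.loc[failed, "quality_flags"].fillna("") + screen_name + ";"
    return df

orders = pd.DataFrame({"amount": [10.0, None, -3.0, 25.0]})
orders["quality_flags"] = None
error_events = []

orders = column_screen(orders, "amount",
                       lambda v: pd.notna(v) and v >= 0,
                       "amount_non_negative", error_events)
print(orders)        # rows with NULL or negative amounts carry a quality flag
print(error_events)  # one recorded event per failing row
```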
Criticism of existing tools and processes
Most data cleaning tools have limitations in usability:
- Project costs : costs are typically in the hundreds of thousands of dollars
- Time : mastering large-scale data-cleansing software is time-consuming
- Security : cross-validation requires sharing information, giving an application access across systems, including sensitive legacy systems
Error event schema
The error event schema holds records of all error events thrown by the quality screens. It consists of an error event fact table with foreign keys to three dimension tables that represent the date (when), the batch job (where), and the screen (which produced the error). It also holds information about exactly when the error occurred and the severity of the error. In addition, there is an error event detail fact table with a foreign key to the main table that contains detailed information about in which table, record, and field the error occurred, along with the error condition.
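The star-schema layout described above can be sketched with a few SQL statements; this snippet uses Python's standard-library sqlite3 module, and the table and column names are simplified assumptions rather than a canonical definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension tables: when the error occurred, which batch job produced it,
# and which quality screen raised it.
conn.executescript("""
CREATE TABLE dim_date      (date_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE dim_batch_job (batch_key INTEGER PRIMARY KEY, job_name TEXT);
CREATE TABLE dim_screen    (screen_key INTEGER PRIMARY KEY, screen_name TEXT, severity INTEGER);

-- Fact table: one row per error event, with the exact timestamp and severity.
CREATE TABLE fact_error_event (
    error_event_key INTEGER PRIMARY KEY,
    date_key    INTEGER REFERENCES dim_date(date_key),
    batch_key   INTEGER REFERENCES dim_batch_job(batch_key),
    screen_key  INTEGER REFERENCES dim_screen(screen_key),
    occurred_at TEXT,
    severity    INTEGER
);

-- Detail fact table: which table, record and field the error occurred in.
CREATE TABLE fact_error_event_detail (
    error_event_key INTEGER REFERENCES fact_error_event(error_event_key),
    table_name      TEXT,
    record_id       TEXT,
    field_name      TEXT,
    error_condition TEXT
);
""")
conn.close()
```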
Challenges and problems
- Error correction and loss of information : The most challenging problem within data cleansing remains the correction of values to remove duplicates and invalid entries. In many cases, the available information on such anomalies is limited and insufficient to determine the necessary transformations or corrections, leaving the deletion of such entries as the primary solution. The deletion of data, though, leads to loss of information; this loss can be particularly costly if a large amount of data is deleted.
- Maintenance of cleansed data : Data cleansing is an expensive and time-consuming process. So after having performed data cleansing and achieved a data collection free of errors, one would want to avoid re-cleansing the data in its entirety after some values in the data collection change. The process should only be repeated on values that have changed; this means that a cleansing lineage needs to be kept, which requires efficient data collection and management techniques (see the sketch after this list).
- Data cleansing in virtually integrated environments : In virtually integrated sources such as IBM's DiscoveryLink, the cleansing of data has to be performed every time the data is accessed, which considerably increases the response time and lowers efficiency.
- Data-cleansing framework : In many cases, it will not be possible to derive a complete data-cleansing graph to guide the process in advance. This makes data cleansing an iterative process involving significant exploration and interaction, which may require a framework in the form of a collection of methods for error detection and elimination in addition to data auditing. This can be integrated with other data-processing stages such as integration and maintenance.
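As one possible approach to the maintenance problem noted above, a fingerprint of each record from the previous cleansing run can be stored so that only changed records are re-cleansed; the hashing scheme and the sample table below are hypothetical.

```python
import hashlib
import pandas as pd

def fingerprint(values):
    """Stable hash of a record's raw values, used to detect changes since the last run."""
    raw = "|".join(str(v) for v in values)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Hypothetical current extract and the fingerprints stored after the previous cleansing run.
current = pd.DataFrame({"id": [1, 2, 3], "city": ["Berlin", "Pariss", "Rome"]})
previous_fingerprints = {
    1: fingerprint([1, "Berlin"]),
    2: fingerprint([2, "Paris"]),
    3: fingerprint([3, "Rome"]),
}

# Only records whose fingerprint differs from the stored one need to be cleansed again.
changed = current[current.apply(
    lambda row: fingerprint(row) != previous_fingerprints.get(row["id"]), axis=1)]
print(changed)  # only id 2 ("Pariss") has changed and is re-cleansed
```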
See also
- Data editing
- Data mining
- Record linkage
- Single customer view
External links
- Computerworld: Data Scrubbing (February 10, 2003)
- Erhard Rahm, Hong Hai Do: Data Cleaning: Problems and Current Approaches