Data Quality Issues in Big Data
What is Big Data?
Big data refers to the large volumes of structured and unstructured data that inundate a business on a day-to-day basis. This data can be examined for insights that lead to better decisions and strategic business moves. The term "big data" is relatively new, but the practice of gathering and storing large amounts of information for analysis has been going on for ages. Big data gained momentum in recent years when it came to be defined in terms of the three V's:
Volume: The flexibility offered by newly available technologies has made it far easier for organizations to store the huge amounts of data they collect. Organizations gather this data from a variety of sources, such as social media, sensors, and business transactions.
Variety: The data collected comes in numerous formats, including structured and unstructured data, audio, video, financial records, and many more.
Velocity: Data arrives at high speed and must be accessed and processed in a timely manner.
Importance of Big Data
The importance of big data does not lie in the amount of data that is stored but in what can be achieved with it. Many insights can be drawn by carefully examining the data that has been collected. Companies use this data to find answers that enable smarter decision making, new product development, optimized offers, and reductions in cost and time. When analytics is combined with big data, many business tasks can be accomplished, including the following:
Ø Promotions can be targeted only at users who are interested in such products.
Ø Risks can be calculated within a very short time span.
Ø Fraudulent behavior can be detected before it causes losses to the business.
Ø Coupons can be generated based on customers' buying habits.
Ø The root causes of failures can be identified.
How Does Big Data Work?
Data can be collected from various sources, and it eventually falls into one of the three groups below:
Streaming Data: Data generated while people use the internet and connected devices falls under this category. As the data reaches the IT systems, decisions must be made about what should be stored for further analysis and what should be discarded (a minimal sketch of such a keep-or-discard filter follows this list).
Social Media Data: Data generated by interactions on various social media platforms falls under this group. It is particularly helpful for sales purposes, but it is often unstructured and has its own limitations.
Publicly Available Sources: Large amounts of data are available from open data sources, including public forums, public websites, and so on.
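As a minimal sketch of the keep-or-discard decision mentioned under Streaming Data above, the code below filters incoming events against a simple retention rule. The event fields and the rule itself are assumptions made purely for illustration and are not tied to any specific streaming platform.

# Minimal sketch: deciding which streaming events to keep for later analysis.
# The event structure and the retention rule are illustrative assumptions.
from typing import Iterable, Iterator

REQUIRED_FIELDS = {"user_id", "event_type", "timestamp"}   # assumed schema
RELEVANT_EVENT_TYPES = {"purchase", "search", "click"}     # assumed business rule

def filter_stream(events: Iterable[dict]) -> Iterator[dict]:
    """Yield only events that are complete and relevant enough to store."""
    for event in events:
        if not REQUIRED_FIELDS.issubset(event):
            continue                    # drop incomplete records
        if event["event_type"] not in RELEVANT_EVENT_TYPES:
            continue                    # drop event types the business does not analyze
        yield event

# Example usage with a small in-memory "stream":
incoming = [
    {"user_id": 1, "event_type": "purchase", "timestamp": "2024-01-01T10:00:00"},
    {"user_id": 2, "event_type": "heartbeat", "timestamp": "2024-01-01T10:00:01"},
    {"event_type": "click"},            # incomplete record
]
stored = list(filter_stream(incoming))  # keeps only the first event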
After collecting data from these various sources, decisions can be made about how to handle it, including the following:
Ø How will the data be stored and managed?
Ø How much of the available data will be analyzed, and to what depth?
Ø How will any insights you discover be put to use?
Challenges of Data Quality in Big Data
By definition, big data is collected from a wide range of sources and stored so that it can later be analyzed for the benefit of the company. Business strategies can then be developed and implemented based on that analysis. It is estimated that data quality problems in big data cost US businesses around $600 billion per year. The key issue is that data quality degrades over time. Experts believe that data in customer records becomes outdated and can no longer be relied on; this may happen for various reasons, such as changes in customers' circumstances, for example when they marry or move elsewhere. There may also be data entry errors, system-generated errors, errors introduced during system migration, and so on. Regardless of the cause, it is the company's responsibility to treat data as a strategic asset, to develop a program to manage data quality, and to hire data quality professionals.
Two of the most common consequences of poor data quality are loss of confidence in the system and the extra time required to reconcile data. Other issues include customer dissatisfaction, delays in developing new systems, and revenue losses.
Defective data causes a long list of problems. When the data is riddled with inconsistencies, companies cannot make sound decisions or even form an accurate picture of what is happening in their own systems. In such situations they end up relying on other organizations, which is dangerous in a fast-moving market where competitors will readily take advantage. End users may lose confidence in the system when they are unable to reconcile the data between the data warehouse and the source system. Many companies report additional costs from lost revenue, lost discounts, excess inventory, and delays in developing and deploying new systems as they try to recover from the financial losses caused by data quality issues. Poor data quality has even undermined strategic plans and projects.
Managing Data Quality in Big Data
A company might get all the pieces in place to handle today's data quality issues, only for new problems to arise tomorrow, and the process continues. Managing data quality is therefore a never-ending process. The main reason is that customer requirements, business rules, expectations, and business processes keep changing from day to day. To safeguard high-quality data, companies need to strengthen their commitment to data quality management principles and develop processes and programs that reduce data defects.
Data Quality Methodology:
1. Launch a Data Quality Program: Delivering high-quality data requires high-quality performers, so top management needs to take responsibility for hiring the right executives.
2. Develop a Project Plan: A detailed plan needs to be developed, and sometimes a series of plans is needed. The complete project plan should define the scope of the activity, set goals, identify actions, and measure and monitor success.
3. Build a Data Quality Team: To implement the data quality plan, the right team is needed and resources must be hired. The team should comprise a Chief Quality Officer, Data Steward, Subject Matter Expert, Data Quality Leader, Data Quality Analyst, Data Quality Trainer, and Process Improvement Facilitator.
4. Review Business Processes and Data Architecture: Once the plan is developed, a senior management representative needs to review the entire business process for collecting, recording, and using data in the subject areas defined by the scope document. The system architecture that supports these business practices and information flows also needs to be evaluated.
5. Assess Data Quality (Data Auditing): After reviewing the business processes, the organization needs to undertake a proper assessment of data quality. The key aims of this assessment are to identify common data defects, create metrics to detect them, and define rules to fix them (a minimal audit sketch follows this list).
6. Clean the Data: The data cleaning work begins once the audit is complete. To minimize costs, defects need to be detected as early as possible. There are four basic methods for cleaning data: correct, filter, detect and report, and prevent (illustrated in the second sketch after this list).
7. Monitor Data: To monitor data quality, companies need to build a program that audits data at regular intervals, or just before or after data is loaded into another system such as a data warehouse (see the final sketch after this list).
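As a rough illustration of step 5, the sketch below computes a few defect metrics over a simple list of customer records. The column names (customer_id, email) and the validation rules are assumptions made for illustration, not prescribed by any particular tool.

# Minimal audit sketch for step 5: count common defects and report them as metrics.
# Column names and validation rules are illustrative assumptions.
import re

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def audit(records: list[dict]) -> dict:
    """Return simple data quality metrics for a list of customer records."""
    total = len(records)
    missing_email = sum(1 for r in records if not r.get("email"))
    bad_email = sum(1 for r in records
                    if r.get("email") and not EMAIL_PATTERN.match(r["email"]))
    duplicate_ids = total - len({r.get("customer_id") for r in records})
    return {
        "total_records": total,
        "missing_email_rate": missing_email / total if total else 0.0,
        "invalid_email_rate": bad_email / total if total else 0.0,
        "duplicate_id_count": duplicate_ids,
    }

# Example: one duplicate id, one invalid email, one missing email.
metrics = audit([
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 1, "email": "not-an-email"},
    {"customer_id": 2, "email": ""},
])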
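The four cleaning methods named in step 6 (correct, filter, detect and report, prevent) might be sketched as follows; the field names and rules are again illustrative assumptions rather than a fixed standard.

# Sketch of the four cleaning methods from step 6: correct, filter,
# detect and report, and prevent. Field names and rules are assumptions.

def correct(record: dict) -> dict:
    """Correct: fix defects in place where the right value can be inferred."""
    record["email"] = (record.get("email") or "").strip().lower()
    return record

def keep(record: dict) -> bool:
    """Filter: drop records that are too defective to repair."""
    return bool(record.get("customer_id"))

def detect_and_report(records: list[dict]) -> list[str]:
    """Detect and report: flag suspect records for a data steward to review."""
    return [f"record {i}: missing email"
            for i, r in enumerate(records) if not r.get("email")]

def prevent(record: dict) -> None:
    """Prevent: reject bad input at the point of entry instead of fixing it later."""
    if not record.get("customer_id"):
        raise ValueError("customer_id is required")

# Example usage on a tiny batch:
raw = [{"customer_id": 1, "email": " A@Example.COM "}, {"email": ""}]
cleaned = [correct(r) for r in raw if keep(r)]   # second record is filtered out
issues = detect_and_report(cleaned)              # empty: remaining record is complete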
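Finally, step 7 can be sketched as a quality gate that audits a batch just before it is loaded into a warehouse and blocks the load when the metrics exceed an agreed threshold. The thresholds and the load_to_warehouse placeholder are assumptions, and audit() refers to the helper from the step 5 sketch above.

# Sketch for step 7: gate a warehouse load on the audit metrics.
# Thresholds and load_to_warehouse are illustrative assumptions;
# audit() is the function defined in the step 5 sketch.

THRESHOLDS = {
    "missing_email_rate": 0.05,
    "invalid_email_rate": 0.02,
    "duplicate_id_count": 0,
}

def check_quality(metrics: dict) -> list[str]:
    """Return a list of threshold violations; empty means the batch may be loaded."""
    return [
        f"{name}={metrics[name]} exceeds {limit}"
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0) > limit
    ]

def load_to_warehouse(records: list[dict]) -> None:
    """Placeholder for the real warehouse load (hypothetical, not a real API)."""
    pass

def load_if_clean(records: list[dict]) -> bool:
    """Audit a batch and load it only if it passes the quality gate."""
    violations = check_quality(audit(records))
    if violations:
        print("load blocked:", "; ".join(violations))
        return False
    load_to_warehouse(records)
    return True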
Conclusion
The further we move forward, the more important it will be for companies to invest in maintaining good quality data. Companies that manage their data as a strategic resource and invest in its quality are already pulling ahead of those that fail to do so, in terms of both reputation and profitability. Ultimately, it is the responsibility of companies to address data quality issues by following sound methodologies.