Garbage In, Insights Out: Making Sense Of Messy Data With Querri
Unlock the hidden potential of your chaotic datasets with Querri's advanced data cleaning and preparation tools.
In the bustling world of data analytics, the term "messy data" is a frequent topic of discussion, often evoking a collective groan amongst data professionals. Messy data refers to information that is incomplete, inaccurate, inconsistent, or poorly formatted. Picture a customer database where some entries lack email addresses, others have incorrectly formatted phone numbers, and still others contain duplicate records. This chaos encompasses a range of issues from typos and missing values to inconsistent formats and duplications. Messy data lurks in every industry, ready to derail your best-laid plans.
According to Gartner, "Every year poor data quality costs organizations an average of $12.9M." This figure includes costs related to rework, fines, and lost business opportunities.
So, why does messy data occur in the first place? The breakdown can happen at various stages of data collection and processing. One common source is manual data entry errors, where human mistakes introduce inaccuracies. Another culprit is the integration of data from multiple sources, each with its own standards and formats, leading to inconsistencies. Additionally, legacy systems with outdated data storage methods can contribute to the problem. For example, merging old and new customer records might result in multiple entries for the same person due to variations in name spelling or address formatting. In short, messy data is the unfortunate lovechild of human error and technological evolution.
While messy data is a common occurrence in any industry that uses data, this article digs deeper into the challenges it poses in the Supply Chain and Logistics, Manufacturing, and Revenue Operations industries.
In supply chain and logistics, messy data can be particularly disruptive. The integration of information from suppliers, transport providers, warehouses, and retailers, each with different formats and standards, often leads to inconsistencies. For example, a logistics company might receive inaccurate shipment records, leading to incorrect inventory levels and causing either stockouts or excess inventory—both costly scenarios.
DHL has estimated that poor data quality in the supply chain and logistics sector can lead to a loss of around $1.3 billion annually. This figure includes costs associated with inefficiencies, delays, and mismanagement of inventory.
A notable example is the 2013 case of Target's failed expansion into Canada, where data discrepancies led to empty shelves in stores and overstocked warehouses. Target attempted to consolidate its data from various sources, including manual entry of hundreds of thousands of items. Much of the data was entered incorrectly: widths were entered instead of lengths, and product names and descriptions were riddled with typos and incomplete information. The inaccurate data resulted in poor inventory management. The fallout was catastrophic, leading to a $2 billion loss and Target's eventual withdrawal from the Canadian market.
Messy data is equally problematic in manufacturing. Data inaccuracies in production schedules, inconsistencies in quality control records, and poorly formatted supply chain data can all lead to significant disruptions.
A survey by LNS Research found that 20-30% of a manufacturing company's total revenue can be wasted due to poor data management practices. This includes costs related to production inefficiencies, scrap and rework, and compliance issues.
For instance, Toyota's 2010 recall of over 10 million vehicles due to faulty accelerators was partly due to inconsistent and inaccurate manufacturing data. Quality control records and production data failed to flag the defects early, resulting in widespread safety issues. This not only caused a massive financial hit but also severely damaged the company's reputation. Similarly, in 1999 NASA lost the $125M Mars Climate Orbiter because of a mismatch between units of measurement. The navigation team at the Jet Propulsion Laboratory (JPL) expected thruster data in metric units (newton-seconds), while Lockheed Martin Astronautics, which designed and built the spacecraft, provided it in English units (pound-force seconds). This discrepancy caused a severe navigation error that pushed the spacecraft dangerously close to the planet's atmosphere, where it broke apart and burned up.
In revenue operations, messy data can obscure insights and disrupt alignment between marketing, sales, and customer success teams. Inaccuracies in customer information, inconsistent sales figures, and missing data in marketing analytics can derail strategic initiatives and lead to inaccurate revenue forecasts.
IBM reported that companies with poor data quality can lose up to 12% of their revenue. This includes missed sales opportunities, inaccurate revenue forecasting, and inefficient marketing spend.
A prime example is Facebook's inflated video-viewing metrics, disclosed in 2016. The company had reported significantly higher average video view times due to a calculation error, misleading advertisers about the effectiveness of their campaigns. This messy data issue led to lawsuits, a 2019 settlement, and a credibility hit for Facebook.
To tackle messy data, organizations can adopt several strategies, starting with the right tooling.
There is a wide variety of data cleaning tools available on the market today, and many do a decent job. But these tools tend to be cost-prohibitive, complex to set up, or saddled with a steep learning curve. More accessible tools such as Excel, SQL, and Python make data cleaning approachable, but while powerful, they come with inherent challenges: scalability limits, risk of human error, complexity, performance issues, and maintenance are common hurdles data professionals must navigate with these tools.
Excel is not designed to handle very large datasets efficiently. When working with extensive data, Excel can become slow, unresponsive, and prone to crashes. Data cleaning in Excel often involves a lot of manual work, which increases the risk of human error. Repetitive tasks can become tedious and time-consuming without advanced automation capabilities. Complex data transformations and validations can be cumbersome to implement in Excel.
Python is a versatile and powerful programming language, but it comes with a steep learning curve. Writing efficient data-cleaning scripts requires proficiency in Python programming and familiarity with libraries like Pandas and NumPy. Python can also suffer from performance issues when dealing with very large datasets, and, as with any programmatic tool, code and dependency maintenance can become a barrier to efficient data cleaning and management efforts.
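To make the Python approach concrete, here is a minimal sketch of a typical cleaning pass using Pandas. The dataset and column names are hypothetical, invented to illustrate the kinds of issues described earlier (duplicates, inconsistent phone formats, missing emails); they are not from any real example in this article.

```python
import pandas as pd

# Hypothetical customer records showing common messy-data issues:
# duplicate entries, inconsistent phone formats, and missing emails.
df = pd.DataFrame({
    "name": ["Ann Lee", "Ann Lee", "Bob Cruz", "Dana Fox"],
    "email": ["ann@example.com", "ann@example.com", None, "dana@example.com"],
    "phone": ["(555) 123-4567", "555-123-4567", "5551234567", None],
})

# Standardize phone numbers by stripping all non-digit characters.
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)

# Drop rows that are exact duplicates after standardization.
df = df.drop_duplicates()

# Flag rows with missing contact information for manual review.
df["needs_review"] = df["email"].isna() | df["phone"].isna()
```

Note that even this small script demands familiarity with regular expressions and the Pandas API, which is exactly the learning-curve cost described above.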
Using Querri to make data cleaning a breeze
Querri, a natural-language-powered data management tool, democratizes data cleaning and management by letting users describe tasks in plain English. With simple, natural-language prompts, data professionals can accomplish complex data-cleaning tasks that would typically require Excel wizardry or programming skills. Querri gives users the power to talk with their spreadsheets!
The Querri team performed an experiment to compare the time and effort required for a simple data-cleaning task: converting and standardizing units of measurement, from centimeters to feet and from kilograms to pounds, using both Excel and Querri. While Excel required anywhere from 15 to 20 steps and a good 20-25 minutes of formula writing, the same task took only a couple of minutes with a single prompt in Querri!
“Querri prompt: Convert Height and Weight to Feet and Pounds.”
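For context on what that prompt replaces, the underlying conversion is straightforward arithmetic. A hypothetical Python version of the same task (the column names and sample values here are assumptions for illustration, not Querri's internals) might look like:

```python
import pandas as pd

# Hypothetical dataset with metric measurements.
df = pd.DataFrame({
    "Height_cm": [170.0, 182.5, 158.0],
    "Weight_kg": [65.0, 90.2, 52.4],
})

CM_PER_FOOT = 30.48        # exact: 1 ft = 30.48 cm
KG_PER_POUND = 0.45359237  # exact: 1 lb = 0.45359237 kg

# Convert and round to a readable precision.
df["Height_ft"] = (df["Height_cm"] / CM_PER_FOOT).round(2)
df["Weight_lb"] = (df["Weight_kg"] / KG_PER_POUND).round(1)
```

The arithmetic is trivial; the friction comes from remembering conversion factors, writing formulas or code, and applying them consistently across every row, which is the busywork the natural-language prompt removes.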
There are several benefits to using Querri’s natural language capabilities for data cleaning.
But don't take our word for it. Try it for yourself and witness how the power of your words can help you become the Marie Kondo of data cleaning.