Tackling the Messy Data Problem using Simple, Natural Language

In the bustling world of data analytics, the term "messy data" is a frequent topic of discussion, often evoking a collective groan amongst data professionals. Messy data refers to information that is incomplete, inaccurate, inconsistent, or poorly formatted. Picture a customer database where some entries lack email addresses, others have incorrectly formatted phone numbers, and still others contain duplicate records. This chaos encompasses a range of issues from typos and missing values to inconsistent formats and duplications. Messy data lurks in every industry, ready to derail your best-laid plans.
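
To make this concrete, the minimal pandas sketch below surfaces the three kinds of problems just described; the table and its column names (email, phone) are invented for illustration, and the checks are deliberately simplistic.

```python
import pandas as pd

# Illustrative customer records; the data and column names are hypothetical.
customers = pd.DataFrame({
    "name":  ["Ann Lee", "Bob Roy", "Bob Roy", "Cara Diaz"],
    "email": ["ann@example.com", None, None, "cara@example"],
    "phone": ["555-010-2000", "5550102000", "5550102000", "(555) 010 3000"],
})

# 1. Missing values: entries that lack an email address.
missing_email = customers["email"].isna()

# 2. Inconsistent formats: phone numbers that don't match a simple NNN-NNN-NNNN pattern.
bad_phone = ~customers["phone"].str.match(r"^\d{3}-\d{3}-\d{4}$")

# 3. Duplicates: repeated name/phone combinations.
duplicate = customers.duplicated(subset=["name", "phone"])

print(customers.assign(missing_email=missing_email, bad_phone=bad_phone, duplicate=duplicate))
```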

According to Gartner, "Every year poor data quality costs organizations an average of $12.9M." This figure includes costs related to rework, fines, and lost business opportunities.

Why Does Messy Data Occur?

So, why does messy data occur in the first place? The breakdown can happen at various stages of data collection and processing. One common source is manual data entry errors, where human mistakes introduce inaccuracies. Another culprit is the integration of data from multiple sources, each with its own standards and formats, leading to inconsistencies. Additionally, legacy systems with outdated data storage methods can contribute to the problem. For example, merging old and new customer records might result in multiple entries for the same person due to variations in name spelling or address formatting. In short, messy data is the unfortunate lovechild of human error and technological evolution.

While messy data is a common occurrence in any industry that uses data, in this article we will dig deeper into messy data challenges in the Supply Chain and Logistics, Manufacturing, and Revenue Operations industries.

 

Messy Data in Supply Chain and Logistics

In supply chain and logistics, messy data can be particularly disruptive. The integration of information from suppliers, transport providers, warehouses, and retailers, each with different formats and standards, often leads to inconsistencies. For example, a logistics company might receive inaccurate shipment records, leading to incorrect inventory levels and causing either stockouts or excess inventory—both costly scenarios.

 

DHL has estimated that poor data quality in the supply chain and logistics sector can lead to a loss of around $1.3 billion annually. This figure includes costs associated with inefficiencies, delays, and mismanagement of inventory.

 

A notable example is the 2013 case of Target's failed expansion into Canada, where data discrepancies led to empty store shelves and overstocked warehouses. Target attempted to consolidate its data from various sources, including manual entry of hundreds of thousands of items. Much of that data was entered incorrectly: widths were entered instead of lengths, and product names and descriptions were laden with typos and incomplete information. The inaccurate data resulted in poor inventory management. The fallout was catastrophic, leading to a roughly $2 billion loss and Target's eventual withdrawal from the Canadian market.

 

Messy Data in Manufacturing

Messy data is equally problematic in manufacturing. Data inaccuracies in production schedules, inconsistencies in quality control records, and poorly formatted supply chain data can all lead to significant disruptions.

 

A survey by LNS Research found that 20-30% of a manufacturing company’s total revenue can be wasted due to poor data management practices. This includes costs related to production inefficiencies, scrap and rework, and compliance issues.

 

For instance, Toyota's 2010 recall of over 10 million vehicles for faulty accelerator pedals was driven in part by inconsistent and inaccurate manufacturing data. Quality control records and production data failed to flag the defects early, resulting in widespread safety issues. This not only caused a massive financial hit but also severely damaged the company's reputation. Similarly, in 1999 NASA lost the $125M Mars Climate Orbiter because of an inconsistency in its units of measurement. The navigation team at the Jet Propulsion Laboratory (JPL) used the metric system of millimeters and meters in its calculations, while Lockheed Martin Astronautics, which designed and built the spacecraft, provided data in the English system of inches, feet, and pounds. This discrepancy resulted in a severe navigation error that pushed the spacecraft dangerously close to the planet's atmosphere, where it broke apart and burned up.

 

Messy Data in Revenue Operations

In revenue operations, messy data can obscure insights and disrupt alignment between marketing, sales, and customer success teams. Inaccuracies in customer information, inconsistent sales figures, and missing data in marketing analytics can derail strategic initiatives and lead to inaccurate revenue forecasts.

 

IBM reported that companies with poor data quality can lose up to 12% of their revenue. This includes missed sales opportunities, inaccurate revenue forecasting, and inefficient marketing spend.

 

A prime example is Facebook's scandal over inflated video-viewing metrics. The company reported significantly higher average video view times due to a calculation error, misleading advertisers about the effectiveness of their campaigns. This messy data issue led to lawsuits, a 2019 settlement, and a credibility hit for Facebook.

 

Strategies to Mitigate the Messy Data Problem

To tackle messy data, organizations can adopt several strategies:

  1. Implement Data Governance: Establishing clear data governance policies ensures data quality and consistency. This includes defining data standards, roles, and responsibilities for data management.
  2. Data Cleaning Tools: Utilizing appropriate data cleaning tools to automate the process of identifying and correcting errors and inconsistencies in datasets.
  3. Regular Data Audits: Conducting regular data quality audits helps identify and rectify issues promptly, preventing them from escalating; a minimal sketch of what such an audit might check appears after this list.
  4. Employee Training: Training employees on data entry best practices and the importance of data accuracy can significantly reduce human errors.
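
As a minimal illustration of point 3, the hypothetical pandas sketch below reports per-column missing rates, distinct-value counts, and fully duplicated rows; a real audit would track far more, but the idea is the same.

```python
import pandas as pd

def quality_audit(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column data-quality report: percent missing and distinct-value count."""
    return pd.DataFrame({
        "missing_pct": (df.isna().mean() * 100).round(1),
        "distinct_values": df.nunique(dropna=True),
    })

# Hypothetical orders table with a casing inconsistency, a missing value, and a duplicate row.
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1004],
    "region":   ["East", "east", "east", None],
})

print(quality_audit(orders))
print("fully duplicated rows:", int(orders.duplicated().sum()))
```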

There is a wide variety of data cleaning tools available on the market today that do a decent job of cleaning. But these tools tend to be cost-prohibitive, complex to set up, or saddled with a steep learning curve. More popular tools such as Excel, SQL, and Python make data cleaning far more accessible, but while powerful, they come with inherent challenges. Scalability, risk of human error, complexity, performance issues, and maintenance are common hurdles data professionals must navigate when using these tools.

Excel is not designed to handle very large datasets efficiently. When working with extensive data, Excel can become slow, unresponsive, and prone to crashes. Data cleaning in Excel often involves a lot of manual work, which increases the risk of human error. Repetitive tasks can become tedious and time-consuming without advanced automation capabilities. Complex data transformations and validations can be cumbersome to implement in Excel.

Python is a versatile and powerful programming language, but it comes with a steep learning curve. Writing efficient data-cleaning scripts requires proficiency in Python programming and familiarity with libraries like Pandas and NumPy. Python can also suffer from performance issues when dealing with very large datasets. And, as with any programmatic tool, code and dependency maintenance can become a barrier to efficient data cleaning and management.
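
For readers who have not written one, a typical cleaning script looks something like the hypothetical sketch below (invented column names, and only a few of the usual steps); even this small example assumes comfort with method chaining, coercion behavior, and the Pandas API.

```python
import pandas as pd

# Hypothetical raw export; in practice this would come from read_csv or a database query.
raw = pd.DataFrame({
    "customer": ["  Ann Lee", "Bob Roy ", "Bob Roy ", "Cara Diaz"],
    "signup_date": ["2023-01-05", "2023-01-07", "2023-01-07", "not a date"],
    "spend": ["120.50", "80", "80", None],
})

clean = (
    raw
    .drop_duplicates()  # drop exact duplicate rows
    .assign(
        customer=lambda d: d["customer"].str.strip(),                             # trim stray whitespace
        signup_date=lambda d: pd.to_datetime(d["signup_date"], errors="coerce"),  # unparseable dates become NaT
        spend=lambda d: pd.to_numeric(d["spend"], errors="coerce"),               # non-numeric spend becomes NaN
    )
)
print(clean)
```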

 

Using Querri to Make Data Cleaning a Breeze

Querri, a natural language-powered data management tool, democratizes data cleansing and management by letting users describe what they want in plain English. By using simple, natural language, data professionals can accomplish complex data-cleaning tasks that would typically require Excel wizardry or programming skills. Querri gives users the power to talk with their spreadsheets!

The Querri team performed an experiment to compare the time and effort required to perform a simple data-cleaning task, converting and standardizing units of measurement between centimeters and feet and between kilograms and pounds, using both Excel and Querri. While Excel required anywhere between 15 and 20 steps and a good 20-25 minutes of formula writing to accomplish the task, the same task could be performed within a couple of minutes using just one simple prompt in Querri!

“Querri prompt: Convert Height and Weight to Feet and Pounds.”
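
For reference, the underlying arithmetic is tiny (1 ft = 30.48 cm, 1 kg ≈ 2.20462 lb); a hypothetical pandas version is shown below, with assumed column names. The effort in Excel comes not from the math but from writing, copying, and verifying the formulas across columns and rows.

```python
import pandas as pd

people = pd.DataFrame({"height_cm": [170.0, 182.5], "weight_kg": [65.0, 90.2]})

# 1 ft = 30.48 cm, 1 kg ≈ 2.20462 lb
people["height_ft"] = (people["height_cm"] / 30.48).round(2)
people["weight_lb"] = (people["weight_kg"] * 2.20462).round(1)
print(people)
```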

There are several benefits to using Querri’s natural language capabilities for data cleaning.  

  • Querri democratizes data cleaning and management: Any English speaker can now become adept at cleaning data without having to undergo specialized training.
  • Querri saves time. Simplicity of prompting = time savings.
  • Querri speeds up decision-making. With the ability to draw complex charts and graphs and make forecasts and predictions using natural language, decision-makers can now analyze data within a matter of hours instead of days.
  • Querri automates repetitive tasks. There is nothing as burdensome and boring as performing repetitive tasks. With Querri, you can automate repetitive tasks and schedule them to run automagically on your required schedule.

But don't take our word for it. Try it for yourself and witness how the power of your words can help you become the Marie Kondo of data cleaning.
