How Much Does Your Data Duplication Problem Cost Your Enterprise?

Data is a crucial component of today’s business. Companies require data to embark on a digital transformation journey, establish a data science practice, or perform merger analysis. How effectively companies manage their data can be the difference between a good decision and a bad one. How quickly companies can leverage their data can be the difference between a new opportunity and one that is lost.

Companies with efficient and effective practices for making data-informed decisions will be best equipped for continued future success. Conversely, companies without those practices will be at a disadvantage. Incredibly, poor data management costs companies an estimated $3.1 trillion annually in the U.S. alone.

A top inhibitor to sound data management is duplicate data. There are many reasons companies duplicate so much data; more often than not, expediency and perceived cost avoidance are the primary drivers. Duplicating data can be the path of least resistance when dealing with system interfaces – nightly refreshes, for example. Unfortunately, although building ETL pipelines to replicate data seems like the easiest approach, these practices actually drive up long-term costs. We’re going to take a closer look at the sources of this problem and how enterprises can address them.

Storage Expense

Obviously, storing the same data multiple times drives up storage costs. But it’s not just storage: servers need to attach to that storage, and database software is typically required to manage access to the data. In more modern architectures, this may also include SaaS with a usage-based charge on top of the storage costs.

How often do companies keep historical data within a billing system, only to store the same data in a data warehouse and a CRM? Telecommunications companies are notorious for keeping multiple copies of call detail records (CDRs) scattered throughout the organization, hidden away on mainframes, in data lakes, or in the cloud. In our own experience, it is not uncommon to see data redundancy rates as high as 50%.

Regardless of how cheap storage may be, rampant data duplication creates substantial costs for enterprises. These costs typically show up as infrastructure costs. And while some companies are mature enough to allocate costs by application or even by business function, few have the maturity to understand what data is duplicated across the applications they manage. As a result, IT Operations leaders buy storage in bulk – and usually overbuy, because of the length of time required for procurement and deployment.

ETL and Labor Costs

Creating interfaces between systems by using ETL to copy data is a simple and fast approach that often has little upfront cost. Over time, though, these copies of data will begin to diverge. Whether because of inconsistent timing, code bugs, or modifications to business logic, the data will tend to fall out of sync, or drift. This is an insidious form of bad data, and its long-term impacts can be staggering. Why is bad data estimated to cost U.S. companies $3.1 trillion annually? A primary factor is the hidden labor cost of reconciling inconsistent data.
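A minimal sketch of the reconciliation work that drifted copies create: compare a source system’s records against a nightly-refreshed copy and flag rows that are missing or out of sync. The record layout, key field, and data here are hypothetical illustrations, not any particular system’s schema.

```python
def find_drift(source, copy, key="order_id"):
    """Return (missing_in_copy, mismatched) keys between two record lists."""
    copy_by_key = {row[key]: row for row in copy}
    missing, mismatched = [], []
    for row in source:
        other = copy_by_key.get(row[key])
        if other is None:
            missing.append(row[key])      # row never made it into the copy
        elif other != row:
            mismatched.append(row[key])   # row drifted after the last refresh
    return missing, mismatched

source = [
    {"order_id": 1, "status": "shipped"},
    {"order_id": 2, "status": "pending"},
    {"order_id": 3, "status": "cancelled"},
]
nightly_copy = [
    {"order_id": 1, "status": "shipped"},
    {"order_id": 2, "status": "shipped"},  # refreshed before a later update
]

missing, mismatched = find_drift(source, nightly_copy)
print(missing)     # order 3 is absent from the copy
print(mismatched)  # order 2 disagrees between the two systems
```

Every mismatch this kind of check surfaces is a ticket for someone to investigate – the hidden labor cost described above.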

Business analyst teams, which many companies staff with ETL developers, can spend as much as 80% of their time maintaining data pipelines, leaving very little time for actually analyzing data. Copying data also produces tightly coupled systems – an architectural nightmare. An enhancement in one system can spawn changes in many downstream applications. Altogether, development costs go up while productivity and innovation go down.

Inaccurate KPIs, Reporting, and Audit Issues

Peter Drucker famously stated, “If you can’t measure it, you can’t improve it.” In the quest for improvement, companies rely on key metrics to gauge and predict performance against their internal targets as well as the competition. However, when leadership relies on duplicated data to generate its KPIs, accuracy will inevitably suffer. By contrast, when companies use a reliable, single source of truth, they establish an accurate barometer of where they stand, ultimately driving better decision making and profitability. Studies have shown that 12% of overall revenue is wasted due to data quality issues.

The same issue can also impact financial reporting. Information sourced from multiple data sets spread across various systems can quickly undermine the accuracy of the critical record-to-report function. Not only will this skew financial performance metrics, but it can also negatively impact compliance obligations. No enterprise wants to restate or delay its financials due to data integrity issues.

Similarly, unnecessary data duplication makes SOX and PCAOB audits even more arduous. Future compliance changes should also be considered when assessing the impact of data duplication and its inherent challenges. A recent example is revenue recognition: many companies struggled with new revenue recognition policies, in large part because of the additional data needed to categorize accounting transactions. Bringing supplementary, inconsistent data into an ERP can create a host of issues.

Customer Service and Engagement

A CRM is supposed to streamline communication with customers, drive better service, and increase engagement with the target audience through a unified platform. However, even the most effective CRM systems are only as good as the data feeding them. When customer records have duplicate entries, missing information, or a mismatch in data due to system extract timings, improving the customer experience can be challenging. 

A common scenario: a customer calls in with a question about an order placed earlier in the day, but the CRM has no record of it because of nightly refreshes. The customer service rep must swivel-chair between systems to answer the customer’s question accurately. Worse still are order status mismatches between the CRM and the ordering system. Imagine the customer’s confusion and the additional overhead customer service reps incur.

A CRM system built off of duplicated data will eventually have to be rearchitected to achieve its expected benefits. This is the equivalent of building a house and then immediately undertaking a massive, and sometimes protracted, remodeling effort.

Bad data doesn’t just impact servicing existing customers; it also impacts selling to existing and prospective ones. If marketing teams cannot eliminate duplicate records during marketing campaigns, a company runs the risk of annoying its target audience. According to Gartner, that annoyance can trigger a 25% reduction in potential revenue.
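As a hedged sketch of the campaign-list cleanup described above, the snippet below collapses contact records that share a normalized email address so a prospect is only messaged once. The record layout and the email-only matching rule are illustrative assumptions; real deduplication usually matches on several normalized fields.

```python
def dedupe_contacts(contacts):
    """Keep the first record per normalized email; drop later duplicates."""
    seen = set()
    unique = []
    for contact in contacts:
        key = contact["email"].strip().lower()  # simple normalization rule
        if key not in seen:
            seen.add(key)
            unique.append(contact)
    return unique

contacts = [
    {"name": "Pat Lee",      "email": "pat.lee@example.com"},
    {"name": "Patricia Lee", "email": "Pat.Lee@example.com "},  # same person
    {"name": "Sam Roy",      "email": "sam.roy@example.com"},
]
print(len(dedupe_contacts(contacts)))  # 2 unique contacts
```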

These interactions and miscommunications can result in poor experiences, which, in turn, can generate negative reviews. Recent research indicates that just one negative review on social media equates to a 22% drop in potential customers, jumping to 59% with three bad reviews. 

Addressing the Data Quality Problem

So how does an enterprise address these data issues? The company must eliminate the copying of data and address the root causes that lead to that behavior. This requires organizations to invest in a systematic, targeted cleanup of the environment, establish some level of data governance, and address any technology and infrastructure gaps. Gaps in data-sharing technology and infrastructure are common and often result in individual teams filling the void with ETL.

In many cases, there are perceived barriers driving teams to duplicate data. The idea that IT and business users cannot coexist in the same environment encourages business teams to build their own systems – duplicated data included. This belief is founded on valid experiences – change control, lack of supporting infrastructure, lack of communication, and prioritization conflicts, to name a few. Ultimately, companies need to find a way to make sharing data easier than duplicating data. One such method is through data virtualization.

Enabling users and applications to access data easily and seamlessly removes the need to make copies of it. Allowing those users and applications to access all data from a centralized data store through existing SQL interfaces eliminates any need to rewrite code. Of course, this requires that data be moved, one time, from the system where it was born and then presented to anyone who wants it and has the necessary privileges to access it. Transparent data virtualization solutions do exactly this.
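The idea can be sketched with SQLite standing in for a centralized store: land the data once, then expose it to consumers through SQL views instead of per-team extracts. The table, view, and column names are hypothetical, and a real data virtualization platform does far more (security, federation, caching), but the pattern is the same.

```python
import sqlite3

# One centralized store; the data is landed exactly once.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", "shipped"), (2, "globex", "pending")],
)

# Each consumer gets a view -- a live window onto the single copy --
# rather than its own nightly extract.
conn.execute(
    "CREATE VIEW open_orders AS "
    "SELECT order_id, customer FROM orders WHERE status = 'pending'"
)

rows = conn.execute("SELECT order_id, customer FROM open_orders").fetchall()
print(rows)  # the view reflects the source immediately; nothing was copied
```

Because consumers query the view with ordinary SQL, existing reporting code keeps working while the duplicate pipelines behind it go away.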

While transparent data virtualization is just part of an overall data management solution, implementing such a solution will create a path for improved data sharing. It will help eliminate the duplicative data problem. Business users will spend more time using data and less time copying data. The results will be improved quality, greater innovation, and better data-driven decisions.