Two of the most valuable assets in any company are data and people, and there’s a symbiotic relationship between the two. People leverage technology to drive innovation and growth from their data. Larger companies generally have more data than they can effectively manage, and the lack of convenient access to this data often becomes a barrier to the innovation and growth that companies desire. Data virtualization makes connecting people and technology to the multitude of data stores faster, easier, and more manageable.
But like most tech-driven solutions, data virtualization isn’t one-size-fits-all. It’s evolving quickly, pivoting to satisfy the different needs of a highly segmented, always demanding marketplace. So, let’s take a closer look at data virtualization, how it is evolving, and what it means for companies.
Today’s data virtualization space is messy, with a lack of coherence across most of the different solutions and providers labeling themselves as “data virtualization.” Our view on data virtualization aligns closely with Gartner’s definition:
Data virtualization technology is based on the execution of distributed data management processing, primarily for queries, against multiple heterogeneous data sources, and federation of query results into virtual views. This is followed by the consumption of these virtual views by applications, query/reporting tools, message-oriented middleware or other data management infrastructure components. Data virtualization can be used to create virtualized and integrated views of data in-memory, rather than executing data movement and physically storing integrated views in a target data structure. It provides a layer of abstraction above the physical implementation of data, to simplify querying logic.
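To make the definition concrete, here is a minimal sketch of federation into an in-memory virtual view, using Python with two in-memory SQLite databases as stand-ins for heterogeneous sources (a real platform would speak many drivers and push processing down to each source; the `virtual_view` function and the sample schemas are illustrative assumptions, not any vendor’s API):

```python
import sqlite3

# Stand-ins for two heterogeneous sources -- in practice these might be
# an on-premises warehouse and a cloud data lake behind different drivers.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "EMEA"), (2, "APAC")])

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
billing.executemany("INSERT INTO customers VALUES (?, ?)", [(3, "AMER")])

def virtual_view(sources, query):
    """Run the same query against each source and federate the results
    in memory -- nothing is copied into a physical target store."""
    rows = []
    for conn in sources:
        rows.extend(conn.execute(query).fetchall())
    return rows

view = virtual_view([crm, billing],
                    "SELECT id, region FROM customers ORDER BY id")
print(view)  # [(1, 'EMEA'), (2, 'APAC'), (3, 'AMER')]
```

The consuming application sees one result set; it never learns that the rows came from two different systems.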
Storage virtualization tools like Delphix, ETL integration approaches, and reporting tools like Tableau are not truly data virtualization solutions; federated platforms like Denodo and Tibco and reverse-federated platforms like Gluent are.
Whether a federated solution or a more transparent, reverse-federated approach, the goal of data virtualization is to provide efficient, convenient data access, regardless of data store technology or physical location. By inserting an abstraction layer between applications and disparate data sources, such as data warehouses, data lakes, or cloud-native big data repositories, you decouple the use of the data from its technology and location. This decoupling removes barriers and allows companies to move more quickly with their data initiatives.
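The decoupling described above can be sketched with a simple logical catalog: consumers ask for a dataset by name, and the layer resolves that name to a physical source and table. This is a toy illustration (the `catalog` dictionary and `read` function are assumptions of the sketch; real platforms add query planning, pushdown, and security):

```python
import sqlite3

# In-memory SQLite databases stand in for disparate stores.
warehouse = sqlite3.connect(":memory:")   # stand-in for a data warehouse
lake = sqlite3.connect(":memory:")        # stand-in for a data lake
warehouse.execute("CREATE TABLE wh_customers (id INTEGER, name TEXT)")
warehouse.execute("INSERT INTO wh_customers VALUES (1, 'Acme')")
lake.execute("CREATE TABLE raw_events (id INTEGER, kind TEXT)")
lake.execute("INSERT INTO raw_events VALUES (1, 'login')")

# Logical names decouple consumers from technology and location.
catalog = {
    "customers": (warehouse, "wh_customers"),
    "events": (lake, "raw_events"),
}

def read(logical_name):
    conn, table = catalog[logical_name]
    return conn.execute(f"SELECT * FROM {table}").fetchall()

print(read("customers"))  # [(1, 'Acme')]
print(read("events"))     # [(1, 'login')]
```

Because applications only ever reference the logical names, a source can change technology or location without touching consumer code.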
For simplicity’s sake, we’re going to refer to federated platforms like those from Denodo and Tibco as Data Virtualization 1.0. These types of solutions typically have three design characteristics:
Do these 1.0 solutions provide value? Yes. Unfortunately, these tools also come with some drawbacks:
So, what differentiates Data Virtualization 2.0 tools from the previous generation? In many ways, it's simply a natural evolution that attempts to resolve the major issues currently plaguing this space. This next generation of data virtualization should facilitate data centralization, lower infrastructure costs, and be transparent to data consumers.
It’s obvious that cost and transparency should be a focus, but why data centralization? Often, enterprises are saddled with duplicate legacy systems, including multiple data lakes, data marts, and data warehouses. Consolidating this data can provide tangible, monetary benefits, and significantly improve performance.
Unfortunately, in many cases, the cost of rewriting the affected applications outweighs the benefits of consolidation. Data Virtualization 2.0 should enable “codeless” data migrations to a highly performant, centralized data platform without breaking the data consumers’ access or compromising security.
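A "codeless" migration follows directly from that abstraction: move the data, then repoint the catalog entry, and consumers keep issuing the same calls. A self-contained sketch under the same assumptions (in-memory SQLite as stand-ins, an illustrative `catalog`/`read` indirection):

```python
import sqlite3

# Legacy warehouse, played here by in-memory SQLite.
legacy = sqlite3.connect(":memory:")
legacy.execute("CREATE TABLE sales_v1 (id INTEGER, amount REAL)")
legacy.executemany("INSERT INTO sales_v1 VALUES (?, ?)",
                   [(1, 10.0), (2, 20.0)])

catalog = {"sales": (legacy, "sales_v1")}  # logical name -> (source, table)

def read(logical_name):
    conn, table = catalog[logical_name]
    return conn.execute(f"SELECT * FROM {table} ORDER BY id").fetchall()

before = read("sales")

# Migrate: load the data into the centralized platform, then repoint
# the catalog entry. Consumers' read("sales") calls never change.
central = sqlite3.connect(":memory:")
central.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
central.executemany("INSERT INTO sales VALUES (?, ?)",
                    legacy.execute("SELECT * FROM sales_v1"))
catalog["sales"] = (central, "sales")

after = read("sales")
print(before == after)  # True: consumers see identical data post-migration
```

The application rewrite the paragraph above warns about is avoided because the only thing that changed is the catalog mapping, not any consumer code.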
There are four requirements for accomplishing these goals:
Meeting these requirements resolves the major issues with Data Virtualization 1.0 platforms by:
With Data Virtualization 2.0, cloud migrations will become much easier and much faster. With no programming required to move the data and no need to rewrite the applications accessing the data, migrations can occur in a few weeks as opposed to months or years. Once the data is in place, enterprises can start taking advantage of the more advanced capabilities of these modern platforms, such as AI and machine learning.
Most enterprises are laden with legacy systems that are too expensive to replace due to complexity, cost, or lack of subject matter expertise. These legacy systems, where most data continues to be born, are not well integrated. The most common form of integration today is simply copying the data from one system to another, resulting in additional costs. Increased infrastructure is needed to store and process this duplicate data.
Likewise, additional resources are needed to maintain the growing web of ETL processes which duplicate data. Data Virtualization 1.0 attempted to solve this problem by eliminating the need to copy data between systems, but has not been widely accepted, primarily due to architectural shortcomings which have created significant roadblocks to its adoption. Data Virtualization 2.0 attempts to eliminate those roadblocks, enabling rapid migration to the cloud, improvements in performance, and reduction in costs, all without the expense and risk associated with rewriting the existing codebase.