Data Virtualization 2.0: Better, Faster, Cheaper

Two of the most valuable assets in any company are data and people, and there’s a symbiotic relationship between the two. People leverage technology to drive innovation and growth from their data. Larger companies generally have more data than they can effectively manage, and the lack of convenient access to that data often becomes a barrier to the very innovation and growth they’re after. Data virtualization makes connecting people and technology to the multitude of data stores faster, easier, and more manageable.

But like most tech-driven solutions, data virtualization isn’t one-size-fits-all. It’s evolving quickly, pivoting to satisfy the different needs of a highly segmented, always demanding marketplace.  So, let’s take a closer look at data virtualization, how it is evolving, and what it means for companies.

Data Virtualization: The 30,000 Foot View

Today’s data virtualization space is messy, with a lack of coherence across most of the different solutions and providers labeling themselves as “data virtualization.” Our view on data virtualization aligns closely with Gartner’s definition:

Data virtualization technology is based on the execution of distributed data management processing, primarily for queries, against multiple heterogeneous data sources, and federation of query results into virtual views. This is followed by the consumption of these virtual views by applications, query/reporting tools, message-oriented middleware or other data management infrastructure components. Data virtualization can be used to create virtualized and integrated views of data in-memory, rather than executing data movement and physically storing integrated views in a target data structure. It provides a layer of abstraction above the physical implementation of data, to simplify querying logic.

By this definition, storage virtualization tools like Delphix, ETL-based integration approaches, and reporting tools like Tableau are not truly data virtualization solutions, whereas federated platforms like Denodo and Tibco and reverse-federated platforms like Gluent are.

Whether delivered as a federated solution or as a more transparent, reverse-federated approach, data virtualization aims to provide efficient, convenient data access, regardless of data store technology or physical location. By inserting an abstraction layer between applications and disparate data sources, such as data warehouses, data lakes, or cloud-native big data repositories, you decouple the use of the data from its technology and location. This decoupling removes barriers and allows companies to move more quickly with their data initiatives.
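
To make the idea of that abstraction layer concrete, here is a deliberately minimal Python sketch. Everything in it (the VirtualCatalog class, the warehouse_orders and lake_customers stand-ins) is invented for illustration and does not correspond to any vendor’s API; the point is simply that the consuming code refers to logical table names, while the mapping to physical systems lives in one place and can change without touching the application.

```python
# Toy sketch of a data virtualization abstraction layer (illustrative only).
# Two stand-in "sources" represent a relational warehouse and a cloud data
# lake; the VirtualCatalog maps logical table names to whichever source
# currently holds the data, so consumers never reference a physical location.

from typing import Callable, Dict, Iterable

Row = dict
Source = Callable[[], Iterable[Row]]          # anything that can yield rows


def warehouse_orders() -> Iterable[Row]:      # pretend this queries an RDBMS
    return [{"order_id": 1, "customer_id": 10, "amount": 250.0}]


def lake_customers() -> Iterable[Row]:        # pretend this scans a data lake
    return [{"customer_id": 10, "region": "EMEA"}]


class VirtualCatalog:
    """Logical table name -> physical source. Re-pointing a name (say, after
    a migration to the cloud) requires no change to the consuming code."""

    def __init__(self) -> None:
        self._tables: Dict[str, Source] = {}

    def register(self, name: str, source: Source) -> None:
        self._tables[name] = source

    def read(self, name: str) -> Iterable[Row]:
        return self._tables[name]()


catalog = VirtualCatalog()
catalog.register("orders", warehouse_orders)
catalog.register("customers", lake_customers)

# The "application" only knows logical names, not engines or locations.
regions = {c["customer_id"]: c["region"] for c in catalog.read("customers")}
for order in catalog.read("orders"):
    print(order["order_id"], regions.get(order["customer_id"]))
```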


Data Virtualization 1.0: Federated Solutions

For simplicity’s sake, we’re going to refer to federated platforms like those from Denodo and Tibco as Data Virtualization 1.0. These types of solutions typically have three design characteristics, illustrated with a toy sketch after the list:

  1. Data remains in place, avoiding any movement or ETL process.
  2. Access is provided to all the data via a new SQL engine interface.
  3. Performance issues are mitigated by caching data.
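
The sketch below is toy code, not any vendor’s engine, but it shows how these three characteristics tend to combine: rows stay in their source systems, the federation layer pulls them across the network on demand, and a simple cache softens the cost of repeated pulls. It also hints at the drawbacks discussed next, because every queried row still has to travel to the federation layer before it can be joined.

```python
# Rough sketch of the Data Virtualization 1.0 federation pattern (toy code).
# fetch_remote() stands in for a network call that ships every row of a
# remote table to the federation layer, which then joins the data locally
# and caches what it has pulled.

from typing import Dict, List

Row = dict

_REMOTE = {  # stand-ins for tables that live in two separate systems
    ("warehouse", "orders"): [{"customer_id": 10, "amount": 250.0},
                              {"customer_id": 11, "amount": 99.0}],
    ("lake", "customers"):   [{"customer_id": 10, "region": "EMEA"},
                              {"customer_id": 11, "region": "APAC"}],
}

_cache: Dict[str, List[Row]] = {}


def fetch_remote(source: str, table: str) -> List[Row]:
    """Pretend network call: the full row set crosses the wire every time."""
    return list(_REMOTE[(source, table)])


def cached_fetch(source: str, table: str) -> List[Row]:
    """Characteristic 3: cache pulled tables to soften repeated fetches."""
    key = f"{source}.{table}"
    if key not in _cache:
        _cache[key] = fetch_remote(source, table)
    return _cache[key]


def federated_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """The join happens in the federation layer, after both complete row
    sets have already been copied across the network."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


# Data stays where it was born (characteristic 1); the federation layer is
# the new access point (characteristic 2); but every queried row still makes
# the trip, which is what hurts when result sets are large.
result = federated_join(cached_fetch("warehouse", "orders"),
                        cached_fetch("lake", "customers"),
                        key="customer_id")
print(result)
```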

Do these 1.0 solutions provide value? Yes. Unfortunately, these tools also come with some drawbacks:

  • No reduction in expenses. In fact, federated solutions actually increase infrastructure costs.
  • Applications must be rewritten to leverage the new SQL interface.
  • Large result sets from multiple data stores create performance problems due to network latency.

Data Virtualization 2.0: Transparent Solutions

So, what differentiates Data Virtualization 2.0 tools from the previous generation? In many ways, it's simply a natural evolution that attempts to resolve the major issues currently plaguing this space. This next generation of data virtualization should facilitate data centralization, lower infrastructure costs, and be transparent to data consumers.

It’s obvious that cost and transparency should be a focus, but why data centralization? Often, enterprises are saddled with duplicate legacy systems, including multiple data lakes, data marts, and data warehouses. Consolidating this data can provide tangible monetary benefits and significantly improve performance.

Unfortunately, in many cases, the cost of rewriting the affected applications outweighs the benefits of consolidation.  Data Virtualization 2.0 should enable “codeless” data migrations to a highly performant, centralized data platform without breaking the data consumers’ access or compromising security.

There are four requirements for accomplishing these goals:

  1. Data should be automatically migrated to a centralized platform – no programming or ETL required.
  2. Application access to data moved to the centralized platform should be transparent. No code rewrites required, and the applications should be unaware the data has been moved.
  3. All data should be available to any application with appropriate privileges via a transparent presentation layer. 
  4. Processing should be pushed to the centralized platform to eliminate the necessity of copying large result sets across the network (see the sketch after this list).
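
As a rough illustration of requirement 4, the sketch below uses an in-memory SQLite database as a stand-in for the centralized platform: the query, not the data, is shipped to where the data now lives, and only the small aggregated result travels back. It is a simplification; in particular, the transparency layer that keeps the original SQL engines in front of the applications (requirement 2) is not shown.

```python
# Minimal sketch of "push the processing to the data" (requirement 4).
# sqlite3 here is only a stand-in for the centralized platform; in practice
# that would be a cloud or big data back end loaded by the "codeless"
# migration described in requirement 1.

import sqlite3

# The centralized platform, already populated by the migration.
central = sqlite3.connect(":memory:")
central.executescript("""
    CREATE TABLE orders    (order_id INT, customer_id INT, amount REAL);
    CREATE TABLE customers (customer_id INT, region TEXT);
    INSERT INTO orders    VALUES (1, 10, 250.0), (2, 10, 99.0), (3, 11, 40.0);
    INSERT INTO customers VALUES (10, 'EMEA'), (11, 'APAC');
""")


def pushed_down(sql: str) -> list:
    """Forward the query itself to the centralized platform and return only
    the final result set: a handful of aggregated rows cross the network
    instead of millions of detail rows."""
    return central.execute(sql).fetchall()


# The join and aggregation run next to the data, not in a federation layer.
print(pushed_down("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
"""))
```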

Meeting these requirements resolves the major issues with Data Virtualization 1.0 platforms by:

  • Dramatically lowering costs. Data is no longer replicated to multiple systems but centralized into a single, less expensive platform from which other systems can transparently access it. Storage requirements go down, and on-premises CPU utilization drops, reducing hardware and CPU-related software costs.
  • Eliminating application rewrites.  Applications’ access to the data remains unchanged through the original SQL engines.
  • Addressing performance issues associated with joining large data sets across a network.

Additional Benefits of Data Virtualization 2.0

With Data Virtualization 2.0, cloud migrations will become much easier and much faster. With no programming required to move the data and no need to rewrite the applications accessing it, migrations can occur in a few weeks as opposed to months or years. Once the data is in place, enterprises can start taking advantage of the more advanced capabilities of these modern platforms, such as AI and machine learning.

The Bottom Line

Most enterprises are laden with legacy systems that are too expensive to replace due to complexity, cost, or lack of subject matter expertise. These legacy systems, where most data continues to be born, are not well integrated. The most common form of integration today is simply copying the data from one system to another, resulting in additional costs. Increased infrastructure is needed to store and process this duplicate data.

Likewise, additional resources are needed to maintain the growing web of ETL processes that duplicate data. Data Virtualization 1.0 attempted to solve this problem by eliminating the need to copy data between systems, but it has not been widely adopted, primarily due to architectural shortcomings that created significant roadblocks. Data Virtualization 2.0 attempts to eliminate those roadblocks, enabling rapid migration to the cloud, improvements in performance, and reductions in cost, all without the expense and risk associated with rewriting the existing codebase.