By Jelani Harper

One of the most significant ramifications of the increasing decentralization found throughout the contemporary data landscape is the need to move data, particularly for cloud and analytics deployments.

The cloud has become essential for integrating data from myriad source systems. It ties together not only geographically dispersed data, but also commonly found sources such as social media, mobile technologies, on-premises applications, and edge computing or IoT use cases.

When one factors in the various options for engaging in aspects of machine learning and cognitive computing in the cloud (predominantly via Service Oriented Architecture), the need to consistently migrate data to and from the cloud is one of the most pressing concerns for the enterprise today, especially given broader requirements such as disaster recovery, business continuity, and system backups.

The problem is that moving data has traditionally been considered a bottleneck, particularly when the data ecosystem was populated by ponderous, batch-oriented systems. However, a number of innovations in the data replication space have transformed what was once considered a fundamental problem into an opportunity for profit.

“There’s an ongoing need for data integration, and we think that continues to be the case,” HVR CTO Mark Van de Wiel said. “The flavors of the technology that actually store the data, they come and go. But especially now with the cloud making it really easy to spin up services in a scalable manner with the pay-as-you go model, it makes it really effective for an organization to actually shift [resources] from one [place] to another.”

Effectual Cloud Migrations

One of the most effective modern approaches for replicating data to and from the cloud is Change Data Capture (CDC). After an initial batch migration, this method allows organizations to replicate only the changes to their data in near real time. Although the means for implementing CDC vary, one of the more beneficial is to “actually read the transaction logs,” HVR Software CEO Anthony Brooks-Williams noted. “Every database has transaction logs which are typically used to back up the databases every 10 minutes, 15 minutes, 5 minutes, [or] couple of hours so that they can recover to the nearest point. The easiest and the most efficient way to do this Change Data Capture is to scrape those log files.” Organizations that replicate incremental data updates via CDC in this manner gain a number of advantages for their computing environments, including:

Low Latency: By sending only the changes to the datasets involved, as opposed to replicating those datasets in full, organizations are able to migrate data in close to real time.

Network Resource Utilization: By replicating only the changes to the data, organizations are also able to optimize network resources. Van de Wiel referenced a SAP use case which consists of well over 30,000 tables. “Usually customers are not interested in changes to all 30,000 tables, but only the 100 that they’re actually interested in,” Van de Wiel acknowledged. “We’re moving the changes of only the few 100 tables.” Compression techniques further decrease network strain and make network utilization more efficient.

Low Source System Impact: Basing CDC on log data minimizes any impact to the source system, particularly when the replication method is asynchronous. “There’s other means of doing a Change Data Capture,” Van de Wiel mentioned. “But some of those other approaches impact the source application.” Impacting those applications can slow the process and increase the network resources required for what should be timely replication.
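
To make the mechanics concrete, the following is a minimal Python sketch of the log-based approach described above. It is an illustration under stated assumptions, not HVR's implementation: the `LogEntry` record, the in-memory `log` list standing in for a database transaction log, and the helper names `capture_changes`, `pack`, `unpack`, and `apply_changes` are all hypothetical. It shows three ideas from this section: reading only the entries written since the last checkpoint, keeping only the tables the customer cares about, and compressing the batch before it crosses the network.

```python
import json
import zlib
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional, Set


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical stand-in for one record in a database transaction log."""
    lsn: int                              # log sequence number (position in the log)
    table: str
    op: str                               # "insert", "update", or "delete"
    key: Any                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # new row image; None for deletes


def capture_changes(log: Iterable[LogEntry], last_lsn: int,
                    tables: Set[str]) -> List[LogEntry]:
    """Keep only entries newer than the checkpoint that touch replicated tables."""
    return [e for e in log if e.lsn > last_lsn and e.table in tables]


def pack(entries: Iterable[LogEntry]) -> bytes:
    """Serialize a batch of changes and compress it for transfer."""
    payload = json.dumps([[e.lsn, e.table, e.op, e.key, e.row] for e in entries])
    return zlib.compress(payload.encode("utf-8"))


def unpack(blob: bytes) -> List[LogEntry]:
    """Reverse of pack(): decompress and rebuild the log entries."""
    return [LogEntry(*fields) for fields in json.loads(zlib.decompress(blob))]


def apply_changes(target: Dict[str, Dict[Any, Any]],
                  entries: Iterable[LogEntry], last_lsn: int) -> int:
    """Apply changes to the target in log order; return the new checkpoint."""
    for e in sorted(entries, key=lambda entry: entry.lsn):
        rows = target.setdefault(e.table, {})
        if e.op == "delete":
            rows.pop(e.key, None)
        else:  # inserts and updates both write the latest row image
            rows[e.key] = e.row
        last_lsn = max(last_lsn, e.lsn)
    return last_lsn


# Illustrative data only: a tiny "transaction log" covering two tables.
log = [
    LogEntry(1, "orders", "insert", 1, {"status": "new"}),
    LogEntry(2, "audit_trail", "insert", 7, {"note": "login"}),
    LogEntry(3, "orders", "update", 1, {"status": "shipped"}),
    LogEntry(4, "orders", "insert", 2, {"status": "new"}),
    LogEntry(5, "orders", "delete", 2),
]

target: Dict[str, Dict[Any, Any]] = {}
batch = capture_changes(log, last_lsn=0, tables={"orders"})
blob = pack(batch)                                # what would travel over the network
checkpoint = apply_changes(target, unpack(blob), last_lsn=0)
print(target, checkpoint)
```

Because the capture step only reads the log, the source application is never queried directly in this sketch. A real deployment would read the database's own log format and persist the checkpoint so replication can resume after an interruption.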

Real-Time Financial, Business Value

It’s critical to note the correlation between the preceding computing benefits and those impacting an organization’s finances. Cloud pricing is based on usage, so the less time spent replicating data to and from the cloud, the less organizations pay for doing so. The same holds for bandwidth: the less bandwidth required for data replication, the lower this expense is. Moreover, the less the source system is impacted, the greater its productivity and efficiency.
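
The bandwidth argument can be illustrated with simple arithmetic. The Python sketch below compares the data volume moved by a daily full refresh against an initial load followed by daily incremental changes. The function name `transfer_volume_gb` and every figure in the example (dataset size, change rate, time span) are hypothetical and chosen only for illustration; they do not come from HVR or any customer.

```python
def transfer_volume_gb(dataset_gb: float, daily_change_rate: float, days: int) -> dict:
    """Compare data moved by daily full refreshes vs. one load plus daily changes.

    Assumes (for illustration) a constant dataset size and a constant
    fraction of the data changing each day.
    """
    # Full refresh: the entire dataset is copied every day.
    full_refresh = dataset_gb * days
    # Incremental: one initial load, then only the changed fraction on later days.
    incremental = dataset_gb + dataset_gb * daily_change_rate * (days - 1)
    return {"full_refresh_gb": full_refresh, "incremental_gb": incremental}


# Hypothetical figures: a 1,000 GB dataset, 2% of it changing daily, over 30 days.
volumes = transfer_volume_gb(dataset_gb=1000, daily_change_rate=0.02, days=30)
print(volumes)
```

Under these assumed figures the incremental approach moves roughly a twentieth of the data. The actual ratio depends entirely on how much of the data changes and how often.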

Nonetheless, the greater financial benefits likely stem from the business value generated by swiftly integrating data from different sources for a wide range of enterprise purposes. Van de Wiel described a financial services use case in which traders analyze a real-time transactional system in conjunction with an assortment of other sources, including social media and stock market analyses, to “analyze the data that’s fed into real-time in other systems in order to minimize risk and make trading decisions.” One of the most common use cases for migrating data with sources involving the cloud is to feed data lakes. In nearly all of these scenarios, organizations gain from the ready accessibility and integration of co-located data.

Decentralized Applications

The use cases and opportunities for profiting from replicating data in near real time are certainly abundant. However, there are still other pragmatic realities for which replicating data to the cloud is becoming almost a de facto standard. Foremost among these is simply the decentralized data landscape, and the fact that very rarely is an organization’s data all sitting on-premises behind the comfort of its own firewall.

Issues such as regulatory compliance require certain data to reside in one location, whereas other locations are advantageous for additional data sources. “For some applications, and arguably some of them are more legacy, although there’s some newer applications to which this applies as well, it’s really important to have the data local to the user or the application,” Van de Wiel said.

When leveraging this data with sources located in other areas, asynchronous cloud replication via CDC is a viable option. “If we’re connecting from here in the United States to a system that’s hosted in Australia or India, we notice that the delays are much higher compared to the same application running in a local data center,” Van de Wiel maintained. Other commonplace reasons for replicating data to the cloud involve high availability and the ability to recover quickly from disasters to maintain business continuity.

Additional Possibilities

Moving data is no longer the bottleneck it once was when batch-oriented processing dominated the data landscape. Replication approaches such as CDC (and log-based CDC in particular) enable data to be transferred between locations quickly and cost-effectively, with efficient use of the network and minimal impact on source systems.

As such, the opportunities for profiting from the replication (and ensuing integration) of data between geographic locations are expanding, keeping pace with the expansion of the distributed data sphere itself. “The whole premise is based around the idea of heterogeneous data integration, and making it easy…to add sources,” Van de Wiel explained.

Jelani Harper is an editorial consultant servicing the information technology market, specializing in data-driven applications focused on semantic technologies, data governance and analytics.