From Source to Destination with the Scramjet

Understanding the Value of Data Integration in Business

Applying knowledge drawn from data to everyday business activity lies at the very heart of data integration. When businesses have a solid grasp of their data - where it comes from, where it's going, how it's transformed along the way - they're better equipped to make strategic decisions, optimize operations, drive innovation, and enhance customer experiences. Indeed, the actual goal of data integration is to apply the insights gained from data to foster more informed decision-making and improve overall business performance.

However, achieving this ideal state is often complicated by the complexity of connecting distributed environments. Data in today's digital age doesn't just sit in one place - it's scattered across different databases, applications, systems, and even geographical locations. Integrating this data requires crossing networks, connecting APIs, and ensuring rigorous security measures are in place - all of which can significantly extend the time required for any integration.

The Complexities of Connecting Distributed Environments

Take the common adage in the industry: "the customer wants dashboards". This saying captures the desire for readily available, visual representations of data that provide meaningful insights at a glance. Yet, a dashboard or a cloud API is not the end of the process - it's merely a tool for viewing and analyzing the data. The actual value lies in the actionable insights derived from this data, and the ability to use these insights to drive business performance.

Moreover, there's a prevailing shortfall in the current cloud industry: the inability to use the data immediately. Once data has been collected and processed, it's often transferred to the cloud, where it may sit idle before it can be acted upon. This lag reduces the real-time relevance of the data and diminishes its potential impact.

A solution to this challenge could be a unified runtime supervisor and data exchange platform. This platform would not only oversee the process of data integration from end to end but also ensure that the data is instantly usable at the destination point. It would bridge the gap between data source and data destination, enabling businesses to swiftly capitalize on their data, regardless of where it originates or where it's processed.

This solution, if implemented effectively, could transform the landscape of data integration. It could enhance data utility, improve data-driven decision-making, and offer businesses the ability to extract the most value from their data in real-time. Thus, this unified platform could serve as a crucial component in businesses' ongoing quest to become truly data-driven.

This was the idea that sparked Scramjet as a company back in May 2020.

Possible Data Sources and the Value of Processing Data at Source

Data sources can be extraordinarily diverse, both in nature and in complexity. Here are a few examples:

Databases: These can be SQL (like PostgreSQL, MySQL, Oracle) or NoSQL (like MongoDB, Cassandra), or more specialized systems such as graph databases (Neo4j) and wide-column or columnar stores (Google Bigtable, Amazon Redshift).
APIs: Many software tools and platforms offer APIs that allow businesses to directly pull data for their use. This could include social media platforms (like Twitter, Facebook), CRM systems (like Salesforce), or financial systems (like Stripe).
Files: These can be CSVs, Excel files, text files, etc. They could be stored locally or in the cloud (like on Amazon S3).
Streaming data: This is real-time data that is continuously generated by various sources. It could be IoT devices, social media streams, website clickstreams, financial transactions, etc.
Web: Web scraping is another way to source data, by extracting data from websites.
Legacy systems: Older, often proprietary systems in a company can also be sources of data.
Data warehouses and data lakes: These are repositories that store vast amounts of raw data in its native format until it's needed.

The collection process varies depending on the type of source. It could involve writing SQL queries to extract data from a database, using API calls to retrieve data from a software platform, implementing a web scraping tool, or setting up a real-time data pipeline.
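
To make the database case concrete, here is a minimal sketch of an extraction step. It uses an in-memory SQLite database as a stand-in for a real source system; the `orders` table, its columns, and the `extract_orders` helper are assumptions made for illustration only.

```python
import sqlite3

def extract_orders(conn, min_total):
    """Pull only the rows and columns that downstream steps need."""
    cursor = conn.execute(
        "SELECT id, customer, total FROM orders WHERE total >= ? ORDER BY id",
        (min_total,),  # parameterized query, never string concatenation
    )
    columns = [d[0] for d in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

# Hypothetical in-memory database standing in for a real source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 15.5), ("carol", 310.25)],
)

rows = extract_orders(conn, min_total=100)
print(rows)
```

Filtering in the query itself, rather than after the fact, keeps the amount of data leaving the source as small as possible.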

The value of processing data at its source, also known as "edge processing," is considerable. First, it allows for the reduction of data that needs to be sent to the cloud or a central repository, which can save bandwidth and reduce costs. Second, it can increase speed and responsiveness, as data can be processed immediately without the latency of sending data back and forth. Third, it can improve data privacy and security, as sensitive data can be anonymized or processed locally without being transmitted.

Moreover, processing data at the source can allow for real-time or near-real-time insights. For instance, an IoT device or sensor can process and analyze the data it generates on the spot, enabling immediate reactions to changes or anomalies. This can be vital in industries like manufacturing or healthcare, where instant decisions based on real-time data can lead to improved operational efficiency, better patient outcomes, and even saved lives.
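
As a rough illustration of the bandwidth argument, the sketch below reduces a stream of raw sensor readings to one small summary per window and flags anomalies locally. The window size, the threshold, and the function names are assumptions chosen for the example, not part of any particular product.

```python
from statistics import mean

def summarize_window(readings, threshold):
    """Reduce a window of raw readings to one small record."""
    return {
        "count": len(readings),
        "mean": round(mean(readings), 2),
        "max": max(readings),
        "anomaly": max(readings) > threshold,  # decided on the device
    }

def edge_process(stream, window_size, threshold):
    """Yield one summary per full window instead of every raw reading."""
    window = []
    for value in stream:
        window.append(value)
        if len(window) == window_size:
            yield summarize_window(window, threshold)
            window = []

# Hypothetical temperature readings from a single sensor.
raw = [20.1, 20.3, 20.2, 20.4, 35.0, 20.2, 20.1, 20.0]
summaries = list(edge_process(raw, window_size=4, threshold=30.0))
print(summaries)
```

Here eight raw readings become two records, and the anomaly in the second window is known before anything is sent over the network.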

The Transformation Journey: ETL/ELT, Integration Tools, and Cloud APIs

Once data has been collected from various sources, it typically needs to be transformed into a format that can be easily understood and utilized. This transformation process might involve tasks like data cleaning, data mapping, data conversion, data merging, data enrichment, and more.

The goal of this step is to standardize and structure the data in a way that is suitable for analysis, thereby increasing its quality and usability. Without this step, the data might be too messy, diverse, or complex to derive meaningful insights from it.
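
A small sketch can show what cleaning, mapping, and conversion look like in practice. The source field names, the target schema, and the `transform` function below are hypothetical; a real pipeline would take them from its own data contracts.

```python
from datetime import datetime, timezone

# Hypothetical mapping from source field names to the target schema.
FIELD_MAP = {"Cust Name": "customer", "Amt": "amount", "TS": "timestamp"}
REQUIRED = ("customer", "amount", "timestamp")

def transform(record):
    """Clean, map, and convert one raw record; return None if unusable."""
    # Mapping: rename known fields and ignore anything else.
    mapped = {FIELD_MAP[k]: v for k, v in record.items() if k in FIELD_MAP}
    # Cleaning: drop records that miss a required field.
    if any(mapped.get(field) in (None, "") for field in REQUIRED):
        return None
    # Conversion: normalize text, parse numbers, use ISO 8601 timestamps.
    return {
        "customer": mapped["customer"].strip().title(),
        "amount": float(str(mapped["amount"]).replace(",", "")),
        "timestamp": datetime.fromtimestamp(
            int(mapped["timestamp"]), tz=timezone.utc
        ).isoformat(),
    }

raw_records = [
    {"Cust Name": "  alice smith ", "Amt": "1,200.50", "TS": "1700000000"},
    {"Cust Name": "", "Amt": "10", "TS": "1700000100"},  # missing customer
]
clean = [r for r in map(transform, raw_records) if r is not None]
print(clean)
```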

Several tools and technologies facilitate this process:

ETL/ELT tools: ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) tools, such as Informatica, Talend, and Fivetran, help businesses extract data from various sources, transform it into a usable format, and load it into a data warehouse for analysis.
Data integration tools: Tools like MuleSoft, IBM InfoSphere, and Oracle Data Integrator provide a wide range of capabilities for integrating data from different sources and managing this data effectively.
Large Language Models (LLMs): LLMs like GPT-4 can help analyze, structure, and make sense of text data. They can be used to derive insights from unstructured data like customer reviews, social media comments, and more.
Data Warehouses: Solutions like Amazon Redshift, Google BigQuery, and Snowflake offer robust platforms for storing and analyzing large volumes of data.
Cloud APIs: These can be used to extract data from various cloud-based applications and platforms. Providers include AWS, Google Cloud, and Microsoft Azure.

Despite the availability of these tools, it is often impractical or impossible to run them near the source of the data. The reasons for this vary:

Resource-intensive: Data transformation can be a heavy task requiring significant computational resources. Many source systems aren't designed to handle such tasks alongside their primary functions.
Consistency: It can be more efficient and error-proof to transform data after it has been centralized, to ensure consistency across different data sources.
Complexity of transformations: Some transformations are complex and require the power and flexibility of specialized tools or environments.
Security and compliance: Some data may need to be anonymized or handled in specific ways for legal and compliance reasons before it can be transformed.
No availability: Some tools are offered only as SaaS solutions and cannot be run on premises or in the other places where we need them.

So, while processing data at its source has advantages, it isn't always feasible, which is why a separate transformation step remains an essential component of the data integration process. The trade-off is clear: we lose the value of processing the data locally and incur the cost of delivering it to cloud solutions.

Data Destination in Action: Industry-Standard Tools and Services

At the destination, the ultimate goal is to act on the data that has been integrated, transformed, and delivered. The tools and services at this stage help derive actionable insights from the data and apply these insights in meaningful ways. Some current industry-standard tools for acting on data at the destination include:

Business Intelligence (BI) Tools: BI tools like Tableau, Power BI, Looker, and QlikView allow users to visualize, analyze, and report on the data. They provide dashboards, reports, and interactive exploratory environments, enabling users to gain insights and make data-driven decisions.
Data Analytics Platforms: Platforms like SAS, IBM SPSS, and RapidMiner offer advanced analytics capabilities, including predictive analytics, statistical modeling, and machine learning.
Machine Learning Platforms: Libraries like Google's TensorFlow, scikit-learn, and PyTorch provide the building blocks needed to build, train, and deploy machine learning models that can act on the data.
Automation Platforms: Platforms like Zapier or IFTTT can automatically trigger actions based on certain data conditions. This can help automate many repetitive tasks.

The final data destination is where the data is acted upon, and it can vary depending on the context. Here are a few examples of what can be considered a final data destination:

Business Applications: This could be CRM systems (like Salesforce), marketing automation platforms (like Marketo), or ERP systems (like SAP). The data could be used to automate processes, drive personalized experiences, or inform strategic decisions.
Databases and Data Warehouses: The processed data might be stored in databases or data warehouses for future analysis and reporting. This could include SQL databases, NoSQL databases, or cloud-based data warehouses like Amazon Redshift or Google BigQuery.
Real-Time Systems: In some cases, the data might be fed into real-time systems for instant action. This could include IoT systems, real-time bidding platforms in digital advertising, or fraud detection systems in banking and financial services.
Machine Learning Models: The data could be used to train machine learning models, which can then be used to make predictions, categorize data, or detect anomalies.
Human Users: Sometimes, the final destination might simply be a human user, who reviews the data in a dashboard or report and makes decisions based on it.

The concept of a final data destination isn't fixed; it depends on the specific use case, business goals, and the infrastructure in place. The key is that the data, once it reaches this destination, should provide actionable insights or enable more informed decision-making.

Acting on Data: Manual vs. Automated Processes

Once data is collected, transformed, and integrated, it becomes a potent resource that businesses can use in a variety of ways. The actual usage can be split into two main categories: manual and automated.

Manual Usage: Manual usage typically involves human interaction with the data. Examples of manual usage include:
- Dashboards: Data visualization tools like Tableau, Power BI, and Looker allow users to interact with the data through dynamic dashboards. They can slice and dice the data in different ways to gain insights and make data-driven decisions. For instance, a retail business might use a dashboard to track sales performance across different regions and products.
- Reports: Periodic reports can provide a snapshot of business performance over a particular time frame. An executive might review a monthly report to understand trends in customer behavior, product sales, or market dynamics.
- Notifications/Alerts: Some systems can generate alerts based on predefined criteria. For example, a stock trading firm might set up alerts when certain stocks hit predefined price levels.

Automated Usage: Automated usage involves systems directly utilizing the data without human intervention. Examples of automated usage include:
- API Integration: Applications can use APIs to pull in relevant data and use it in real-time. For instance, a travel booking website might integrate with airline APIs to fetch and display real-time flight information.
- Real-time Protocols: Protocols like MQTT (used in IoT devices) allow for real-time exchange of data. A home automation system might use such protocols to control devices based on sensor data (e.g., turning on the AC when room temperature crosses a threshold).

The delivery of data to its destination can be a complex process, depending on the destination's nature and requirements.

For instance, delivering data behind firewalls or into Virtual Private Clouds (VPCs) might require secure tunneling protocols or specific network configurations. The receiving systems might need data to be delivered in certain formats or via specific protocols. Intermittent connectivity, which can be an issue in remote areas or underdeveloped regions, might require the ability to store data locally and forward it once connectivity is restored.
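
The store-and-forward idea can be sketched in a few lines. This is a simplified, in-memory illustration: the `StoreAndForward` class and the `send` callable are hypothetical, and a production version would persist the buffer to disk and add retry timing.

```python
from collections import deque

class StoreAndForward:
    """Buffer outgoing records locally and flush them when the link is up."""

    def __init__(self, send, max_buffered=1000):
        self._send = send  # callable that raises ConnectionError on failure
        self._buffer = deque(maxlen=max_buffered)  # oldest records drop when full

    def submit(self, record):
        """Queue a record, then try to deliver everything that is pending."""
        self._buffer.append(record)
        self.flush()

    def flush(self):
        """Send buffered records in order; stop at the first failure."""
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                return False  # keep the record for the next attempt
            self._buffer.popleft()
        return True

    @property
    def pending(self):
        return len(self._buffer)

# Hypothetical flaky link, simulated with a flag.
link = {"up": False}
delivered = []

def send(record):
    if not link["up"]:
        raise ConnectionError("link down")
    delivered.append(record)

queue = StoreAndForward(send)
queue.submit({"seq": 1})
queue.submit({"seq": 2})
pending_while_offline = queue.pending  # both records are held locally

link["up"] = True
queue.submit({"seq": 3})  # connectivity is back: the backlog is sent first
print(pending_while_offline, delivered)
```

Records are removed from the buffer only after a successful send, so ordering is preserved and nothing is lost while the link is down, as long as the buffer limit is not exceeded.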

Here are a couple of examples of how this data is applied in the virtual and physical world:

Virtual World: A major e-commerce platform may use data to provide personalized product recommendations to its customers. This involves collecting data about users' browsing and purchasing behavior, processing and transforming this data to derive insights about users' preferences, and then delivering these insights to the application in real time. The application can then use these insights to provide personalized recommendations, thereby improving user engagement and potentially boosting sales.
Physical World: An automated manufacturing plant may use data to optimize its operations. IoT sensors spread across the plant collect data about various parameters (like machine temperature, production speed, etc.). This data is processed at source (edge processing) or in the cloud, and insights derived from this data are used to control the plant's operations in real time. For instance, if a machine's temperature crosses a certain threshold, it could be automatically switched off to prevent damage.
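
The machine-temperature example amounts to a simple rule evaluated against each incoming reading. The sketch below assumes a hypothetical `Machine` object and a fixed limit; a real plant would drive an actual controller and would likely add safeguards such as hysteresis.

```python
class Machine:
    """Hypothetical stand-in for a controllable machine."""

    def __init__(self, name):
        self.name = name
        self.running = True

def apply_temperature_rule(machine, temperature_c, limit_c):
    """Switch the machine off when a reading exceeds the limit."""
    if machine.running and temperature_c > limit_c:
        machine.running = False
        return f"{machine.name}: switched off at {temperature_c} C"
    return None  # no action needed for this reading

press = Machine("press-1")
readings = (72.5, 88.0, 93.4, 95.1)
events = [apply_temperature_rule(press, t, limit_c=90.0) for t in readings]
actions = [e for e in events if e is not None]
print(actions)
```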

In both these examples, the process of data integration - collecting data from various sources, processing and transforming it, and delivering it to where it's needed - is fundamental to deriving actionable insights and value from the data.

A Single Platform for Distributed Process Orchestration: The Future of Data Integration

A unified platform that combines a self-serve execution environment, data transmission and messaging layer, central orchestration service, and program store could revolutionize the field of data integration and processing. This platform would essentially serve as a virtual mesh, providing the infrastructure needed to deploy, manage, and scale data integration processes across a wide range of environments. Here's how such a platform might work:

Self-Serve Execution Environment: This component of the platform would provide a runtime environment where users could deploy and run lambda functions - small, single-purpose anonymous blocks of code. It would support execution environments across diverse hardware, from microcomputers to large server clusters in the cloud.
Data Transmission and Messaging Layer: This layer would handle the secure and reliable transmission of data across the network. It would also support messaging protocols for inter-process communication, allowing different parts of the system to interact smoothly.
Central Orchestration Service: The orchestration service would manage the lifecycle of the lambdas, handling tasks like deployment, scaling, failover, and more. It would ensure that the right code runs at the right time and place, based on predefined rules and triggers.
Program Store: This would be a central repository where the lambdas are stored and versioned. It would also serve as a marketplace where users can share and reuse lambdas, fostering a culture of code reuse and collaboration.

This platform, due to its distributed nature, could run on a wide variety of devices, both in terms of size (microcomputers to server clusters) and type (edge devices, on-premise servers, cloud servers). Despite the distribution of the environment, the central orchestration service would provide a single point of control, allowing administrators to manage and oversee the entire system from a central dashboard.

Such a platform could radically streamline the process of data integration. By deploying lambdas at various points in the data journey, businesses could collect, process, and utilize data in real-time, without the need to transfer the data to a central location for processing.

For example, a lambda deployed on an IoT device could preprocess the data at source, another lambda running in the cloud could enrich and transform the data, and yet another lambda running on an application server could consume the data and use it to personalize the user experience. All this could be achieved with low latency, high efficiency, and fine-grained control, courtesy of the distributed nature of the platform.
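
The three-stage flow described above can be sketched as a chain of small functions. In this illustration all three stages run in one process and are connected with plain Python generators; on a distributed platform each stage would run in its own environment. The stage names, the device-to-site table, and the record fields are assumptions.

```python
def preprocess_at_edge(readings):
    """Edge stage: drop invalid readings before they leave the device."""
    for r in readings:
        if r.get("temp_c") is not None:
            yield {"device": r["device"], "temp_c": r["temp_c"]}

# Hypothetical reference data that the cloud stage would look up.
DEVICE_SITES = {"d1": "Warsaw", "d2": "Gdansk"}

def enrich_in_cloud(records):
    """Cloud stage: add site information and a converted unit."""
    for rec in records:
        yield {
            **rec,
            "site": DEVICE_SITES.get(rec["device"], "unknown"),
            "temp_f": round(rec["temp_c"] * 9 / 5 + 32, 1),
        }

def consume_in_app(records):
    """Application stage: turn enriched records into user-facing messages."""
    return [f"{r['site']}: {r['temp_f']} F" for r in records]

source = [
    {"device": "d1", "temp_c": 21.0},
    {"device": "d2", "temp_c": None},  # filtered out at the edge
    {"device": "d3", "temp_c": 30.0},  # unknown device, still processed
]
messages = consume_in_app(enrich_in_cloud(preprocess_at_edge(source)))
print(messages)
```

Because each stage consumes and produces a stream of records, stages can be moved closer to the source or to the destination without changing the others.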

This unified platform would provide the tools and infrastructure needed to build a virtual data integration mesh, with endpoints distributed across all stages of data processing. By bringing the code to the data, rather than the other way round, it would address many of the challenges associated with traditional data integration and unlock new possibilities for real-time, data-driven decision-making.

Introducing the Scramjet Cloud Platform: A Revolution in Data Integration

The Scramjet Cloud Platform represents a new era in end-to-end data integration. It offers a comprehensive solution that streamlines the data journey, from source to transformation, all the way to the final destination and to acting on the data in real time.

The first step in the data journey involves data collection at the source. With the Scramjet Cloud Platform, you can run a runtime supervisor called Scramjet Transform Hub right within your environment. This provides you with the capability to connect to your internal resources without exposing them to the Internet, enabling secure and efficient data collection directly at the source.

The collected data then undergoes transformation, a process that involves structuring, enriching, and formatting the data to make it actionable. The platform excels here as well, allowing you to execute long-lived lambda functions, known as Sequences, which can be used for these data transformation tasks. The Sequences run in a distributed manner, allowing for localized data processing and real-time transformations.

Once the data has been transformed, it moves on to the destination, where it is ultimately acted upon. The Scramjet Cloud Platform empowers you to facilitate this final stage seamlessly, offering a way to run actions in real-time based on the data received. It does this through Topics, streamed message queues that directly connect different Sequences, effectively enabling real-time data exchanges and actions.
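
To illustrate the general idea of a streamed message queue connecting a producer and a consumer, here is a toy in-memory version built on Python's `asyncio`. It is not the Scramjet API: the `Topic` class, its methods, and the `None` end-of-stream sentinel are assumptions made purely to show the pattern.

```python
import asyncio

class Topic:
    """Toy in-memory topic: every subscriber receives every message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self):
        """Register a new subscriber and return its private queue."""
        queue = asyncio.Queue()
        self._subscribers.append(queue)
        return queue

    async def publish(self, message):
        for queue in self._subscribers:
            await queue.put(message)

async def producer(topic, values):
    for value in values:
        await topic.publish(value)
    await topic.publish(None)  # sentinel: end of stream (an assumption)

async def consumer(queue, handle):
    """Act on each message as soon as it arrives."""
    while (message := await queue.get()) is not None:
        handle(message)

async def main():
    topic = Topic()
    received = []
    inbox = topic.subscribe()
    await asyncio.gather(
        producer(topic, [1, 2, 3]),
        consumer(inbox, lambda m: received.append(m * 10)),
    )
    return received

result = asyncio.run(main())
print(result)
```

The producer and the consumer never reference each other directly; they share only the topic, which is what lets independently deployed programs exchange data.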

Additionally, Scramjet's inner APIs are available to Sequences, providing control over the local node through the HubClient and over the entire platform through the SpaceClient. This gives users complete control over the data integration process, from end to end, across all environments connected to a single Space.

In conclusion, the Scramjet Cloud Platform is reshaping how we approach data integration. It goes beyond the traditional concept of 'dashboards' and 'cloud APIs' by providing a unified platform that integrates all necessary services across any number of environments through the cloud. This empowers businesses to handle their data more effectively, securely, and efficiently, making the most out of every data point, and enabling real-time actions based on the data. It's a revolutionary platform for a data-driven future.

Implementing the Data Mesh Paradigm with the Scramjet Cloud Platform

Now, let's assess how the Scramjet Cloud Platform aligns with the Data Mesh principles and with the broader context of data integration:

Domain Ownership: By enabling runtime supervisors to operate within the user's environment and directly interact with internal resources without exposing them to the Internet, Scramjet Cloud Platform upholds the principle of Domain Ownership. It allows decentralized data domains to manage their data products while still being part of a holistic data ecosystem.
Self-Serve Data Infrastructure: The platform's Topics, which connect different Sequences, facilitate the seamless flow of data across the mesh, embodying the Self-Serve aspect of the Data Mesh. This empowers teams to access and utilize the data they need in real-time, without relying on centralized data teams.
Federated Governance: The availability of inner APIs, through HubClient and SpaceClient, allows for a distributed control mechanism over the local node and the whole platform. This aligns with the principle of Federated Governance, distributing decision-making while maintaining an overarching coherence across the data ecosystem.
Data as a Product: By enabling Sequences to produce, transform, or consume data, and making this data accessible across all connected Hubs, Scramjet treats Data as a Product. This is in line with the Data Mesh principle that emphasizes the value of data as an asset and product in its own right.

Overall, the Scramjet Cloud Platform aligns well with the Data Mesh principles and addresses many of the challenges associated with traditional data integration approaches. It provides a platform that supports real-time data processing and integration, fostering decentralized data ownership, and offering comprehensive control over data workflows. By enabling data to be treated and managed as a product, it can potentially unlock more effective, efficient, and meaningful utilization of data.

Conclusion

In the ever-evolving world of data, the ability to effectively integrate and act upon data in real time is no longer a luxury, but a necessity. From the complexity of connecting distributed environments to the challenges of delivering data to its final destination, the data journey presents several hurdles. The need of the hour is a solution that not only handles data collection and transformation effectively but also empowers businesses to act on the data seamlessly.

Scramjet Cloud Platform is at the forefront of this revolution, offering a solution that aligns closely with the principles of the Data Mesh paradigm. By allowing companies to securely manage and process data within their own environment, enabling real-time transformations, and facilitating direct communication between different Sequences, Scramjet Cloud Platform provides a unified data integration solution.

The introduction of such an innovative platform redefines the data journey, turning it from a series of disconnected steps into a smooth, end-to-end process. With Scramjet Cloud Platform, businesses are no longer restricted to dashboards and cloud APIs. Instead, they gain a comprehensive control system that integrates all necessary services across multiple environments.

By providing a way for data to be treated and managed as a product, Scramjet unlocks more effective, efficient, and meaningful utilization of data. It represents not just the future of data integration, but a paradigm shift in how businesses view and use their data. The Scramjet Cloud Platform is truly the gateway to a data-driven future.

Project co-financed by the European Union from the European Regional Development Fund under the Knowledge Education Development Program. The project is carried out as part of the competition of the National Centre for Research and Development: Szybka Ścieżka.