Data processing in the IT industry is evolving quickly, even if many legacy projects and enterprise systems do not reflect it yet. The streaming revolution is at the door, and for good reason: data streams are everywhere and offer better speed, agility, and quality of data processing. More and more sources generate data continuously at high speed, and that data should be processed continuously as well.
IT architectures are also evolving, with concepts such as event-driven architecture (EDA), which offers better flexibility, resilience, and loose coupling of the software components involved.
Let’s walk together through the world of streaming data and see various examples of data available as streams, both in B2C and B2B solutions.
Underneath the browser, the operating system, local storage drives, memory cards, graphics and audio hardware... well, anything that is not the CPU itself, computers treat pretty much everything as streams. When you load an image from the internet, it is sent to you in fragments: small chunks of bytes that together form the whole JPEG file. When you copy a file, it is sent in fragments. When you open a website, the HTML file is loaded in small fragments. Wherever we are dealing with fragments, we see the first sign of dealing with a stream.
Fragments alone do not make a stream, though. To be able to say, "this is streaming," we need to add one important word: order. A stream is an ordered set of fragments delivered or processed in sequence. This means that when you load a webpage, the HTML file is sent to you in small pieces; every piece is read in order, the browser checks whether images, styles, or scripts need to be downloaded, and it can start rendering as soon as it has enough to go on.
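The idea of ordered fragments can be sketched in a few lines of Python. This is a simplified illustration, not how browsers are implemented; real transfers are chunked at the transport layer:

```python
def stream_chunks(data, chunk_size=16):
    """Yield ordered fragments of a payload, like a file arriving over the network."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]

html = b"<html><body><h1>Hello, streams!</h1></body></html>"
# Because the fragments arrive in order, simply concatenating them
# reconstructs the original file.
received = b"".join(stream_chunks(html))
assert received == html
```

The consumer never needs the whole file at once; it handles one ordered fragment at a time.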
Streaming has one last trick up its sleeve: backpressure. When the browser reads that HTML file, loads all the tags, and sees the scripts, it has the means to tell the server, "Hold off on sending me this file, I need to process the fragments I have already received." When it finally loads the scripts, it tells the server, "Hey, keep sending me the data, I am ready for more." This is the most important part of stream processing: thanks to it, your computer can process files much bigger than its memory. Just think how a video file weighing approximately 40 GB could play on your 4 GB phone without this!
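In pull-based APIs such as Python generators, backpressure falls out naturally: the producer only creates the next fragment when the consumer asks for it, so memory use stays constant no matter how large the data is. A minimal sketch, assuming a simulated 40 MB "file" rather than a real one on disk:

```python
def file_fragments(total_size, chunk=4096):
    """Lazily produce fragments; nothing is generated until the consumer pulls."""
    sent = 0
    while sent < total_size:
        size = min(chunk, total_size - sent)
        yield b"\x00" * size  # stand-in for real file data
        sent += size

# The consumer sets the pace: only one 4 KB chunk exists in memory at a time,
# so a "file" far larger than available RAM can be processed.
processed = sum(len(fragment) for fragment in file_fragments(40 * 1024 * 1024))
assert processed == 40 * 1024 * 1024
```

Push-based systems (TCP flow control, Node.js streams) achieve the same effect with explicit pause/resume signals rather than a pull loop.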
Streams are not new; they have been present in our everyday lives for decades. Data streams started as analog streams. Morse code transmitted over telegraph lines, analog radio waves carrying voice, and television broadcasts carrying video and sound are all examples of streams used even before the Internet and the World Wide Web were born. Because they were analog, they were usually encoded and decoded using hardware dedicated to this purpose.
Nowadays, these analog streams are sometimes still available but are also being replaced by digital data formats.
Video streams are interesting from the information technology point of view, as they are backed by massive, scalable infrastructure. This infrastructure must support hundreds of millions of users watching videos in HD or even 4K, streaming gigabytes of data per hour per individual user. Each user can pause streaming, resume it, change the playback quality, or even switch to another device, and the whole VoD platform must support that instantly.
Other interesting sources of video streams include surveillance cameras, smart devices, live game streams, e-sport events, citizen journalism with streams from mobile phones, and many more.
Video streams, like any other stream, can be processed in real time. You can perform live object detection on video frames using AI/ML toolkits, or convert, compress, optimize, and resample these streams, or extract specific frames from them.
Voice streams are partly like video streams; the difference is that video streams usually transmit image and sound data at the same time, while voice streams transmit encoded sound only: human voices, music, ambient sounds. You can find voice streams in consumer products and services such as music streaming, internet radio, and podcasts.
Voice streams can be processed too. Converting and compressing apply here, and the latest advances in AI/ML allow for such advanced use cases as live transcription, extracting voice from a noisy background, or, vice versa, generating a voice stream by transforming text into artificially generated speech closely resembling a human voice.
Game streams deserve their own category in modern times. The game industry is already enormous, and millions of users play games globally every day. These games, especially online and multiplayer ones, generate a massive amount of clickstream data about gameplay, human behavior, and chat messages between users who are often located in different countries. Gamers must react to game events in real time, which is especially evident in e-sports, where even a fraction of a second counts.
These harsh requirements also put enormous pressure on game platforms and infrastructure, though sometimes in different areas than for video streams. Video streams neither have to react to users' actions immediately nor synchronize state between multiple users in real time. Game platforms handling game streams must do both with low latency and high performance.
Text streams send textual content of heterogeneous types, not encoded into any binary format. These could be streams of social platform messages and statuses, chat messages, or series of messages from sports events (like play-by-play descriptions of what happens on the field).
Various NLP (Natural Language Processing) techniques can be applied here. Smart algorithms can extract meaningful data from these streams. In the case of chatbots, the text can be translated into intents representing the actions the user expects from the chatbot ("I want to order a pizza"). Texts can be translated into another language in real time, or even interpreted and used by autonomous software agents to perform meaningful actions (like buying stocks based on customer data, text messages from social media platforms, or search queries).
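As a toy illustration of intent extraction: real chatbots use trained NLP models, and the keyword table below is entirely invented for this sketch.

```python
# Hypothetical keyword table mapping message words to intent labels.
INTENT_KEYWORDS = {
    "order_pizza": {"order", "pizza"},
    "track_delivery": {"where", "delivery"},
}

def detect_intent(message):
    """Return the intent whose keywords overlap the message the most, or None."""
    words = set(message.lower().split())
    best, best_overlap = None, 0
    for intent, keywords in INTENT_KEYWORDS.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best, best_overlap = intent, overlap
    return best

assert detect_intent("I want to order a pizza") == "order_pizza"
```

Applied to a text stream, such a function would run once per incoming message, turning free-form text into structured actions.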
The streams covered above are what we normally think of when we hear "data streaming". We just get "some data" that we need to process, and later we get "some more". Yet in all the examples above we think of the fragments as parts of a larger sequence of bytes or characters. Event streams use the same underlying way of sending data, but they have one important aspect: each fragment stands for something that has happened, either in real life or virtually.
What could those fragments be? Well, maybe a push of a bell in a store, a click on a website, a thermometer registering a temperature, an error in a program, a log entry, a search query. A couple of examples follow below, but let's focus on the implications here: what does that change?
What event streams change is quite simple. The fragments in plain data streams have no boundaries: you can read, say, an arbitrary 10 bytes and you still have a valid fragment. In the case of event streams, you need to read exactly the bytes that belong to one specific event, and not the bytes that belong to the next one. So, for the press of a button, you need to read which button was pressed, in which store, and where that store is located, before you start reading the next event.
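A common way to mark event boundaries is length-prefixed framing: every event is preceded by the number of bytes it occupies, so a reader knows exactly where one event ends and the next begins. A minimal sketch in Python, with button-press payloads invented for illustration:

```python
import struct

def encode_event(payload):
    """Prefix the payload with a 4-byte big-endian length marking its boundary."""
    return struct.pack(">I", len(payload)) + payload

def decode_events(wire):
    """Split a byte stream back into the exact events it was built from."""
    events, offset = [], 0
    while offset < len(wire):
        (length,) = struct.unpack_from(">I", wire, offset)
        offset += 4
        events.append(wire[offset:offset + length])
        offset += length
    return events

wire = encode_event(b'{"button": "help", "store": 17}') + encode_event(b'{"button": "call"}')
assert decode_events(wire) == [b'{"button": "help", "store": 17}',
                               b'{"button": "call"}']
```

Production systems delegate this framing to a broker or protocol (e.g. Kafka messages), but the principle is the same: the stream carries discrete, self-contained events, not an undifferentiated run of bytes.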
Transaction streams contain various transactions such as credit card purchases, bank transfers, online advertisement purchases on real-time bidding platforms, or e-commerce purchases. What is interesting about these streams is that the event format is usually standardized and carries the same set of information for each event. This enables more accurate stream processing and analysis, as well as concrete actions such as accessing customer data or blocking a payment card when a fraudulent transaction is suspected.
These types of streams can also transmit information about transactions made on markets such as stock exchanges or forex.
The Internet and the World Wide Web generate a massive amount of information every day. This area can intersect with transaction streams, but not every data point from an e-commerce website or web portal is a transaction. Many of them reflect users' behavior and can be used to personalize offers, display ads, or gather data for analytics.
Let's imagine a modern web portal visited by 1,000,000 users each day. Each user makes on average 100 actions per day (page views, page scrolls, advertisement clicks, link hovers, etc.). In total, these users will generate 100,000,000 (one hundred million) data points daily, which is about 1,157 data points per second. Processing such streams to adjust content and personalize advertisements requires extremely performant and robust software.
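The arithmetic behind that estimate:

```python
users = 1_000_000
actions_per_user = 100

daily_points = users * actions_per_user       # 100,000,000 data points per day
per_second = daily_points / (24 * 60 * 60)    # spread evenly over 86,400 seconds

assert daily_points == 100_000_000
assert round(per_second) == 1157              # ~1,157 data points per second
```

Note that real traffic is never spread evenly over the day, so a platform must handle peak rates several times higher than this average.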
Log streams usually contain a series of records of different events on a given piece of hardware or software. They can represent the state of the running entity and actions that ended in success or error, and they are collected continuously while the hardware or software is running. Log streams can be used both to react to errors and to monitor the overall state of a server, virtual machine, operating system, enterprise platform, or robot. We can not only react to concrete events but even try to predict future machine failures using artificial intelligence.
Smart devices can produce a constant stream of various events from the sensors with which they analyze the external environment. These could be health measures from a smartwatch (heart rate), alarm events from factory monitoring devices, air quality readings, weather data from weather stations, or readings from drones monitoring specific areas.
What is important from the IoT platform's point of view is that it must manage data streams from multiple devices, frequently hundreds of thousands of them, all sending data in real time.
Logistics and transport generate a massive amount of data, including geolocation, vehicle state, direction, routes, distances, and speed. The vehicle world is enormous, so we can have streams of data from autonomous vehicles as well as from truck fleets, or even streams of data from airplanes flying between airports.
Transportation and logistics streams can be used to optimize routes and decrease fuel usage, improve transportation security, and monitor vehicle state and location.
There are high-tech solution ideas and products being invented right now that will offer data streams not widely used before.
We hope this article has shown you that the world of streams is very rich, and that the technological future will bring more and more interesting streaming data examples to the table.
Stream processing is one of the key features of Scramjet Transform Hub and the upcoming Scramjet Cloud Platform. At Scramjet, we believe that real-time data processing based on streams is the future for many solutions, both business and consumer-oriented.