The internet of things is everywhere, and the data is piling up. The new tiny, embeddable computers are energizing managers and engineers with the possibility of using all of this data to control everything from industrial plants to personal homes. The database administrators, though, aren’t as thrilled because they are expected to gather, store, and analyze this often unceasing firehose of bits.
Some programmers and DBAs are developing pipelines that can accept, analyze, and store the important bits. These so-called streaming databases are tools designed to handle both the unstoppable incoming flow as well as the endless queries from tools that want to make decisions based upon the data.
What is a streaming database?
Streaming databases are close cousins to other new classes of tools like time-series databases or log databases. All are designed to track a series of events and enable queries that can search and produce statistical profiles of blocks of time. The streaming databases can respond to queries for data and also statistics about the data, generate reports from these queries, and populate all of the dashboards that track what’s happening to allow the users to make smart decisions about the telemetry.
The tools are essentially pipelines that start out analyzing the incoming data flow and end up storing aggregated data in a database that’s easily queried. Some think of the streaming database as the entire system, and some imagine that the system is created by attaching the pipeline to a more traditional database. In both cases, the entire system is ready to answer questions.
Some good examples of important use cases include:
- Time-critical services like Uber or Lyft
- Continuous industrial processes like assembly lines
- Software for monitoring video or other sensors while searching for anomalies
- Scientific experiments that must be constantly analyzed
- Supply chains
How the data is split
The data inside is often split, at least philosophically, into two tiers. The raw input, often called “streams,” are immutable, append-only sequences of events. They’re meant to be a historical record of what happened and when.
The second tier is built from watching the streams and constructing summaries, often statistical, about the events. They might, for instance, count the number of times that an event happened each day over the last month or find the average value over each week in a year. The analysis is usually stored in tables that are often similar in structure and behavior to traditional relational databases. Indeed, it’s not uncommon for developers to connect a traditional database for these results.
Some streaming databases are designed to dramatically reduce the size of the data to save storage costs. They can, say, replace a value collected every second with an average computed over a day. Storing only the average can make long-term tracking economically feasible.
Streaming opens up some of the insides of a traditional database. Standard databases also track a stream of events, but they’re usually limited to changes in data records. The sequence of INSERTs, UPDATEs, and DELETEs are normally stored in a hidden journal or ledger inside. In most cases, the developers don’t have direct access to these streams. They’re only offered access to the tables that show the current values.
Streaming databases open up this flow and makes it simpler for developers to adjust how the new data is integrated. Developers can adjust how the streams from new data are turned into tabular summaries, ensuring that the right values are computed and saved while the unneeded information is ignored. The opportunity to tune this stage of the data pipeline allows streaming databases to handle markedly larger datasets.
How are traditional databases adapting?
The traditional databases are finding a role in streaming applications, but usually as a destination that lies downstream. The data flows through another tool that analyzes it and generates more concise values for more permanent storage in a traditional database. The legacy software and reporting tools can work easily with it.
For instance, Oracle Streams can be deployed either as a service or as on-premises installation. It will gather and transform data from a variety of sources and then deposit it with other services that can include their own databases. The message format is designed to be compatible with Apache Kafka, an open standard, allowing it to be integrated with other Kafka applications.
IBM’s product, also called Streams, emphasizes the analytical power of the pipeline integrated with some of the machine learning products. It is also compatible with Kafka and can deposit the results in a number of destinations, including IBM’s own data warehouses.
What about the upstarts?
New companies are tackling the challenge by either building entirely integrated tools or simply creating a stream-handling layer that works with existing databases. Those that integrate with established infrastructure can leverage all of the other compatible tools, while the entirely new versions have the advantage of building everything from scratch.
Many of the tools that integrate with existing databases are built on Apache’s Kafka, an open source message handing framework that’s often used to link together multiple software packages. Kafka itself handles the chores of buffering and delivering the messages containing the events. This buffering, incidentally, requires storing the stream of events, making Kafka a kind of very basic database that eventually delivers the data to another.
Equalum, for instance, offers a tool for transforming a data stream en route to a data warehouse or data lake using more traditional databases. It’s built upon an open source foundation of Apache’s Kafka and Spark and offers a simplified, visual coding framework that allows the data pathway to be defined as a flowchart.
Developers who enjoy working in SQL will appreciate ksqlDB, a tool for ingesting and storing data that uses a version of SQL to specify major tasks. “Use a familiar, lightweight syntax to pack a powerful punch,” the sales literature promises. “Capture, process, and serve queries using only SQL. No other languages or services are required.” The tool is tightly integrated with Kafka to ensure it is simpler to install in existing applications that use it.
Amazon calls its major offering Kinesis and offers special, preconfigured pathways for working with video feeds. It’s integrated with some of AWS’ AI tools like the Rekognition for video analysis and SageMaker for basic machine learning.
Others are starting to build open source projects like Debezium that can transform data from event streams managed by Kafka or other pipelines.
Is there anything streaming databases can’t do?
In many ways, streaming databases are just supersets of the traditional model. If you think of the standard INSERTs and DELETEs as events, then any of the standard applications can be handled by the streaming cousins. Much of the overhead, though, may be wasted if the application does not require constantly evolving analysis
Many streaming databases also offer fewer of the traditional functions or APIs because their first job is taming the endless flow of data. They may not offer the complex views or elaborate joins, at least not for the incoming data. If the results are stored in a more traditional relational database, it will have all of the features associated with it.