Forecast “future” with TDengine

Intro

TDengine™ is an open-source, cloud-native time series database (TSDB) optimized for the Internet of Things (IoT), Connected Cars, and Industrial IoT. It enables efficient, real-time ingestion, processing, and monitoring of petabytes of data per day, generated by billions of sensors and data collectors.

There are many users storing massive data generated by IoT devices, cars, or IT infrastructures into TDengine in real-time. They use standard SQL commands to query the data from TDengine. TDengine supports filter, group, windowed, join, and many aggregation functions to query data. It facilitates users to query data according to their purpose.

Many users also want to get more interested in existing data. For example, what would happen in light of the current trend? With the AI era coming, there are many new technologies or methods emerging in recent years, such as new machine learning, and deep learning algorithms. Is it possible to use machine learning and deep learning algorithms to forecast the future with the data stored in the TDengine cluster?

Fortunately, TDengine supports multiple popular programming language connectors like Java, Python, Go, Rust, C#, NodeJS, etc. Users can use their favorite language connector to access TDengine. The connectors provide interfaces that conform to the specification. That makes connectors easily integrated with other software or frameworks.

This article introduces how to forecast future data with existing data stored in the TDengine. We will mock some test data to reflect a real power system. And demonstrate how to use TDengine and a few Python libraries to forecast how the data will be in the next year.

Suppose the user is an electrical power system company. The user collects the power consumption number from the electricity station meter each day and stores it in a TDengine cluster. Now the user wants to forecast the power consumption will be and to purchase more devices to support it.

As the economy grows, electricity consumption increases by a certain percentage each year. In addition, considering the seasonal variations, electricity consumption varies accordingly. This city is located in the northern hemisphere, so many households use more electricity in the summer. We simulate data to reflect these assumptions

The source code is hosted at https://github.com/sangshuduo/td-forecasting.

Demonstration

Step 1: Deploy TDengine and run the TDengine server on your system

Please refer to the official document https://docs.tdengine.com/get-started/ for detailed instructions.

Step 2: Clone source code

1	git clone https://github.com/sangshuduo/td-forecasting

Step 3: Install the required Python packages

# if you are using Ubuntu 20.04 Linux
sudo apt install python3-pyqt5

# Optional, if PyQT5 fails
sudo apt-get install libxcb-xinerama0

python3 -m pip install -r requirements.txt

Please note minimal required Python version is 3.8.

Step 4: Mock some data

1	python3 mockdata.py

Step 5: Forecast next year’s data

1	python3 forecast.py

Result

If everything is okay, the figure will show up like below.

How does it work?

mockdata.py

def insert_rec_per_month(conn, db_name, table_name, year, month):
    increment = (year - 2014) * 1.1
    base = int(10 * increment)
    if month < 10 and month > 5:
        factor = 10
    else:
        factor = 8
    for day in range(1, monthrange(year, month)[1] + 1):
        num = base * randint(5, factor) + randint(0, factor)
        sql = f"INSERT INTO {db_name}.{table_name} VALUES ('{year}-{month}-{day} 00:00:00.000', {num})"
        try:
            conn.execute(sql)
        except Exception as e:
            print(f"command: {sql}")
            print(e)

The core function of mockdata.py is mocking random data with a few adjustments to fulfill the assumption.

forecast.py

The forecast.py implements the forecasting function.

Step1: Import necessary modules for prediction

import argparse

import lightgbm as lgb
import matplotlib.pyplot as plt
import mlforecast
import pandas as pd
from mlforecast.target_transforms import Differences
from sklearn.linear_model import LinearRegression
from sqlalchemy import create_engine, text

Here explain what they are and what they are for.

Lightgbm is a Python module that supports the LightGBM algorithm, a gradient-boosting framework that uses tree-based learning algorithms.
Matplotlib is one of the most popular Python modules for visualization.
Mlforecast is a framework to perform time series forecasting using machine learning models.
Pandas is the most popular module to support data manipulation.
Sklearn is a module that supports popular data science/machine learning algorithms.
SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.

Step 2: Connect to TDengine and query data

engine = create_engine("taos://root:taosdata@localhost:6030/power")
conn = engine.connect()
print("Connected to the TDengine ...")
df = pd.read_sql(
    text("select _wstart as ds, avg(num) as y from power.meters interval(1w)"), conn
)
conn.close()

TDengine Python connector complies with Python Database API Specification v2.0 (PEP 249). DBAPI is shorthand for the phrase “Python Database API Specification”. This is a widely used specification within Python to define common usage patterns for all database connection packages. The DBAPI is a “low level” API which is typically the lowest level system used in a Python application to talk to a database. SQLAlchemy’s dialect system is constructed around the operation of the DBAPI, providing individual dialect classes which service a specific DBAPI on top of a specific database engine.

We can use SQLAlchemy to connect the TDengine cluster and use Pandas to query data to data frame format.

Here we suppose the user cares about weekly average power consumption instead of daily consumption to reduce the unusual value. We can use AVG() function and INTERVAL(1w) clause command to query the data from the TDengine cluster.

We will manipulate the data in data frame format later.

Step 3: Forecasting

df.insert(0, column="unique_id", value="unique_id")

print("Forecasting ...")
forecast = mlforecast.MLForecast(
    models=[LinearRegression(), lgb.LGBMRegressor()],
    freq="W",
    lags=[52],
    target_transforms=[Differences([52])],
)
forecast.fit(df)

predicts = forecast.predict(52)

pd.concat([df, predicts]).set_index("ds").plot(figsize=(12, 8))

By the feat of the mlforecast module, we can do the forecasting with few parameters. Here we use linear regression and LightGBM to show their results in the same graph to visualize what the different algorithms made.

Step 4: Show up or dump to file

if args.dump:
    plt.savefig(args.dump)
else:
    plt.show()

The Python code provides an argument –dump to allow users to decide to dump the result into a picture for post-processing or to show up the result on the screen immediately.

The above steps are verified on Ubuntu 20.04, Ubuntu 22.04, Windows 10, and macOS.

Conclusion

And that’s it! We now have a very simple program to demonstrate how to use TDengine to predict power meter numbers using power system history data stored in TDengine.