Implementation of Data Ingestion Pipeline for CoBotAGV readings in edge-cloud environment

Piotr Grzesik, Paweł Benecki, Daniel Kostrzewa, Dariusz Mrozek, Bohdan Shubyn

Agenda

Fetching data from OPC UA Server
Location of OPC UA Client
High-level overview of current data storage architecture
Details about specific parts of the cloud architecture
Implementation using the "Infrastructure as Code" approach with Terraform

Fetching data from OPC UA

Subscription-based approach

Pros:

No unnecessary calls to OPC UA Server
Instant, granular updates exactly when a variable changes
Changes are persisted in a local database, so they survive failures

Cons:

More complex architecture: single-variable updates must be aggregated to project the state of the system
The state of the system must be maintained locally as part of the client

Fetching data from OPC UA

Subscription-based approach with incremental updates

Pros:

No unnecessary calls to OPC UA Server
Instant, granular updates exactly when a variable changes
Simple implementation of OPC UA Client

Cons:

More complex processing architecture on the backend
Needs a failover scenario

Fetching data from OPC UA

Periodic fetch approach

Pros:

Simpler architecture
No need to maintain local state as part of the OPC UA Client

Cons:

Unnecessary calls to OPC UA Server when no changes are observed
Needs a failover scenario

Placement of OPC UA Client

Co-located with AGV and OPC UA Server

Pros:

Lower latency between OPC UA Server/Client
Better resilience to outages in Internet connection

Cons:

More complex architecture at the edge
More challenging maintenance of the OPC UA Client at the edge
Need to create multiple clients if there are AGVs that are not co-located

Placement of OPC UA Client

Located in Cloud environment

Pros:

Simpler architecture
Can support data ingestion from multiple OPC UA Servers

Cons:

Higher communication latency between OPC UA Server/Client
Less resilience to Internet connection outages

High-level diagram - cloud part

Azure Data Lake Storage Gen2

A centralized, single-storage platform for data ingestion, processing, and visualisation. It is massively scalable: according to the documentation, it can handle exabytes of data with throughput measured in gigabits per second. It supports a hierarchical namespace, which allows for efficient data access. It can be integrated with multiple analytical frameworks and offers Hadoop-compatible access.
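
As a rough illustration of how such a storage account could be declared in Terraform, the sketch below enables the hierarchical namespace with is_hns_enabled. The resource labels match the references used in the Event Hub configuration later in this deck. The variable, container name, tier, and replication settings are assumptions, and the argument names follow the 3.x line of the azurerm provider.

```hcl
# Sketch only: the variable, tier, replication, and container name are assumed values.
variable "storage_account_name" {
  type        = string
  description = "Globally unique storage account name (hypothetical variable)"
}

resource "azurerm_storage_account" "storage_account" {
  name                     = var.storage_account_name
  resource_group_name      = azurerm_resource_group.rg.name     # assumed to be defined elsewhere
  location                 = azurerm_resource_group.rg.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  account_kind             = "StorageV2"

  # The hierarchical namespace turns a general-purpose v2 account
  # into a Data Lake Storage Gen2 account.
  is_hns_enabled = true
}

resource "azurerm_storage_container" "storage_container" {
  name                  = "agv-readings" # hypothetical container name
  storage_account_name  = azurerm_storage_account.storage_account.name
  container_access_type = "private"
}
```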

Azure IoT Hub

A service that provides secure and reliable communication between the cloud and IoT devices. It supports management of individual devices, authentication, and authorization, and it integrates with services such as Azure Stream Analytics and Azure Data Lake Storage. Additionally, it can be extended with Azure IoT Edge to deploy services directly on edge devices.
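
For orientation, a minimal IoT Hub declaration in Terraform might look like the sketch below. The variable name and the SKU are assumptions. Routing of device messages to downstream services is deliberately left out, since the deck does not specify it.

```hcl
# Sketch only: the variable name and SKU are assumed, not taken from the project.
variable "iothub_name" {
  type        = string
  description = "Globally unique IoT Hub name (hypothetical variable)"
}

resource "azurerm_iothub" "iothub" {
  name                = var.iothub_name
  resource_group_name = azurerm_resource_group.rg.name     # assumed to be defined elsewhere
  location            = azurerm_resource_group.rg.location

  sku {
    name     = "S1" # assumed tier
    capacity = 1
  }
}
```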

Azure Stream Analytics

A fully managed stream analytics engine designed to process large volumes of streaming data. It can be used to enrich the data, preprocess it, or discard invalid events. It can also be integrated with Azure Functions or Azure Machine Learning to enable, for example, anomaly detection on incoming data streams. Azure Stream Analytics jobs can also be executed on edge devices.
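
To make the "discard invalid events" use case concrete, here is a hedged sketch of a Stream Analytics job defined in Terraform. The job name, the input and output aliases (agv-input, agv-output), and the agv_id field are all hypothetical. The input and output themselves would have to be declared as separate resources.

```hcl
# Sketch only: job name, aliases, and the agv_id field are hypothetical.
resource "azurerm_stream_analytics_job" "asa_job" {
  name                = "agv-readings-job"
  resource_group_name = azurerm_resource_group.rg.name     # assumed to be defined elsewhere
  location            = azurerm_resource_group.rg.location
  compatibility_level = "1.2"
  streaming_units     = 3

  # Forward only events that carry an AGV identifier; everything else is dropped.
  transformation_query = <<-QUERY
    SELECT *
    INTO [agv-output]
    FROM [agv-input]
    WHERE agv_id IS NOT NULL
  QUERY
}
```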

Azure Event Hub

Azure Event Hub is a generic event ingestion service. It supports multiple sources and outputs and natively integrates with services such as Azure Functions. It supports three protocols for consumers and producers: AMQP, Kafka, and HTTPS. It also offers the Capture feature, which saves data to Azure Data Lake Storage for long-term retention.

Azure Time Series Insights

A set of services for ingesting, storing, processing, organizing, and visualizing time series data. It is optimized for data coming from IoT devices. It supports warm and cold storage for interactive and historical analysis, respectively. It can also be integrated with other services, such as Azure Machine Learning or Azure Databricks, for further analysis of stored data. Unfortunately, this service will no longer be available in 2025.

Infrastructure as Code

Infrastructure as Code (IaC) is the practice of defining and managing cloud infrastructure through configuration files rather than through manual interaction with a GUI or CLI. It makes it possible to define and deploy repeatable cloud infrastructure, while at the same time providing a definition and overview of all your services.

Terraform

Terraform is an IaC tool for managing infrastructure across multiple cloud providers, such as Microsoft Azure, Amazon Web Services, and Google Cloud Platform. It uses a human-readable language to define resources and records state to track changes across deployments. Its configuration can be committed to a version control system to provide an audit trail of changes to your infrastructure.
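
A Terraform project targeting Azure typically starts with a settings block like the sketch below, which pins the provider and tells Terraform where to record its state. The version constraint and all backend names are assumptions for illustration. The backend block can be omitted, in which case state is kept in a local file.

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0" # assumed version constraint
    }
  }

  # Optional remote state stored in an Azure Storage container.
  # All four values below are hypothetical.
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"
    storage_account_name = "tfstatecobotagv"
    container_name       = "tfstate"
    key                  = "ingestion.tfstate"
  }
}

# The azurerm provider requires a features block, even if it is empty.
provider "azurerm" {
  features {}
}
```

With this in place, the usual workflow is terraform init to download the provider, terraform plan to preview changes, and terraform apply to deploy them.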

Terraform config

# Event Hub that receives the readings and archives them to storage via Capture
resource "azurerm_eventhub" "eventhub" {
  name                = var.eventhub_name
  namespace_name      = azurerm_eventhub_namespace.eventhub_namespace.name
  resource_group_name = azurerm_resource_group.rg.name
  partition_count     = 1
  message_retention   = 1 # days

  # Capture writes incoming events to the storage container as Avro files
  capture_description {
    enabled             = true
    encoding            = "Avro"
    interval_in_seconds = 300 # flush a capture file at least every 5 minutes

    destination {
      name                = "EventHubArchive.AzureBlockBlob"
      blob_container_name = azurerm_storage_container.storage_container.name
      storage_account_id  = azurerm_storage_account.storage_account.id
      # The format must include every placeholder, including {Second}
      archive_name_format = "{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}"
    }
  }
}
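
The configuration above references a variable, a resource group, and an Event Hub namespace that are not shown on the slide. A possible sketch of those supporting definitions follows. The names and the region are assumptions. The storage account and container are assumed to be declared elsewhere in the same configuration. The argument names follow the 3.x line of the azurerm provider.

```hcl
variable "eventhub_name" {
  type        = string
  description = "Name of the Event Hub that receives AGV readings"
}

resource "azurerm_resource_group" "rg" {
  name     = "cobotagv-ingestion-rg" # hypothetical name
  location = "westeurope"            # assumed region
}

resource "azurerm_eventhub_namespace" "eventhub_namespace" {
  name                = "cobotagv-ehns" # hypothetical, must be globally unique
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  sku                 = "Standard" # Capture is not available in the Basic tier
  capacity            = 1
}
```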

Potential services for data analytics

Azure Databricks - processing data with Apache Spark
Azure Data Lake Analytics - parallel data transformation and processing in a serverless manner with U-SQL
Azure Synapse Analytics - combined analytics on data from data lakes and data warehouses
Azure HDInsight - platform for provisioning Hadoop, Spark, and Storm clusters
Azure Data Explorer - managed data analytics service for real-time analysis on large volumes of streaming data
