Implementation of data lake for CoBotAGV readings in the cloud environment

Piotr Grzesik, Paweł Benecki, Daniel Kostrzewa, Dariusz Mrozek, Bohdan Shubyn

Agenda

  1. High-level overview of current data storage architecture
  2. Details about specific parts of the cloud architecture
  3. Implementation with "Infrastructure as code" approach with Terraform
  4. Potential services for analytics

High-level diagram

Azure Data Lake Storage Gen2

A centralized, single-storage platform for data ingestion, processing, and visualisation. It is massively scalable: according to the documentation, it can handle exabytes of data with throughput measured in gigabits per second. It supports hierarchical namespaces, which allow for efficient data access. It can be integrated with multiple analytical frameworks and offers Hadoop-compatible access.
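As an illustration of how the hierarchical namespace is switched on, the following is a minimal Terraform sketch, not the project's actual configuration. It assumes the azurerm provider (3.x argument names) and a resource group `azurerm_resource_group.rg` declared elsewhere in the configuration. The variable `storage_account_name` and the container name `agv-readings` are hypothetical.

```hcl
# Hypothetical variable; storage account names must be globally unique.
variable "storage_account_name" {
  type        = string
  description = "Name of the storage account backing the data lake"
}

# Storage account with the hierarchical namespace enabled,
# which is what makes it an Azure Data Lake Storage Gen2 account.
resource "azurerm_storage_account" "storage_account" {
  name                     = var.storage_account_name
  resource_group_name      = azurerm_resource_group.rg.name
  location                 = azurerm_resource_group.rg.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  account_kind             = "StorageV2"
  is_hns_enabled           = true
}

# Container that receives the stored readings (name is an assumption).
resource "azurerm_storage_container" "storage_container" {
  name                  = "agv-readings"
  storage_account_name  = azurerm_storage_account.storage_account.name
  container_access_type = "private"
}
```

Note that `is_hns_enabled` generally has to be set when the account is created; changing it later in Terraform forces the account to be recreated.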

Azure IoT Hub

A service that allows for secure and reliable communication between the cloud and IoT devices. It supports management of individual devices, authentication and authorization, and integrates with services such as Azure Stream Analytics or Azure Data Lake Storage. Additionally, it can be extended with Azure IoT Edge to deploy services directly on edge devices.
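A minimal Terraform sketch of an IoT Hub instance could look as follows. It is an illustration only: the variable `iothub_name`, the resource label `iothub`, and the `S1` tier with a single unit are assumptions, and `azurerm_resource_group.rg` is assumed to be declared elsewhere in the configuration.

```hcl
# Hypothetical variable for the hub name.
variable "iothub_name" {
  type        = string
  description = "Name of the IoT Hub that the AGVs connect to"
}

resource "azurerm_iothub" "iothub" {
  name                = var.iothub_name
  resource_group_name = azurerm_resource_group.rg.name
  location            = azurerm_resource_group.rg.location

  # Pricing tier and number of units; S1 with one unit is an assumed starting point.
  sku {
    name     = "S1"
    capacity = 1
  }
}
```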

Azure Stream Analytics

A fully managed stream analytics engine, designed to process large volumes of streaming data. It can be used to enrich the data, preprocess it, or discard invalid events. It can also be integrated with Azure Functions or Azure Machine Learning to enable, for example, anomaly detection on incoming data streams. Azure Stream Analytics jobs can also be executed on edge devices.
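To make the "discard invalid events" use case concrete, here is a hedged Terraform sketch of a Stream Analytics job with a simple filtering query. The job name variable, the input alias `iothub-input`, the output alias `datalake-output`, and the `deviceId` field are all hypothetical. The input and output themselves would have to be defined as separate resources, and `azurerm_resource_group.rg` is assumed to exist elsewhere in the configuration.

```hcl
# Hypothetical variable for the job name.
variable "stream_analytics_job_name" {
  type        = string
  description = "Name of the Stream Analytics job"
}

resource "azurerm_stream_analytics_job" "asa_job" {
  name                = var.stream_analytics_job_name
  resource_group_name = azurerm_resource_group.rg.name
  location            = azurerm_resource_group.rg.location
  streaming_units     = 3
  compatibility_level = "1.2"

  # Forward only events that carry a device identifier; everything else is dropped.
  # The aliases and the field name are assumptions made for illustration.
  transformation_query = <<QUERY
    SELECT *
    INTO [datalake-output]
    FROM [iothub-input]
    WHERE deviceId IS NOT NULL
QUERY
}
```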

Azure Event Hub

Azure Event Hub is a generic event ingestion service. It supports multiple sources and outputs and natively integrates with services such as Azure Functions. Producers and consumers can use three protocols: AMQP, Kafka, and HTTPS. Its Capture feature can also save data to Azure Data Lake Storage for long-term retention.

Azure Time Series Insights

A set of services for ingesting, storing, processing, organizing, and visualizing time series data. It is optimized for data coming from IoT devices. It supports warm and cold storage for both interactive and historical analysis. It can also be integrated with other services such as Azure Machine Learning or Azure Databricks for further analysis of stored data. Unfortunately, this service will no longer be available in 2025.

Infrastructure as Code

Infrastructure as Code (IaC) is the practice of defining and managing cloud infrastructure through configuration files rather than through manual interaction with a GUI or CLI. It makes it possible to define and deploy repeatable cloud infrastructure, while at the same time providing a definition and overview of all your services.

Terraform

Terraform is an IaC tool for managing infrastructure across multiple cloud providers such as Microsoft Azure, Amazon Web Services, or Google Cloud Platform. It uses a human-readable language to define resources and records state to track changes across deployments. Its configuration can be committed to a version control system to provide an audit trail of changes to your infrastructure.
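As a minimal sketch of what such a configuration starts with, the snippet below declares the Azure provider and a resource group. The version constraint, the variable names, and the default region are assumptions made for illustration, not values taken from the project.

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0" # assumed version constraint
    }
  }
}

# The azurerm provider requires a features block, even if it is empty.
provider "azurerm" {
  features {}
}

# Hypothetical input variables.
variable "resource_group_name" {
  type        = string
  description = "Name of the resource group holding all data lake resources"
}

variable "location" {
  type        = string
  description = "Azure region for the deployment"
  default     = "westeurope" # assumed region
}

resource "azurerm_resource_group" "rg" {
  name     = var.resource_group_name
  location = var.location
}
```

With a configuration like this in place, the usual workflow is `terraform init` to download the provider, `terraform plan` to preview the changes, and `terraform apply` to create the resources.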

Terraform config

# Event Hub that receives the readings and archives them to the data lake.
resource "azurerm_eventhub" "eventhub" {
  name                = var.eventhub_name
  namespace_name      = azurerm_eventhub_namespace.eventhub_namespace.name
  resource_group_name = azurerm_resource_group.rg.name
  partition_count     = 1
  message_retention   = 1 # days

  # Capture periodically writes the incoming events to storage.
  capture_description {
    enabled             = true
    encoding            = "Avro"
    interval_in_seconds = 300 # one capture window every 5 minutes

    destination {
      name                = "EventHubArchive.AzureBlockBlob"
      blob_container_name = azurerm_storage_container.storage_container.name
      storage_account_id  = azurerm_storage_account.storage_account.id
      # Path of each captured file inside the container.
      archive_name_format = "{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}"
    }
  }
}
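The configuration above references a variable and a namespace that are not shown on the slide. A minimal sketch of what they might look like follows. The variable `eventhub_namespace_name` and the capacity value are assumptions, and `azurerm_resource_group.rg` is assumed to be declared elsewhere in the configuration.

```hcl
variable "eventhub_name" {
  type        = string
  description = "Name of the Event Hub that receives the readings"
}

# Hypothetical variable for the namespace name.
variable "eventhub_namespace_name" {
  type        = string
  description = "Name of the Event Hubs namespace"
}

resource "azurerm_eventhub_namespace" "eventhub_namespace" {
  name                = var.eventhub_namespace_name
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  sku                 = "Standard" # Capture is not available in the Basic tier
  capacity            = 1
}
```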

Potential services for data analytics

  • Azure Databricks - processing data with Apache Spark
  • Azure Data Lake Analytics - parallel data transformation and processing in a serverless manner with U-SQL
  • Azure Synapse Analytics - combination of analytics on data from data lakes and data warehouses
  • Azure HDInsight - platform to provision Hadoop, Spark, and Storm clusters
  • Azure Data Explorer - managed data analytics service for real-time analysis on large volumes of streaming data 
