Distributed data transformation streaming cluster
---
storm_clusters:
  stormcluster-01:
    nimbus-stormcluster.gobalto.com:
      inet: 10.110.20.7
      roles: [nimbus, logviewer, ui]
    nimbus2-stormcluster.gobalto.com:
      inet: 10.110.20.146
      roles: [nimbus, logviewer, ui]
    supervisor1-stormcluster.gobalto.com:
      inet: 10.110.20.10
      roles: [supervisor, logviewer]
    supervisor2-stormcluster.gobalto.com:
      inet: 10.110.20.11
      roles: [supervisor, logviewer]
    supervisor3-stormcluster.gobalto.com:
      inet: 10.110.20.9
      roles: [supervisor, logviewer]
    zookeeper-stormcluster.gobalto.com:
      inet: 10.110.20.8
      roles: [zookeeper, proxy]
Storm cluster group_vars
Note: This is the current configuration for the deploy playbooks and future cluster playbooks. It should be refactored to pull AWS tags for the configuration.
.
├── sites-available
│ ├── default
│ ├── nimbus1-stormcluster.gobalto.com
│ ├── nimbus2-stormcluster.gobalto.com
│ ├── storm-cluster.gobalto.com
│ ├── supervisor1-stormcluster.gobalto.com
│ ├── supervisor2-stormcluster.gobalto.com
│ └── supervisor3-stormcluster.gobalto.com
└── sites-enabled
├── nimbus1-stormcluster.gobalto.com -> /etc/nginx/sites-available/nimbus1-stormcluster.gobalto.com
├── nimbus2-stormcluster.gobalto.com -> /etc/nginx/sites-available/nimbus2-stormcluster.gobalto.com
├── storm-cluster.gobalto.com -> /etc/nginx/sites-available/storm-cluster.gobalto.com
├── supervisor1-stormcluster.gobalto.com -> /etc/nginx/sites-available/supervisor1-stormcluster.gobalto.com
├── supervisor2-stormcluster.gobalto.com -> /etc/nginx/sites-available/supervisor2-stormcluster.gobalto.com
└── supervisor3-stormcluster.gobalto.com -> /etc/nginx/sites-available/supervisor3-stormcluster.gobalto.com
An NGINX reverse proxy is required for all web access to the systems. Additionally, /etc/hosts must be configured on every system so that all the components can communicate with each other.
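For the inventory above, every node's /etc/hosts needs entries like the following (this is exactly what the hosts.j2 template renders):

```
10.110.20.7    nimbus-stormcluster.gobalto.com
10.110.20.146  nimbus2-stormcluster.gobalto.com
10.110.20.10   supervisor1-stormcluster.gobalto.com
10.110.20.11   supervisor2-stormcluster.gobalto.com
10.110.20.9    supervisor3-stormcluster.gobalto.com
10.110.20.8    zookeeper-stormcluster.gobalto.com
```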
# LogViewer example
server {
    listen 8000;
    server_name supervisor1-stormcluster.gobalto.com;

    location / {
        proxy_pass http://10.110.20.10:8000;
    }
}
These are two examples of the proxy configuration. Each LogViewer and UI endpoint must have its own configuration.
# UI example
server {
    listen 80;
    server_name storm-cluster.gobalto.com nimbus-stormcluster.gobalto.com;

    location / {
        proxy_pass http://10.110.20.7:8080;
    }
}
storm.zookeeper.servers:
  - "zookeeper-stormcluster.gobalto.com"
nimbus.seeds: ["nimbus-stormcluster.gobalto.com", "nimbus2-stormcluster.gobalto.com"]
nimbus.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"
ui.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
supervisor.childopts: "-Djava.net.preferIPv4Stack=true"
worker.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
storm.local.dir: "/app/storm"
Apache Storm 1.x (storm.yaml)
The cluster needs a list of all the ZooKeeper and Nimbus servers.
A topology is a blueprint of the actual transformation. It is created in software code and packaged as a Java JAR, which is what is submitted to the cluster.
The Sirius topology transforms data from Activate (the source) into a star schema (Star).
Therefore you need to create a database schema (tables) for the audit logs on the source, and for the star schema. This process is called migration, and there are two configs.
The Spouts and Bolts use tenantConfig.
{
  "dev": {
    "username": "storm_user",
    "password": null,
    "database": "star_covance",
    "host": "localhost",
    "dialect": "postgres"
  },
  "test": {
    "username": "storm_user",
    "password": null,
    "database": "star_covance",
    "host": "localhost",
    "dialect": "postgres"
  }
}
Sequelize (config.json)
There are two migration operations: source and star. The environment variable NODE_ENV determines which configuration to use.
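A minimal sketch of the selection mechanism, using python3 as a stand-in for sequelize-cli's internal config lookup (the /tmp path is illustrative):

```shell
# Write a trimmed copy of config.json (values from the example above)
cat > /tmp/sequelize-config.json <<'EOF'
{
  "dev":  {"username": "storm_user", "database": "star_covance", "host": "localhost", "dialect": "postgres"},
  "test": {"username": "storm_user", "database": "star_covance", "host": "localhost", "dialect": "postgres"}
}
EOF

# NODE_ENV names which top-level block is used
export NODE_ENV=dev
SELECTED_DB=$(python3 -c 'import json, os
cfg = json.load(open("/tmp/sequelize-config.json"))
print(cfg[os.environ["NODE_ENV"]]["database"])')
echo "$SELECTED_DB"   # star_covance
```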
{
  "covance": {
    "sourceDBInfo": {
      "database": "storm_togo_dev",
      "host": "REDACTED",
      "port": 5432,
      "username": "storm_togo_test",
      "password": "REDACTED"
    },
    "starDBInfo": {
      "database": "storm_dev",
      "host": "REDACTED",
      "port": 5432,
      "username": "storm_dev_user",
      "password": "REDACTED"
    }
  }
}
Sirius (tenantconfig.json)
The topology job running in the Apache Storm cluster has an embedded configuration.
This is configured per tenant (customer).
Note that NODE_ENV cannot be supported in a cluster because there is no plausible way to configure the env var for each JVM; this is controlled by the Apache Storm project developers.
Configure, Build, Stop, Migrate, Deploy (Submit Topology), Status (load_sql)
DOCKER_REPO="gobaltoops/sirius"
docker build -t=${DOCKER_REPO}:base .
docker push ${DOCKER_REPO}:base
DOCKER_REPO="gobaltoops/sirius"
docker build -t=${DOCKER_REPO}:maven --no-cache=true .
docker push ${DOCKER_REPO}:maven
DOCKER_REPO="gobaltoops/sirius"
docker build -t=${DOCKER_REPO}:storm --no-cache=true .
docker push ${DOCKER_REPO}:storm
There are three base systems that are required.
Unlike a self-contained web application, Apache Storm only accepts a JAR.
This container is therefore only used for build-configuration-deploy: it builds, configures, and deploys a topology to a cluster, and has a self-contained (segregated) Apache Storm, Java JDK 8, Maven build system, and Node.js environment.
FROM gobaltoops/sirius:maven

ENV APP_ROOT /gobalto
ENV NODE_ROOT ${APP_ROOT}/src/main/resources/resources/
ENV TEST_ROOT ${APP_ROOT}/test/

WORKDIR ${APP_ROOT}

RUN mkdir -p ${TEST_ROOT} && \
    mkdir -p ${NODE_ROOT} && \
    mkdir -p ${APP_ROOT}/output

#### UNIT TESTS
COPY test/package.json ${TEST_ROOT}
RUN npm -g install mocha istanbul && \
    cd ${TEST_ROOT} && \
    npm install

#### NODE LIBRARY SUPPORT
COPY src/main/resources/resources/package.json ${NODE_ROOT}
RUN cd ${NODE_ROOT} && \
    npm -g install sequelize@3.23 sequelize-cli pg pg-hstore && \
    npm install

VOLUME ${APP_ROOT}/logs/
VOLUME ${APP_ROOT}/output/

#### COPY REST OF CODE
COPY . ${APP_ROOT}/

#### SIRIUS CONFIG SCRIPT SUPPORT
RUN apt-get update && \
    apt-get install -y libpq-dev python3-pip
RUN pip3 install --upgrade pip setuptools wheel && \
    pip3 install psycopg2

#### Needed for Psycopg2 output from UTF-8 Postgres database
ENV PYTHONIOENCODING=utf-8

#### LINK SIRIUS CONFIG SCRIPT
ENV COMMON_SCRIPTS ${APP_ROOT}/ci/docker/configs/common/
RUN ln -sf ${COMMON_SCRIPTS}/sirius_cfg.py /usr/local/bin/sirius

#### KEEP ALIVE
CMD while :; do sleep 1; done
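Because the CMD only keeps the container alive, all real work happens via exec. A hypothetical manual run of the resulting image (tag, mount path, and hash are placeholders; in practice the sirius_container Ansible role starts it):

```
docker run -d --name sirius \
  -e APP_ROOT=/gobalto -e CUSTOMER=covance -e NODE_ENV=dev \
  -v <staging>/templates:/templates \
  gobaltoops/sirius:<git_hash_short>
docker exec sirius sirius build <git_hash_short>
```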
#!/bin/sh
# Skip Build if no change and Redeploy is 0
#[ $SKIP_BUILD -eq 1 -a $REDEPLOY -eq 1 ] || exit 0
# BUILD AND SHIP
/usr/bin/docker login -u ${DOCKER_HUB_USER} -p ${DOCKER_HUB_PSSWD} -e ${DOCKER_HUB_EMAIL}
/usr/bin/docker build -t ${DOCKER_REPO}:${GIT_HASH_SHORT} .
/usr/bin/docker push ${DOCKER_REPO}:${GIT_HASH_SHORT}
# DEPLOY PROCESS
CUSTOMER=%customer%
GIT_HASH_SHORT=$(/usr/bin/git log --abbrev-commit --abbrev=8 --max-count=1 --format=%h)
ssh storm-dev-01 sirius_deploy ${CUSTOMER} dev ${GIT_HASH_SHORT}
There are currently two steps in the build process.
#!/bin/bash
CUST=$1
ENV=$2
HASH=$3
SCRIPT=$(echo $0 | awk -F/ '{ print $NF }')

if [ $# -lt 3 ]; then
  echo 1>&2 "$0: not enough arguments, usage is '$SCRIPT CUST ENV HASH'"
  exit 2
elif [ $# -gt 3 ]; then
  echo 1>&2 "$0: too many arguments, usage is '$SCRIPT CUST ENV HASH'"
  exit 2
fi

# The three arguments are available as "$1", "$2", "$3"
ansible-playbook -v -e "git_hash_short=${HASH} customer=${CUST} env=${ENV}" \
  /etc/ansible/playbooks/sirius_deploy.yml
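The argument check can be exercised on its own; a minimal sketch (the function name and example hash are illustrative, the bounds logic matches the wrapper):

```shell
# Same bounds check as the wrapper above, minus the ansible-playbook call
usage_check() {
  if [ $# -lt 3 ]; then
    echo "not enough arguments, usage is 'usage_check CUST ENV HASH'" >&2
    return 2
  elif [ $# -gt 3 ]; then
    echo "too many arguments, usage is 'usage_check CUST ENV HASH'" >&2
    return 2
  fi
  echo "deploying customer=$1 env=$2 hash=$3"
}

usage_check covance dev a1b2c3d4                   # accepted: exactly three args
RC_OK=$?
RC_FEW=0
usage_check covance dev 2>/dev/null || RC_FEW=$?   # rejected: only two args
```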
---
- hosts: all
  tasks:
    - name: Add sirius_container to storm_clusters group
      add_host: name=sirius_container groups=storm_clusters
    - name: Add sirius (docker) to storm_clusters group
      add_host:
        name: sirius
        groups: storm_clusters
        ansible_connection: docker
        ansible_ssh_user: root
        ansible_become_user: root
        ansible_become: yes

- hosts: sirius_container
  connection: local
  roles:
    - sirius_container

- hosts: sirius
  roles:
    - sirius_deploy
---
- name: Test External Variables
  fail: msg="Bailing out. This role requires '{{ item }}'"
  when: "{{ item }} is not defined"
  with_items: "{{ required_vars }}"

- include: setup.yml
- include: config.yml
---
# tasks to configure
- name: Include customers variables
  include_vars: customers.yml

- name: Configure envfile
  template:
    src: dev.env.j2
    dest: "{{ host_staging_dir }}/templates/dev.env"

- name: Configure storm configuration (storm.yaml)
  template:
    src: storm.yaml.j2
    dest: "{{ host_staging_dir }}/templates/storm.yaml"

- name: Configure storm hosts environment
  template:
    src: hosts.j2
    dest: "{{ host_staging_dir }}/templates/hosts"
# dev.env.j2
{% for key, val in customers[customer].iteritems() %}
{{ key }}="{{ val }}"
{% endfor %}
# hosts.j2
{% for key, val in storm_clusters[storm_cluster].iteritems() %}
{{ val['inet'] }} {{ key }}
{% endfor %}
# storm.yaml.j2
nimbus.seeds: [{{ storm_clusters[storm_cluster].keys() |
select('search', 'nimbus') |
join(", ") }}]
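For the stormcluster-01 inventory at the top of this page, this template renders to approximately the following (key ordering is not guaranteed, and note the host names come out unquoted):

```
nimbus.seeds: [nimbus-stormcluster.gobalto.com, nimbus2-stormcluster.gobalto.com]
```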
---
# tasks to set up the container build environment
# these tasks can be run on localhost or a remote system
- name: Include docker variables
  include_vars: docker.yml

- name: Log into DockerHub
  docker_login:
    username: "{{ docker_hub_username }}"
    password: "{{ docker_hub_password }}"
    email: "{{ docker_hub_email }}"

- name: Make dirs
  file: path={{ item }} state=directory mode=0755
  with_items:
    - "{{ host_staging_dir }}/logs"
    - "{{ host_staging_dir }}/templates"
    - "{{ host_staging_dir }}/target"

- name: Find if select container exists
  shell: docker ps -a | grep -q '{{ name_app }}$'
  register: sirius_container
  ignore_errors: true

- name: Stop and remove the container if it already exists
  shell: 'docker stop {{ name_app }} && docker rm {{ name_app }}'
  when: sirius_container.rc == 0

- name: Build Sirius Container
  docker:
    name: "{{ name_app }}"
    image: gobaltoops/sirius:{{ git_hash_short }}
    state: reloaded
    pull: always
    command: bash "{{ app_root }}"/"{{ config_path }}"/"{{ app_env }}"/wrapper.sh
    volumes:
      - "{{ host_staging_dir }}/templates:/templates"
      - "{{ host_staging_dir }}/logs:{{ app_root }}/logs"
      - "{{ host_staging_dir }}/target:{{ app_root }}/output"
    env:
      APP_ROOT: "{{ app_root }}"
      CUSTOMER: "{{ customer }}"
      NODE_ENV: "{{ app_env }}"

- name: Add Docker Connection
  add_host:
    name: "{{ name_app }}"
    groups: storm_clusters
    ansible_connection: docker
    ansible_ssh_user: root
    ansible_become_user: root
    ansible_become: yes
---
# tasks file for sirius_deploy
- name: Test External Variables
  fail: msg="Bailing out. This role requires '{{ item }}'"
  when: "{{ item }} is not defined"
  with_items: "{{ required_vars }}"

- name: Config db connections from envfile
  command: sirius config {{ envfile }}

- name: Build topology jar
  command: sirius build {{ git_hash_short }}

- name: Stop active topologies
  command: sirius stop

- name: Migrate source Activate db
  command: sirius migrate source

- name: Migrate destination star db
  command: sirius migrate star

- name: Deploy topology
  command: sirius deploy {{ git_hash_short }}
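Inside the container, the role above is equivalent to running the sirius helper by hand (the envfile path and hash shown here are illustrative):

```
sirius config /templates/dev.env
sirius build a1b2c3d4
sirius stop
sirius migrate source
sirius migrate star
sirius deploy a1b2c3d4
```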