Ops
TALKS
Knowledge worth sharing
#01
Florian Dambrine - Principal Engineer - @GumGum
Terraform
Agenda
What DOES it DO
***
Basics
***
DEEP dive
***
CHEATSHEET
What does it do
/ Terraform / Terragrunt / Atlantis /
Terraform / Terragrunt / Atlantis ???
Engine that interacts with SDKs (providers) to CRUD datasources & resources. Operations performed by Terraform are recorded in a state
Wrapper to keep your Terraform code DRY and ease multi-account management
Pull Request Bot to run Terraform on the fly when a PR is updated. It ensures a development workflow as well as acting as a central runner
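Atlantis is typically driven by a repo-level atlantis.yaml; a minimal sketch (directory path and workflow name are hypothetical):

```yaml
# atlantis.yaml - autoplan a component whenever its PR touches *.hcl files
version: 3
projects:
  - dir: terraform/terragrunt/gumgum-ai/virginia/prod/<somecomponent>
    workflow: terragrunt        # hypothetical custom workflow name
    autoplan:
      enabled: true
      when_modified: ["*.hcl"]
```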
RUNTIME
FROM hashicorp/terraform:0.12.29
# Caching providers + credstash
FROM lowess/terragrunt as tools
FROM runatlantis/atlantis:v0.15.0
COPY --from=tools /usr/local/bin /usr/local/bin
COPY --from=tools /opt/.terraform.d /opt/.terraform.d
COPY --from=tools /root/.terraformrc /home/atlantis/.terraformrc
COPY --from=tools /root/.terraformrc /root/.terraformrc
RECAP
Read MORE !
Basics
/ Terraform State / Repo Layout / Local development / INFRA AS CODE GOTCHAS /
Terraform state management
Terraform must store state about your managed infrastructure and configuration. This state is used by Terraform to map real world resources to your configuration, keep track of metadata, and to improve performance for large infrastructures.
terraform.tfstate
locking
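A sketch of what an S3 backend with DynamoDB locking looks like in plain Terraform (bucket/key values are illustrative; Terragrunt generates this block per component, as the next slide shows):

```hcl
terraform {
  backend "s3" {
    bucket         = "terraform-state-verity-gumgum-ai-us-east-1"
    key            = "gumgum-ai/virginia/prod/<somecomponent>/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"  # lock held for the duration of plan/apply
  }
}
```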
State isolation with Terragrunt
locals {
remote_state_bucket = "terraform-state-${local.product_name}-${local.account_name}-${local.aws_region}"
remote_state_bucket_prefix = "${path_relative_to_include()}/terraform.tfstate"
}
remote_state {
backend = "s3"
config = {
encrypt = true
bucket = "${local.remote_state_bucket}"
key = "${local.remote_state_bucket_prefix}"
region = local.aws_region
dynamodb_table = "terraform-locks"
}
...
}
terraform/terragrunt/terragrunt.hcl
Terraform state & REPOSITORY layout
terragrunt
├── gumgum-ads
│ ├── account.hcl
│ └── virginia
├── gumgum-ai
│ ├── account.hcl
│ ├── oregon
│ └── virginia
├── terragrunt.hcl
└── versioning.hcl
locals {
remote_state_bucket = "terraform-state-${local.product_name}-${local.account_name}-${local.aws_region}"
remote_state_bucket_prefix = "${path_relative_to_include()}/terraform.tfstate"
}
remote_state {
backend = "s3"
config = {
encrypt = true
bucket = "${local.remote_state_bucket}"
key = "${local.remote_state_bucket_prefix}"
region = local.aws_region
dynamodb_table = "terraform-locks"
}
...
}
$ aws s3api list-buckets \
| jq -r '.Buckets[] | select(.Name | test("^terraform-state")) | .Name'
### In AI account
* terraform-state-verity-gumgum-ai-us-east-1
* terraform-state-verity-gumgum-ai-us-west-2
### In Ads account
* terraform-state-verity-gumgum-ads-us-east-1
# Example of account.hcl content
locals {
account_name = "gumgum-ads"
product_name = "verity"
}
The S3 state layout mirrors the repository layout, thanks to Terragrunt's path_relative_to_include() function.
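For example, a component nested under gumgum-ai/virginia/prod ends up at a matching S3 key (<somecomponent> is a placeholder):

```
# Component path in the repo:
terragrunt/gumgum-ai/virginia/prod/<somecomponent>/terragrunt.hcl
# Resulting state object (bucket from account + region, key from relative path):
s3://terraform-state-verity-gumgum-ai-us-east-1/gumgum-ai/virginia/prod/<somecomponent>/terraform.tfstate
```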
Repository layout (Depth 0->N)
N0  - terraform/terragrunt
N+1 - account (gumgum-ads, gumgum-ai)
N+2 - region (virginia, oregon)
N+3 - environment (prod, stage, dev)
N+4 - FREE FORM STRUCTURE
For example
- N+5 - Jira project / Team namespace
- N+6 - Infrastructure bit made of multiple components (Kafka)
- N+7 - Single infrastructure component (Zookeeper)
Local development
# Create the `tf` alias that will spin up a docker container
# with all the tooling required to operate Verity infrastructure
export GIT_WORKSPACE="${HOME}/Workspace"
alias tf='docker run -it --rm \
-v ~/.aws:/root/.aws \
-v ~/.ssh:/root/.ssh \
-v ~/.kube:/root/.kube \
-v "$(pwd):/terragrunt" \
-v "${GIT_WORKSPACE}:/modules" \
-w /terragrunt \
lowess/terragrunt:0.12.29'
### Terminal Laptop
$ cd $GIT_WORKSPACE/verity-infra-ops/terraform/terragrunt
$ tf
### Entered docker container
/terragrunt $ cd <path-to-component-you-want-to-test>
### Plan with terragrunt module override (point to local)
/terragrunt $ terragrunt plan --terragrunt-source /modules/terraform-verity-modules//<somemodule>
DEMo
INFRASTRUCTURE AS CODE GOTCHAS
Infrastructure code != Application code
#!/usr/bin/env python
import boto3
client = boto3.client('sqs')
sqs_queue = "ops-talk-sqs"
response = client.create_queue(
QueueName=sqs_queue,
Attributes = {
'DelaySeconds': 90
},
tags={
'Name': sqs_queue,
'Environment': 'dev'
}
)
print(response)
diff --git a/queue.py b/queue.py
index c19f2f6..c4538e5 100644
--- a/queue.py
+++ b/queue.py
@@ -6,7 +6,7 @@ client = boto3.client('sqs')
sqs_queue = "ops-talk-sqs"
-response = client.create_queue(
+queue = client.create_queue(
QueueName=sqs_queue,
Attributes = {
'DelaySeconds': 90
@@ -17,4 +17,4 @@ response = client.create_queue(
}
)
-print(response)
+print(queue)
Is this change harmless ---^ ?
INFRASTRUCTURE AS CODE GOTCHAS
Infrastructure code != Application code
# terraform.tf
locals {
sqs_name = "sqs-ops-talk"
}
resource "aws_sqs_queue" "terraform_queue" {
name = "${local.sqs_name}"
delay_seconds = 90
tags = {
Name = "${local.sqs_name}"
Environment = "dev"
}
}
diff --git a/terraform.tf b/terraform.tf
index 3c8cb08..8981f1d 100644
--- a/terraform.tf
+++ b/terraform.tf
@@ -4,7 +4,7 @@ locals {
sqs_name = "sqs-ops-talk"
}
-resource "aws_sqs_queue" "terraform_queue" {
+resource "aws_sqs_queue" "queue" {
name = "${local.sqs_name}"
delay_seconds = 90
tags = {
Is this change harmless ---^ ?
- Remember -
Infrastructure as code is made of code & state !
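One way out of the rename gotcha above (a sketch, assuming the stock Terraform CLI): move the existing state entry so the plan no longer wants to destroy & recreate the queue.

```
### Renaming a resource in code orphans its state entry; move it instead
$ terraform state mv aws_sqs_queue.terraform_queue aws_sqs_queue.queue
```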
Deep-Dive
/ Terragrunt no-module / Terragrunt provider overwrites / Module versioning at scale / Best practices /
Terragrunt with no modules
Please note that it is better practice to build a module than to write plain Terraform
# ---------------------------------------------------------------------------------------------------------------------
# TERRAGRUNT CONFIGURATION
# This is the configuration for Terragrunt, a thin wrapper for Terraform that supports locking and enforces best
# practices: https://github.com/gruntwork-io/terragrunt
# ---------------------------------------------------------------------------------------------------------------------
locals {
# Automatically load environment-level variables
environment_vars = read_terragrunt_config(find_in_parent_folders("env.hcl"))
# Extract out common variables for reuse
env = local.environment_vars.locals.environment
}
# Terragrunt will copy the Terraform configurations specified by the source parameter, along with any files in the
# working directory, into a temporary folder, and execute your Terraform commands in that folder.
terraform {}
# Include all settings from the root terragrunt.hcl file
include {
path = find_in_parent_folders()
}
# These are the variables we have to pass in to use the module specified in the terragrunt configuration above
inputs = {}
main.tf
variable "aws_region" {}
provider "aws" {
region = var.aws_region
}
resource "..."
terragrunt.hcl
DEMo
TERRAGRUNT Provider Overwrites
# ---------------------------------------------------------------------------------------------------------------------
# TERRAGRUNT CONFIGURATION
# This is the configuration for Terragrunt, a thin wrapper for Terraform that supports locking and enforces best
# practices: https://github.com/gruntwork-io/terragrunt
# ---------------------------------------------------------------------------------------------------------------------
locals {
# Automatically load environment-level variables
environment_vars = read_terragrunt_config(find_in_parent_folders("env.hcl"))
# Extract out common variables for reuse
env = local.environment_vars.locals.environment
}
# Terragrunt will copy the Terraform configurations specified by the source parameter, along with any files in the
# working directory, into a temporary folder, and execute your Terraform commands in that folder.
terraform {
source = "git::ssh://git@bitbucket.org/..."
}
generate "provider" {
path = "providers.tf"
if_exists = "overwrite_terragrunt"
contents = <<EOF
# Configure the AWS provider
provider "aws" {
version = "~> 2.9.0"
region = var.aws_region
}
EOF
}
# Include all settings from the root terragrunt.hcl file
include {
path = find_in_parent_folders()
}
# These are the variables we have to pass in to use the module specified in the terragrunt configuration above
inputs = {}
terragrunt.hcl
DEMo
Warning !
Overwriting the providers.tf file from Terragrunt can lead to broken modules. Make sure every provider defined in the module is also part of the generate statement
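For instance, if the module's own providers.tf also configured a second provider, the generated file must re-declare it or the overwrite silently drops it (the template provider below is hypothetical):

```hcl
generate "provider" {
  path      = "providers.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  version = "~> 2.9.0"
  region  = var.aws_region
}
# Re-declare every other provider the module's original providers.tf
# configured (hypothetical example):
provider "template" {
  version = "~> 2.1"
}
EOF
}
```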
Module versioning at scale
DEMo
$ tree -L 1 /terragrunt/gumgum-ai/virginia/prod/verity-api/dynamodb-*
dynamodb-verity-image
└── terragrunt.hcl
dynamodb-verity-images
└── terragrunt.hcl
dynamodb-verity-pages
└── terragrunt.hcl
dynamodb-verity-text
└── terragrunt.hcl
dynamodb-verity-text-source
└── terragrunt.hcl
dynamodb-verity-video
└── terragrunt.hcl
# dynamodb-*/terragrunt.hcl
terraform {
source = "/terraform-verity-modules//verity-api/dynamodb?ref=v1.4.2"
}
# dynamodb-*/terragrunt.hcl
locals {
versioning_vars = read_terragrunt_config(find_in_parent_folders("versioning.hcl"))
version = lookup(local.versioning_vars.locals.versions, "verity-api/dynamodb", "latest")
}
terraform {
source = "/terraform-verity-modules//verity-api/dynamodb?ref=${local.version}"
}
# versioning.hcl
locals {
versions = {
"latest": "v1.7.2"
"verity-api/dynamodb": "v1.4.2"
"nlp/elasticache": "v1.7.2"
"namespace/app": "v1.3.1"
}
}
When a new feature comes in, bump the version of the namespace you want to upgrade
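For instance, upgrading only the verity-api/dynamodb namespace might look like this (v1.5.0 is a hypothetical new tag):

```
diff --git a/versioning.hcl b/versioning.hcl
--- a/versioning.hcl
+++ b/versioning.hcl
@@ -3,3 +3,3 @@ locals {
     "latest": "v1.7.2"
-    "verity-api/dynamodb": "v1.4.2"
+    "verity-api/dynamodb": "v1.5.0"
```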
Best practices - using locals
locals {
### Ease the access of sg ids from the peered VPC
peer_scheduler_sg = "${module.peer_discovery.security_groups_json["scheduler"]}"
peer_kafka_monitoring_sg = "${module.peer_discovery.security_groups_json["monitoring"]}"
peer_tws_sg = "${module.peer_discovery.security_groups_json["tws"]}"
peer_pqp_sg = "${module.peer_discovery.security_groups_json["pqp"]}"
verity_api_sg = data.terraform_remote_state.verity_ecs.outputs.ecs_sg_id
hce_sg = data.terraform_remote_state.hce_ecs.outputs.ecs_sg_id
sg_description = "Access cluster from [${var.account_name}] "
}
resource "aws_security_group_rule" "kafka_tcp_9092_ads_scheduler" {
type = "ingress"
from_port = 9092
to_port = 9092
protocol = "tcp"
source_security_group_id = local.peer_scheduler_sg ### ---> locals leads to cleaner code
security_group_id = aws_security_group.kafka.id
description = "${local.sg_description} ${local.peer_scheduler_sg}"
}
Locals = Data structure manipulation / Pattern enforcement / Cleaner code
Best practices - outputs for tfstate lookup
<module>
├── datasources.tf
├── ec2.tf
├── ecs.tf
├── outputs.tf
├── providers.tf
├── templates
│ └── ec2-cloud-config.yml.tpl
├── userdata.tf
├── variables.tf
└── versions.tf
# <module>/outputs.tf
output "ecs_esg_id" {
value = spotinst_elastigroup_aws.ecs.id
}
output "ecs_sg_id" {
value = aws_security_group.ecs.id
}
outputs.tf
{
"version": 4,
"terraform_version": "0.12.24",
"serial": 24,
"lineage": "ade31deb-1be6-2953-5b1c-432e52ed44b8",
"outputs": {
"ecs_esg_id": {
"value": "sig-c5916298",
"type": "string"
},
"ecs_sg_id": {
"value": "sg-1d8f44c8ab69634da",
"type": "string"
}
},
"resources": [...]
}
terraform.tfstate
Best practices - outputs for tfstate lookup
locals {
memcached_sgs = {
memcached-japanese-verity = data.terraform_remote_state.memcached_japanese_verity.outputs.memcached_sg_id
memcached-metadata-verity = data.terraform_remote_state.memcached_metadata_verity.outputs.memcached_sg_id
}
}
data "terraform_remote_state" "memcached_metadata_verity" {
backend = "s3"
config = {
bucket = var.remote_state_bucket
key = "${var.remote_state_bucket_prefix}/../../../../memcached-metadata-verity/terraform.tfstate"
region = var.aws_region
}
}
data "terraform_remote_state" "memcached_japanese_verity" {
backend = "s3"
config = {
bucket = var.remote_state_bucket
key = "${var.remote_state_bucket_prefix}/../../../../memcached-japanese-verity/terraform.tfstate"
region = var.aws_region
}
}
# Do something with local.memcached_sgs
resource "aws_security_group_rule" "memcached_tcp_11211_ecs_cluster" {
for_each = { for key, val in local.memcached_sgs: key => val }
type = "ingress"
from_port = 11211
to_port = 11211
protocol = "tcp"
description = "Access from ${title(var.ecs_cluster_name)}"
source_security_group_id = aws_security_group.ecs.id
security_group_id = each.value
}
- REMEMBER -
The outputs stored in terraform.tfstate files should be your single source of truth...
Best practices - You trust me and vice versa
resource "aws_security_group_rule" "kafka_tcp_9092_pritunl" {
for_each = local.kafka_sgs
type = "ingress"
from_port = 9092
to_port = 9092
protocol = "tcp"
security_group_id = each.value
source_security_group_id = local.pritunl_sg
description = "Access Kafka cluster from VPN"
}
resource "aws_security_group_rule" "kafka_tcp_9092_monitoring" {
for_each = local.kafka_sgs
type = "ingress"
from_port = 9092
to_port = 9092
protocol = "tcp"
security_group_id = each.value
source_security_group_id = local.monitoring_sg
description = "Access Kafka cluster from Monitoring"
}
resource "aws_security_group_rule" "kafka_tcp_9092_some_client_app" {
for_each = local.kafka_sgs
type = "ingress"
from_port = 9092
to_port = 9092
protocol = "tcp"
security_group_id = each.value ### ?????
source_security_group_id = local.client_app ### ?????
description = "Access Kafka cluster from Some client APP"
}
terragrunt/gumgum-ai/virginia/prod/prism-kafka-cluster/kafka
- Remember -
If you need to rebuild the infra from scratch, the order will be: VPC > Backends > Apps --- At the Backends stage the Kafka cluster is provisioned, but the applications do not exist yet (🐔 & 🥚), which will break the provisioning
Best practices - You trust me and vice versa
terragrunt/gumgum-ai/virginia/prod/prism-kafka-cluster/kafka
.../prod/ops-talk/consumerapp
resource "aws_security_group_rule" "kafka_tcp_9092_pritunl" {
for_each = local.kafka_sgs
type = "ingress"
from_port = 9092
to_port = 9092
protocol = "tcp"
security_group_id = each.value
source_security_group_id = local.pritunl_sg
description = "Access Kafka cluster from VPN"
}
resource "aws_security_group_rule" "kafka_tcp_9092_monitoring" {
for_each = local.kafka_sgs
type = "ingress"
from_port = 9092
to_port = 9092
protocol = "tcp"
security_group_id = each.value
source_security_group_id = local.monitoring_sg
description = "Access Kafka cluster from Monitoring"
}
resource "aws_security_group_rule" "kafka_tcp_9092_some_client_app" {
for_each = local.kafka_sgs
type = "ingress"
from_port = 9092
to_port = 9092
protocol = "tcp"
security_group_id = local.client_app ### ---> trust for security groups
source_security_group_id = each.value ### ---> is flipped here
description = "Access Kafka cluster from Some client APP"
}
Best practices - LOOPS - Count is 😈
locals {
source_sg = "sg-60f3be13"
kafka_sgs = toset([
"sg-078f01cdfc0c41c9a",
"sg-09e683e9740fc3d8c"
])
}
resource "aws_security_group_rule" "kafka_tcp_9092_newapp" {
count = length(local.kafka_sgs)
type = "ingress"
from_port = 9092
to_port = 9092
protocol = "tcp"
security_group_id = element(tolist(local.kafka_sgs), count.index)
source_security_group_id = local.source_sg
description = "Access Kafka cluster from ops-talk"
}
resource "aws_security_group_rule" "kafka_tcp_9092_newapp" {
for_each = local.kafka_sgs
type = "ingress"
from_port = 9092
to_port = 9092
protocol = "tcp"
security_group_id = each.value
source_security_group_id = local.source_sg
description = "Access Kafka cluster from ops-talk"
}
Vs
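Why count is 😈: a toy model (plain Python, not Terraform) of how the two constructs address instances in state. With count, instances are keyed by list index, so removing an element re-keys everything after it and forces destroy/recreate; with for_each they are keyed by the set value itself, so survivors keep their address.

```python
# Two security groups feeding a rule resource
sgs = ["sg-078f01cdfc0c41c9a", "sg-09e683e9740fc3d8c"]

def count_state(items):
    # count-style addressing: rule[0], rule[1], ... keyed by position
    return {f"rule[{i}]": sg for i, sg in enumerate(items)}

def for_each_state(items):
    # for_each-style addressing: rule["sg-..."] keyed by value
    return {f'rule["{sg}"]': sg for sg in items}

# Drop the first security group from the list
before_count, after_count = count_state(sgs), count_state(sgs[1:])
before_each, after_each = for_each_state(sgs), for_each_state(sgs[1:])

# count: rule[0] now points at a different SG -> Terraform plans destroy/recreate
print(before_count["rule[0]"] != after_count["rule[0]"])  # True
# for_each: the surviving address is unchanged -> no churn
key = 'rule["sg-09e683e9740fc3d8c"]'
print(before_each[key] == after_each[key])  # True
```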
DEMo
CHEATSHEET
Ops
TALKS
Knowledge worth sharing
By Florian