Dev Ops & SRE

for Patek.cz

but that's not what this is about ^^

Petr Michalec

Works as an SRE at F5; previously at Volterra.io, Mirantis, IBM, ...

n(vi)m lover. developer. geek. quad fpv pilot.

all with a passion for the edge thing

 

On Twitter as @epcim

 

development and operation of modern applications and infrastructure

  • What you want to be
  • What you do when you create

Buzzword

So what is DevOps?

GitOps

SecOps

NetOps

DevOps

DataOps

AIOps

5G

AI

hyperautomation

XR

The app

  • a good idea

  • write it well

  • package it

  • distribute it

  • run it *

  • on a PC
  • on a phone
  • in a container
  • in the cloud
  • across clouds, edge, IoT
  • ...

Trend

DevOps

  • "Dev" stands for development (UX, frontend, backend, ...)
  • "Ops" stands for keeping things running (IT)

 

"SRE" is actually a job role; it is also defined as the principles and practices of software engineering applied to infrastructure and operations.

 

So what is DevOps?

pattern

So what is DevOps?

...to be able to deploy and operate both applications and infrastructure as code

A cultural and professional movement, focused on how we build and operate high velocity organizations, born from the experiences of its practitioners. 

Or also

How

Why

In short

Application

Logic

In short

Application

Container

Orchestration

  • Integration (because microservices)

  • Deployment (because infra. as code)

  • Delivery (because business needs)

  • Measure/Improve (to improve)

  • Validation (because security, audit)

Continuous is a buzzword too

  • What tools we use

  • Demo

  • What managing applications distributed in the cloud looks like

What next?

Glue

Git

A tool for distributed version control of source code.

Originally written for the Linux kernel. (Others: Bzr, Hg, Svn.)

 

https://git-scm.com/book/cs/v2

 

 

Git


git clone https://github.com/GoogleContainerTools/kpt-functions-catalog

git show
git diff

git pull -r

git checkout -b new-branch

git cherry-pick dbd4b43

git commit

git rebase origin/main
git rebase -i HEAD~5

Golang

A programming language from Google (2007).

High performance and fast development. Powerful standard library.

 

https://go.dev/tour/welcome/1

 

 

Golang

package main

import "fmt"

func main() {
	fmt.Println("hello world")
}

$ go build hello-world.go

$ ls
hello-world    hello-world.go

$ ./hello-world
hello world

Golang

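// GotplRenderBuf renders the template file t into out.
// Needs bytes, fmt, os, text/template, and the sprig template function library.
func (p *RenderPlugin) GotplRenderBuf(t string, out *bytes.Buffer) error {

	// read template
	tContent, err := os.ReadFile(t)
	if err != nil {
		return fmt.Errorf("read template failed: %w", err)
	}

	// build the function map: sprig defaults plus custom functions
	fMap := sprig.TxtFuncMap()
	for k, v := range SprigCustomFuncs {
		fMap[k] = v
	}

	// parse; return the error instead of panicking via template.Must
	tpl, err := template.New(t).Funcs(fMap).Parse(string(tContent))
	if err != nil {
		return fmt.Errorf("parse template failed: %w", err)
	}

	// render
	if err := tpl.Execute(out, p.Values); err != nil {
		return fmt.Errorf("render template failed: %w", err)
	}
	return nil
}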
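
The same read, parse, render flow can be tried without the surrounding plugin. The following is a minimal sketch that uses only the standard library: the `upper` function is a stand-in for the sprig function map, and the names `renderToBuffer` and `render` are hypothetical, not part of the original code.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"text/template"
)

// renderToBuffer parses src as a text/template and renders it with values into out.
// funcs carries extra template functions (the slide uses sprig for this).
func renderToBuffer(name, src string, funcs template.FuncMap, values any, out *bytes.Buffer) error {
	tpl, err := template.New(name).Funcs(funcs).Parse(src)
	if err != nil {
		return fmt.Errorf("parse template failed: %w", err)
	}
	if err := tpl.Execute(out, values); err != nil {
		return fmt.Errorf("render template failed: %w", err)
	}
	return nil
}

// render is a convenience wrapper that returns the rendered text as a string.
func render(src string, values any) (string, error) {
	var buf bytes.Buffer
	funcs := template.FuncMap{"upper": strings.ToUpper} // stand-in for sprig functions
	if err := renderToBuffer("inline", src, funcs, values, &buf); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	values := map[string]string{"app": "demo", "env": "edge"}
	text, err := render("app={{ .app }} env={{ upper .env }}\n", values)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(text)
}
```

Returning parse errors instead of panicking keeps a broken template from taking down the whole process, which matters when templates come from users.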

 

  • Python
  • JavaScript / TypeScript

It also depends on what for...

Terraform

A tool that, through a declarative language (DSL), idempotent configuration, and a CLI, lets you manage remote ...... via API
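
What "declarative" and "idempotent" mean in practice can be shown with a toy model. The sketch below is not how Terraform is implemented; it only assumes a simplified state (resource name mapped to one config string) and hypothetical `plan` and `apply` helpers to show that applying the same desired state twice does nothing the second time.

```go
package main

import (
	"fmt"
	"sort"
)

// State maps a resource name to its configuration (simplified to a single string).
type State map[string]string

// plan compares desired and actual state and returns the actions needed
// to make actual match desired. An empty plan means there is nothing to do.
func plan(desired, actual State) []string {
	var actions []string
	for name, cfg := range desired {
		cur, ok := actual[name]
		switch {
		case !ok:
			actions = append(actions, "create "+name)
		case cur != cfg:
			actions = append(actions, "update "+name)
		}
	}
	for name := range actual {
		if _, ok := desired[name]; !ok {
			actions = append(actions, "delete "+name)
		}
	}
	sort.Strings(actions) // map iteration order is random; sort for stable output
	return actions
}

// apply makes actual equal to desired and returns the actions it performed.
func apply(desired, actual State) []string {
	actions := plan(desired, actual)
	for name := range actual {
		if _, ok := desired[name]; !ok {
			delete(actual, name)
		}
	}
	for name, cfg := range desired {
		actual[name] = cfg
	}
	return actions
}

func main() {
	// Hypothetical resources, for illustration only.
	desired := State{"bucket": "public-read", "dns_zone": "example.test"}
	actual := State{"bucket": "private", "old_vm": "t2.micro"}

	fmt.Println("first apply: ", apply(desired, actual))
	fmt.Println("second apply:", apply(desired, actual))
}
```

The first call reports one create, one delete, and one update; the second reports an empty plan, which is the property that makes it safe to re-run the tool.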

 

https://registry.terraform.io/browse/providers

 

Terraform

variable "aws_region" {
  default = "eu-west-1"
}

variable "domain" {
  default = "my_domain"
}

provider "aws" {
  region = var.aws_region
}

# Note: The bucket name needs to carry the same name as the domain!
# http://stackoverflow.com/a/5048129/2966951
resource "aws_s3_bucket" "site" {
  bucket = var.domain
  acl    = "public-read"

  policy = <<EOF
    {
      "Version":"2008-10-17",
      "Statement":[{
        "Sid":"AllowPublicRead",
        "Effect":"Allow",
        "Principal": {"AWS": "*"},
        "Action":["s3:GetObject"],
        "Resource":["arn:aws:s3:::${var.domain}/*"]
      }]
    }
  EOF

  website {
      index_document = "index.html"
  }
}

# Note: Creating this route53 zone is not enough. The domain's name servers need to point to the NS
# servers of the route53 zone. Otherwise the DNS lookup will fail.
# To verify that the dns lookup succeeds: `dig site @nameserver`
resource "aws_route53_zone" "main" {
  name = var.domain
}

resource "aws_route53_record" "root_domain" {
  zone_id = aws_route53_zone.main.zone_id
  name    = var.domain
  type    = "A"

  alias {
    name                   = aws_cloudfront_distribution.cdn.domain_name
    zone_id                = aws_cloudfront_distribution.cdn.hosted_zone_id
    evaluate_target_health = false
  }
}

resource "aws_cloudfront_distribution" "cdn" {
  origin {
    origin_id   = var.domain
    domain_name = "${var.domain}.s3.amazonaws.com"
  }

  # If using route53 aliases for DNS we need to declare it here too, otherwise we'll get 403s.
  aliases = [var.domain]

  enabled             = true
  default_root_object = "index.html"

  default_cache_behavior {
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = var.domain

    forwarded_values {
      query_string = true
      cookies {
        forward = "none"
      }
    }

    viewer_protocol_policy = "allow-all"
    min_ttl                = 0
    default_ttl            = 3600
    max_ttl                = 86400
  }

  # The cheapest priceclass
  price_class = "PriceClass_100"

  # This is required to be specified even if it's not used.
  restrictions {
    geo_restriction {
      restriction_type = "none"
      locations        = []
    }
  }

  viewer_certificate {
    cloudfront_default_certificate = true
  }
}

output "s3_website_endpoint" {
  value = aws_s3_bucket.site.website_endpoint
}

output "route53_domain" {
  value = aws_route53_record.root_domain.fqdn
}

output "cdn_domain" {
  value = aws_cloudfront_distribution.cdn.domain_name
}

Ansible

Also a tool with declarative configuration and a CLI, and it too lets you manage remote "servers, routers".

So-called configuration management.

 

https://www.guru99.com/ansible-tutorial.html

https://docs.ansible.com/

 

(Others: Salt, https://saltproject.io/).

 

Ansible


# playbook-base.yaml

- name: base
  become: true
  become_method: sudo
  hosts: all
  roles:
    - sshd
    - users
    - system
    - netplan
    - network
    #- hardening
    #- cleanup
  serial: "{{ batch_size|default(10) }}"
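
The `serial` setting in the playbook above makes Ansible run the play on one batch of hosts at a time. A minimal sketch of that batching idea follows; the `batches` helper and the host names are hypothetical, and the fallback to 10 only loosely mirrors the `default(10)` filter (which applies when the variable is undefined).

```go
package main

import "fmt"

// batches splits hosts into consecutive groups of at most size hosts.
// A size below 1 falls back to 10, loosely mirroring default(10) in the playbook.
func batches(hosts []string, size int) [][]string {
	if size < 1 {
		size = 10
	}
	var out [][]string
	for start := 0; start < len(hosts); start += size {
		end := start + size
		if end > len(hosts) {
			end = len(hosts)
		}
		out = append(out, hosts[start:end])
	}
	return out
}

func main() {
	hosts := []string{"edge-1", "edge-2", "edge-3", "edge-4", "edge-5"}
	for i, b := range batches(hosts, 2) {
		// A real rollout would finish all roles on this batch before moving on.
		fmt.Printf("batch %d: %v\n", i+1, b)
	}
}
```

Rolling out in batches limits the blast radius: a bad change stops after the first group instead of hitting every host at once.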

Ansible

- name: sshd configuration file
  template:
    src: sshd_config.metal.j2
    dest: "{{ sshdconfig }}"
    owner: 0
    group: 0
    mode: "0600"
    validate: "sshd -T -f %s"
    backup: yes
  vars:
    sshd_allow_users: "{{ (admin_users + autom_users) | join(' ') }} "
  notify:
  - restart sshd
# when: ansible_virtualization_role != "guest" or ansible_virtualization_type != "docker"
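
The `validate` option in the task above means the rendered file is checked before it replaces the live config. A rough sketch of that pattern, under stated assumptions: `installConfig`, `requireSetting`, and `demo` are hypothetical names, and the string check is only a stand-in for running `sshd -T -f %s`. The `backup` and `notify` parts of the task are not modelled.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// installConfig writes content to a temporary file next to dest, runs validate on it,
// and only then renames it over dest. If validation fails, dest is left untouched.
func installConfig(dest string, content []byte, validate func(path string) error) error {
	tmp, err := os.CreateTemp(filepath.Dir(dest), ".candidate-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // cleans up on failure; harmless after a successful rename

	if _, err := tmp.Write(content); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := validate(tmp.Name()); err != nil {
		return fmt.Errorf("validation failed, keeping old config: %w", err)
	}
	return os.Rename(tmp.Name(), dest)
}

// requireSetting is a stand-in validator: it only checks that a setting is present.
func requireSetting(setting string) func(string) error {
	return func(path string) error {
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		if !strings.Contains(string(data), setting) {
			return errors.New("missing " + setting)
		}
		return nil
	}
}

// demo installs a valid config, then attempts an invalid one, and returns what is on disk.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "sshd-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	dest := filepath.Join(dir, "sshd_config")
	check := requireSetting("PermitRootLogin no")
	if err := installConfig(dest, []byte("PermitRootLogin no\n"), check); err != nil {
		return "", err
	}
	// This candidate fails validation, so the previous file must survive.
	if err := installConfig(dest, []byte("PermitRootLogin yes\n"), check); err == nil {
		return "", errors.New("invalid config was unexpectedly accepted")
	}
	data, err := os.ReadFile(dest)
	return string(data), err
}

func main() {
	content, err := demo()
	fmt.Printf("installed=%q err=%v\n", content, err)
}
```

For sshd this ordering matters more than usual: a config that fails to parse can lock you out of the very host you are managing.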

GitLab CI

GitLab is a DevOps platform! Hooray.

A server for Git repositories. Web UI. Issue tracking. And, above all, CI.

 

https://gitlab.com

 

(Others: GitHub https://github.com, Gitea https://gitea.io).

 

Argo CD

Declarative Continuous Delivery. For GitOps. We get it...

 

"infrastructure = code"; the screwdriver that installs that code into Kubernetes.

 

https://argoproj.github.io/cd/

 

Prometheus

Originally from SoundCloud (2012).

multi-dimensional data model, operational simplicity, scalable data collection, and a powerful query language

 

https://prometheus.io/

 

Prometheus

      {
        name: 'etcd-service',
        rules: [
          {
            alert: 'EtcdDatabaseSpaceFilling',
            expr: |||
              100 - (etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes) * 100 < 15
            ||| % $._config,
            'for': '10m',
            labels: {
              severity: 'major',
              identifier: '{{ $labels.instance }}',
              group: 'Infrastructure',
              service_name: 'etcd',
              tenant: 'ves-sre',
            },
            annotations: {
              display_name: 'Database Error',
              description: 'Etcd database {{ $labels.instance }} is filling up. Only {{ $value }}% of space is available.',
            },
            docs:: {
              name: |||
                TODO
              |||,
              description: |||
                Etcd database is filling up.
              |||,
              action: |||
                This has to be solved in next business day working hours by L2.
              |||,
            },
          },
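
The arithmetic behind the `expr` in the rule above is easy to check by hand. This is a small sketch of it with hypothetical sizes; it ignores the `for: 10m` clause, which additionally requires the condition to hold for ten minutes before the alert fires.

```go
package main

import "fmt"

// freePercent mirrors the alert expression: 100 - (db size / quota) * 100.
func freePercent(dbSizeBytes, quotaBytes float64) float64 {
	return 100 - (dbSizeBytes/quotaBytes)*100
}

// shouldAlert reports whether free space is below the 15% threshold used in the rule.
func shouldAlert(dbSizeBytes, quotaBytes float64) bool {
	return freePercent(dbSizeBytes, quotaBytes) < 15
}

func main() {
	const gib = 1 << 30
	quota := 8.0 * gib // hypothetical backend quota
	for _, size := range []float64{4 * gib, 7 * gib} {
		fmt.Printf("size=%.0f GiB free=%.1f%% alert=%v\n",
			size/gib, freePercent(size, quota), shouldAlert(size, quota))
	}
}
```

With an 8 GiB quota, a 4 GiB database leaves 50% free and stays quiet, while a 7 GiB database leaves 12.5% free and crosses the threshold.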

Prometheus

# Aggregating up requests per second that have a path label:
- record: instance_path:requests:rate5m
  expr: rate(requests_total{job="myjob"}[5m])

- record: path:requests:rate5m
  expr: sum without (instance)(instance_path:requests:rate5m{job="myjob"})

# Recording rule example 2:
# Calculating a request failure ratio and aggregating up to the job-level failure ratio:
- record: instance_path:request_failures:rate5m
  expr: rate(request_failures_total{job="myjob"}[5m])

- record: instance_path:request_failures_per_requests:ratio_rate5m
  expr: |2
      instance_path:request_failures:rate5m{job="myjob"}
    /
      instance_path:requests:rate5m{job="myjob"}
# Aggregate up numerator and denominator, then divide to get path-level ratio.
- record: path:request_failures_per_requests:ratio_rate5m
  expr: |2
      sum without (instance)(instance_path:request_failures:rate5m{job="myjob"})
    /
      sum without (instance)(instance_path:requests:rate5m{job="myjob"})
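
Why the rules above sum the numerator and the denominator before dividing is easiest to see with numbers. The sketch below uses hypothetical counter samples and a deliberately simplified `simpleRate`; unlike PromQL `rate()`, it ignores counter resets and does no extrapolation.

```go
package main

import "fmt"

// Sample is one scrape of a monotonically increasing counter.
type Sample struct {
	Seconds float64 // scrape time in seconds
	Value   float64 // counter value at that time
}

// simpleRate returns the per-second increase between the first and last sample.
func simpleRate(window []Sample) float64 {
	if len(window) < 2 {
		return 0
	}
	first, last := window[0], window[len(window)-1]
	if last.Seconds == first.Seconds {
		return 0
	}
	return (last.Value - first.Value) / (last.Seconds - first.Seconds)
}

// ratioOfSums sums numerators and denominators first and divides afterwards.
func ratioOfSums(numerators, denominators []float64) float64 {
	var num, den float64
	for _, v := range numerators {
		num += v
	}
	for _, v := range denominators {
		den += v
	}
	if den == 0 {
		return 0
	}
	return num / den
}

func main() {
	// Hypothetical 300-second windows for two instances serving the same path.
	reqA := simpleRate([]Sample{{0, 1000}, {300, 1600}})
	failA := simpleRate([]Sample{{0, 10}, {300, 40}})
	reqB := simpleRate([]Sample{{0, 5000}, {300, 10400}})
	failB := simpleRate([]Sample{{0, 0}, {300, 0}})

	fmt.Printf("instance A failure ratio: %.4f\n", failA/reqA)
	fmt.Printf("instance B failure ratio: %.4f\n", failB/reqB)
	fmt.Printf("path-level failure ratio: %.4f\n",
		ratioOfSums([]float64{failA, failB}, []float64{reqA, reqB}))
}
```

Averaging the two per-instance ratios would overweight the quiet instance A; dividing the summed rates weights each instance by its actual traffic.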

Fluentbit

Treasure Data "fluentd --> fluentbit" (2014).

super fast, lightweight, and highly scalable logging and metrics processor

 

https://fluentbit.io/

 

Vector.dev

Datadog: a lightweight, ultra-fast tool for building observability pipelines.

 

https://vector.dev

 

Vector.dev

 ## Fluent source
    {{- if has .vector_enable_fluent $enabled }}
    [sources.in_fluent]
      type = "fluent"
      address = "0.0.0.0:24224"

      connection_limit = 2000
      keepalive.time_secs = 30
      receive_buffer_bytes = 1048576

      tls.enabled = true
      tls.verify_certificate = true
      tls.ca_file = "/corp/secrets/identity/client_ca_with_fluent.crt"
      tls.crt_file = "/corp/secrets/identity/server.crt"
      tls.key_file = "/corp/secrets/identity/server.key"
    {{- else }}
    [sources.in_fluent]
      type = "file"
      include = ["/dev/null"]
    {{- end }}

    [transforms.remap_fluent_throttle_key]
      type = "remap"
      inputs = ["in_fluent"]
      source = '''
      if match(.tag, r'^alert\..*$') ?? false {
        ._throttle_key, err = join([.labels.cluster_name, .labels.alertname], separator: "_")
        if err != null {
          log("Unable to construct throttle key for alert cluster_name=" + to_string!(.labels.cluster_name) + ", alertname="+ to_string!(.labels.alertname) +". Dropping invalid event. " + err, level: "error", rate_limit_secs: 10)
          abort
        }
      } else {
        ._throttle_key, err = join([.cluster_name, .tag], separator: "_")
        if err != null {
          log("Unable to construct throttle key for cluster_name=" + to_string!(.cluster_name) + ", tag="+ to_string!(.tag) +". Dropping invalid event. " + err, level: "error", rate_limit_secs: 10)
          abort
        }
      }
      '''
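
The branching in the remap transform above can be read as a small function: alert events are keyed by cluster and alert name, everything else by cluster and tag. The following sketch restates that logic; the `Event` type is hypothetical, and treating an empty string as "missing" is an assumption standing in for the VRL error handling.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// alertTag matches the same pattern as the transform: tags starting with "alert.".
var alertTag = regexp.MustCompile(`^alert\..*$`)

// Event is a simplified log event; field names follow the config above.
type Event struct {
	Tag         string
	ClusterName string
	Labels      map[string]string
}

// throttleKey builds the key used to throttle events.
// Events that cannot produce a key return an error and would be dropped.
func throttleKey(e Event) (string, error) {
	if alertTag.MatchString(e.Tag) {
		cluster, name := e.Labels["cluster_name"], e.Labels["alertname"]
		if cluster == "" || name == "" {
			return "", errors.New("alert event is missing cluster_name or alertname label")
		}
		return cluster + "_" + name, nil
	}
	if e.ClusterName == "" || e.Tag == "" {
		return "", errors.New("event is missing cluster_name or tag")
	}
	return e.ClusterName + "_" + e.Tag, nil
}

func main() {
	events := []Event{
		{Tag: "alert.etcd", Labels: map[string]string{"cluster_name": "site-a", "alertname": "EtcdDatabaseSpaceFilling"}},
		{Tag: "kube.nginx", ClusterName: "site-a"},
		{Tag: "alert.broken"},
	}
	for _, e := range events {
		key, err := throttleKey(e)
		if err != nil {
			fmt.Println("dropped:", err)
			continue
		}
		fmt.Println("key:", key)
	}
}
```

Keying by cluster plus alert name means one noisy alert cannot starve other alerts from the same cluster.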

Observability

Recap

  • Forget the "buzzwords"
  • Develop code, Deploy with code, Emit metrics, Improve
  • Read the documentation, not StackOverflow and YouTube
  • Create anything

Look Forward

Dev Ops & SRE

By Petr Michalec
