πŸ‘‹πŸΎ Hi there,

I'm Karthick.

MS in CS @

Ex-Lead SRE @

Featured on the NASDAQ tower during the bell-ringing ceremony*

My Responsibilities 👨🏾‍💻

  • Owned Infra @ (Marketing automation Suite)
  • Improving Engineering Productivity
  • Improving Infra Security/Scalability/HA/Reliability/Resilience
  • Involved in System Design
  • AWS Cost Budgeting & Capacity Planning
  • Production Uptime - Maintaining SLA/SLO, Define RTO/RPO, RCA/PostMortem & Post-Incident Handling.
  • Manage and Mentor 3 SRE Engineers

Timeline⏳

  • Jul 2016: Full-Stack Engineer @ Zarget - 10th Engineer
  • Oct 2016: Saw Opportunity: Started Integration Team
  • Aug 2017: Got Acquired by Freshworks
    • Rebranded Zarget to Freshmarketer
  • Dec 2017: Saw Opportunity: Dev -> DevOps/SRE
  • First tasks: Rewrote Gradle tasks, introduced hot-swap, improved build time, set up a new AWS Staging account.
  • Apr 2018: Gave my first tech talk on DevOps in Vietnam.
  • Rest is history!!!

Why I left Freshworks? 💔

  • Bucket list: Experience US Education. ✅
  • Fall 2020: Got admitted to Georgia Tech. Covid happened 😩
  • Fall 2021: Online MS in CS @ Georgia Tech
    (Sponsored by Freshworks) 🫡 🙏🏽
  • Fall 2022: Transferred to On-Campus ✅
  • Got 100% Scholarship + Stipend 🤑

✈️

Why ?

  • Startup -> Mid-Market -> Enterprise??
  • Job Fit
    • DevOps, CI/CD, Jenkins, Secret Mgt, Container Registry
    • AWS, Docker, Kubernetes
    • Scripting with Python, BASH
    • Tomcat/Gretty(jetty), Gradle, Nginx, Chef, Github
  • Culture Fit
    • Continuous Improvement and the Pursuit of Excellence
    • Respect and Invest
    • Rational Workplace
    • Learning and Self-Improvement

    • Credibility and Integrity

What am I looking for in my new role?

  • Freedom to explore Multiple Technologies.
  • Career is like a Tripod
    • Solve interesting challenges (Happy work)
    • Career growth/promotion
    • Compensation, Rewards/Recognition

      All three need to be balanced for stability

Let's talk tech!

Project: Simplifying DB Migration with FlywayDB

The Problem? 😩

  • Tale of Manual DB Migration - Incident
  • No tracking of whether a migration has been applied or not.
  • Migration files were spread out!!
  • What about Data Migration?

Ways to Solve? 💡

WINNER!!! 🎊🏆

What is FlywayDB? 🤩

  • Open source tool to manage DB migration
  • Conventions over Configuration
  • Supports Plain SQL/Java-based Migration.
  • 6 Major commands: Info, Migrate, Repair, Clean, Baseline & Validate.
  • Multiple & Simple Setup - Gradle, Maven, Ant etc.

Our use-case? 🤔

  • Migrations usually contain DDL statements
  • Or reference data and data fixes
    • Did we change the country code (or currency) table in this machine?
  • Use JDBC code for complex migration
    • Not Checksummed by default, unlike SQL migration.
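A checksum is what lets a tool notice that an already-applied SQL script was edited afterwards. The sketch below illustrates the idea with a CRC32 over the script's lines; it is an illustration only, not Flyway's exact algorithm, and the function names and the `comments` table are hypothetical.

```python
import zlib


def migration_checksum(sql_text: str) -> int:
    """CRC32 over the script's lines, so line-ending differences are ignored."""
    crc = 0
    for line in sql_text.splitlines():
        crc = zlib.crc32(line.encode("utf-8"), crc)
    return crc


def has_changed(recorded_checksum: int, current_sql: str) -> bool:
    """True if an applied script no longer matches the checksum recorded for it."""
    return migration_checksum(current_sql) != recorded_checksum


applied = "ALTER TABLE comments MODIFY body VARCHAR(2000);\n"
recorded = migration_checksum(applied)

print(has_changed(recorded, applied))                           # False
print(has_changed(recorded, applied.replace("2000", "4000")))   # True
```

A JDBC-based migration has no script text to hash this way, which is presumably why it needs extra care.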

build.gradle

Naming Conventions

FlywayInfo Meta table

The Challenges? 😵‍💫

  1. Working with Maintenance Branch.
  2. Working with Feature Branch.
  3. Working with Lots of Migrations.

#1: Working with Maintenance Branches

  • DB Prod Schema evolves linearly.
  • Ensure maintenance migrations happen before migrations belonging to later releases.
  • Use major.minor version scheme for maintenance
    • V1.1 comes after V1 but before V2.
    • In practice:
      V001_01__increase_comment_size.sql
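The ordering rule above can be sketched in a few lines: parse the version out of each file name and sort on it. This is a minimal sketch, assuming the `V<version>__<description>.sql` naming shown above; `parse_version` is a hypothetical helper, and every file name except `V001_01__increase_comment_size.sql` is made up.

```python
import re

# V<version>__<description>.sql, where version parts are separated
# by dots or single underscores (e.g. V001_01 means version 1.1).
MIGRATION_RE = re.compile(r"^V(\d+(?:[._]\d+)*)__(\w+)\.sql$")


def parse_version(filename: str) -> tuple[int, ...]:
    """Turn 'V001_01__x.sql' into (1, 1) so versions compare numerically."""
    match = MIGRATION_RE.match(filename)
    if match is None:
        raise ValueError(f"not a versioned migration: {filename}")
    return tuple(int(part) for part in re.split(r"[._]", match.group(1)))


files = [
    "V002__add_index.sql",                 # hypothetical, next release
    "V001_01__increase_comment_size.sql",  # maintenance fix on top of V1
    "V001__create_tables.sql",             # hypothetical, original release
]

for name in sorted(files, key=parse_version):
    print(name)
# V001__create_tables.sql
# V001_01__increase_comment_size.sql
# V002__add_index.sql
```

Note that plain tuple comparison treats `(1,)` and `(1, 0)` as different versions; a real tool may normalize trailing zeros.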

#2: Working with Feature Branches

#3: Working with Lots of Migrations

  • Squashing Migrations
  • Squash all applied migrations to one or two files
    • DDL + Reference Data
  • Use baseline command
    gradle flywayBaseline -Dflyway.baselineVersion=<version>
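As a rough illustration of squashing, the sketch below concatenates already-applied migrations, in version order, into a single baseline script. It is a simplification under stated assumptions: `squash` and the file names are hypothetical, and a real squash is more likely to start from a schema dump than from plain concatenation.

```python
def squash(migrations: dict[str, str], baseline_version: str) -> tuple[str, str]:
    """Return (file name, script) for one baseline file built from applied migrations."""

    def version_key(filename: str) -> tuple[int, ...]:
        # "V001_01__x.sql" -> "001_01" -> (1, 1)
        version = filename[1:].split("__", 1)[0]
        return tuple(int(part) for part in version.replace("_", ".").split("."))

    parts = []
    for name in sorted(migrations, key=version_key):
        # Keep a marker comment so the origin of each statement stays traceable.
        parts.append(f"-- from {name}\n{migrations[name].strip()}\n")
    return f"V{baseline_version}__baseline.sql", "\n".join(parts)


name, script = squash(
    {
        "V002__b.sql": "CREATE TABLE b (id INT);",
        "V001__a.sql": "CREATE TABLE a (id INT);",
    },
    baseline_version="2",
)
print(name)  # V2__baseline.sql
print(script)
```

The baseline version passed here would be the same value given to `-Dflyway.baselineVersion`.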

What happened next?

Was invited to Vietnam to deliver a talk on FlywayDB.

"Engineering Award" During All hands

Many teams in Freshworks adopted FlywayDB

But.

All happy stories have a twist! 😡

~2.5 years later...

The Incident 🤯

  • QA deleted the flyway_schema_history table instead of dropping it to baseline, which led to a software malfunction.
  • During Squash migration - The schema dump init migration contained:
    DROP TABLE IF EXISTS table1; 
  • Tables started dropping off from Production. ☠️
  • PagerDuty Alerts bombarded.
  • Force-Set ALB rule to show Maintenance message.
  • Used point-in-time recovery (PITR) immediately to restore the DB to the time just before the incident.
  • It took 2.5 hours to recover - Partial Downtime 😞
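A statement like the `DROP TABLE IF EXISTS` above is the kind of thing a simple pre-merge check can flag. The following is a minimal sketch of such a check, not something the deck says the team ran; `find_destructive_statements` and the list of patterns are assumptions, and the check does not try to skip SQL comments or string literals.

```python
import re

# Statements that should not reach production without an explicit review.
DESTRUCTIVE_RE = re.compile(
    r"\b(DROP\s+(TABLE|DATABASE|SCHEMA)|TRUNCATE\s+TABLE)\b",
    re.IGNORECASE,
)


def find_destructive_statements(sql_text: str) -> list[tuple[int, str]]:
    """Return (line number, line) for every line with a destructive statement."""
    hits = []
    for number, line in enumerate(sql_text.splitlines(), start=1):
        if DESTRUCTIVE_RE.search(line):
            hits.append((number, line.strip()))
    return hits


dump = "DROP TABLE IF EXISTS table1;\nCREATE TABLE table1 (id INT);\n"
print(find_destructive_statements(dump))
# [(1, 'DROP TABLE IF EXISTS table1;')]
```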

Friday 8:00 PM

"Incidents like these do happen; that's fine. But why did it take 2.5 hours to recover a DB?"

- Senior Director of Engineering

πŸ€”

Lessons and Action items:

  • Upgraded Flyway to the latest version (the issue was resolved in the latest version).
    • Maintained a list of all the dependencies/frameworks we use and upgraded whichever was required. Quarterly Audit.
  • Introduced Delayed Read-Replica in RDS. (DB Recovery can be made < 5 minutes)
    • CALL mysql.rds_set_source_delay(3600); -- 1 hour
  • Enforce Idempotent Scripts
    • All PRs containing migration files automatically add SREs as reviewers.
  • No Late Deployments
    • Deployments on Tue & Thu before 4 PM - so Devs can look into them in case of post-deployment issues.
  • flywayInfo - the script runs this before running the migration.
  • Created a new RDS user without DROP privileges.
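The delayed read-replica works because the replica replays changes an hour late, so a bad statement caught within that hour has not reached it yet. The sketch below shows the arithmetic only; the function name and timestamps are hypothetical, and the actual steps for pausing and promoting the replica are not covered.

```python
from datetime import datetime, timedelta

REPLICA_DELAY = timedelta(seconds=3600)  # same value as rds_set_source_delay(3600)


def time_left_to_stop_replica(
    incident_at: datetime,
    now: datetime,
    delay: timedelta = REPLICA_DELAY,
) -> timedelta:
    """Time until the delayed replica replays the bad statement (zero if too late)."""
    remaining = incident_at + delay - now
    return max(remaining, timedelta(0))


incident = datetime(2023, 1, 6, 20, 0)        # hypothetical incident time
detected = incident + timedelta(minutes=10)   # alert acknowledged 10 minutes later

print(time_left_to_stop_replica(incident, detected))  # 0:50:00
```

If the result is zero, the replica has already applied the change and PITR is the fallback.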

The Good Stuff!

  • PagerDuty (Datadog Integration) was in place to alert on time. ✅
  • Enabled CloudWatch logging in RDS that showed us what caused this issue. ✅
  • Automatic Snapshot/PITR was in place. ✅
  • Configured Internal DNS for RDS Endpoints - Switching endpoints was easy. ✅

Project: Migrating Microservices to EKS

Architecture

Execution

Plan

  • List the microservices to be split from the monolith
  • Created a separate GitHub repository, separating all dependencies
  • Made Action Items with mini milestones:
    • Create a new VPC - in parallel to the existing one
      • Multi NAT
      • DR subnets
      • Public & Private subnets in at least 3 AZs
    • Dockerize the services (write Dockerfiles)
    • VPC peering with Platform Services **
    • Move Secrets to AWS Secrets Manager
    • Create an ECR repo for each microservice
    • Set up CodeBuild - wrote buildspec.yml
      • Push the commit hash along with the "latest" tag to identify the revision
    • Set up the EKS cluster and configure node groups
    • Install Istio on the k8s cluster and attach the ALB.
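Tagging each image with the commit hash as well as "latest" means a running container can always be traced back to a revision. A minimal sketch of computing those tags follows; `image_tags`, the 7-character short hash, the ECR repository URI, and the commit hash are all assumptions for illustration. In CodeBuild the commit ID is available in the `CODEBUILD_RESOLVED_SOURCE_VERSION` environment variable.

```python
def image_tags(repo_uri: str, commit_hash: str, short_len: int = 7) -> list[str]:
    """Image references to push for one build: the commit tag and 'latest'."""
    if len(commit_hash) < short_len:
        raise ValueError("commit hash is too short")
    short = commit_hash[:short_len]
    return [f"{repo_uri}:{short}", f"{repo_uri}:latest"]


# Hypothetical repository URI and commit hash.
tags = image_tags(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/email-service",
    "9fceb02d0ae598e95dc970b74767f19372d61af8",
)
for tag in tags:
    print(tag)
# 123456789012.dkr.ecr.us-east-1.amazonaws.com/email-service:9fceb02
# 123456789012.dkr.ecr.us-east-1.amazonaws.com/email-service:latest
```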

  • Moved all microservices to EKS on Staging. Requested QA to use the setup for a month and raise issues.
    • Most issues were related to VPC peering with platform services
  • We are in 4 regions
    • us-east-1
    • eu-central-1
    • ap-south-1
    • ap-southeast-2
  • We first moved the region with the lowest traffic.
  • Later, we gradually migrated the other regions. Finally, we migrated our primary region (US) to EKS.
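The rollout order described above (lowest traffic first, primary region last) can be written down as a small function. This is only a sketch: `rollout_order` is a hypothetical helper and the traffic shares are invented numbers, not figures from the project.

```python
def rollout_order(traffic_by_region: dict[str, float], primary: str) -> list[str]:
    """Lowest-traffic regions first; the primary region always goes last."""
    if primary not in traffic_by_region:
        raise KeyError(f"unknown primary region: {primary}")
    others = [region for region in traffic_by_region if region != primary]
    others.sort(key=lambda region: traffic_by_region[region])
    return others + [primary]


# Made-up traffic shares, for illustration only.
traffic = {
    "us-east-1": 0.55,
    "eu-central-1": 0.25,
    "ap-south-1": 0.15,
    "ap-southeast-2": 0.05,
}

print(rollout_order(traffic, primary="us-east-1"))
# ['ap-southeast-2', 'ap-south-1', 'eu-central-1', 'us-east-1']
```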

  • Train QA & Dev
  • Provide Documentation on Deployment (B/G, scaling etc.)
  • Provide access for Devs to the EKS cluster.
  • KT session on using the platform; introduced tools like Lens (Fav ❤️) and k9s to view the k8s cluster without typing kubectl commands.


MathWorks | April 11, 2023
