Not Just A Blog, But A Knowledge Warehouse

How to make your content searchable

Gleb Bahmutov

@bahmutov

our planet is in imminent danger

survival is possible* but we need to act now

  • change your life
  • dump banks financing fossil projects
  • join an organization

Agenda

  • My personal blogging story
  • Common blogging mistakes
  • Beyond writing: slides & videos
  • Search index
    • scraping HTML
    • scraping slides
    • scraping videos

Speaker: Gleb Bahmutov PhD

C / C++ / C# / Java / CoffeeScript / JavaScript / Node / Angular / Vue / Cycle.js / functional programming / testing

Gleb Bahmutov

Sr Director of Engineering

Variety

  • Setting up PC / OS / tools
  • Using Git / Node / X
  • Debugging Y / Implementing Z

How did I do ...?

I know, I have done this before, it is just ...

2006 - company wiki?

2008 - company wiki

2011 - company wiki / TiddlyWiki

2013 - personal blog!

✅ write for yourself

✅ invest in your expertise

✅ level up in your career

💀 against company policy?

15+9+21+138+424=607

Hexo + GitHub Pages + Custom domain

My blog index is just a single giant HTML page (80KB gzip)

Markdown, local images (optimized with pngquant-bin), GIFs using Kap

Approx. 6 posts per month

Thinking Is Hard. Writing Is Mechanical

You Are Your Audience

2000 readers per day

Most popular posts: "Setting Up Prettier In VSCode" and posts about recruiting

Tips & Tricks & Common Mistakes

✅ Use your domain

✅ Keep it simple

You do not need to support guest authors / fancy editors / syndication

You want to be able to run the blog locally

$ npm start

✅ Use tags and categories

tags

category

✅ Have clear dates

✍️ Do not shy away from updating and expanding the posts 

updating (really) old posts

📝 TOC for longer posts

🎓 Explain what the reader will learn from reading the post

🎁 Links to the source code

🎁🎁🎁 Make the source evergreen

🔗 Link the blog to your other projects

🍬 Reward the readers who have reached the finish line

All The Knowledge is There

Gleb, I feel half of your messages to me are links to your blog posts

Cmd+F works on the current page

Tip: use longer titles and descriptive abstracts

Confession: I often do/did not know where to find what I have done ...

Google to the rescue?

Nope

Algolia

  • I am unaffiliated, just a user / Ambassador

  • Generous free plan

  • Good documentation, scraping tools, nice UI

Create Algolia App

Create Search Index

Workflow

  1. After every deploy 🔼
    1. Scrape the site 🕷
  2. Set search widget to point at the index using Algolia SDKs
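
The second step of the workflow comes down to sending queries to the index. The slides use the Algolia SDKs for this; the sketch below instead builds the equivalent request against Algolia's search REST endpoint with only the Python standard library, so the moving parts are visible. The function name `build_search_request` and the placeholder credentials are assumptions for illustration, not part of the talk.

```python
import json
import urllib.parse
import urllib.request


def build_search_request(app_id, search_key, index_name, query, hits_per_page=5):
    """Build (but do not send) a search request for one Algolia index."""
    # Index names may contain spaces, so URL-encode them in the path.
    path = "/1/indexes/" + urllib.parse.quote(index_name, safe="") + "/query"
    url = f"https://{app_id}-dsn.algolia.net{path}"
    # Search parameters travel as a URL-encoded string inside the JSON body.
    params = urllib.parse.urlencode({"query": query, "hitsPerPage": hits_per_page})
    body = json.dumps({"params": params}).encode("utf-8")
    headers = {
        "X-Algolia-Application-Id": app_id,
        # Assumption: a search-only key, never the admin key used for scraping.
        "X-Algolia-API-Key": search_key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the returned request to `urllib.request.urlopen` and decoding the JSON response should yield the matches under the `hits` key. A search widget on the page does the same thing on every keystroke.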

Scraping the Site

Document Structure

{
  "index_name": "scrape-test",
  "start_urls": ["https://glebbahmutov.com/triple-tested/"],
  "selectors": {
    "lvl0": {
      "selector": ".site-name",
      "global": true
    },
    "lvl1": ".content__default h1",
    "lvl2": ".content__default h2",
    "lvl3": ".content__default h3",
    "lvl4": ".content__default h4",
    "lvl5": ".content__default h5",
    "text": ".content__default p, .content__default li"
  }
}

Algolia config (JSON)
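
To see what the `lvl1` ... `lvl5` and `text` selectors are for, here is a simplified illustration of the idea: every paragraph or list item becomes one search record that carries the headings above it. This is a sketch, not the scraper's actual implementation. It matches plain `h1` to `h3`, `p` and `li` tags and ignores the CSS classes from the config, and the names `HeadingRecords` and `extract_records` are made up for this example.

```python
from html.parser import HTMLParser


class HeadingRecords(HTMLParser):
    """Collect one record per text element, tagged with the headings above it."""

    LEVELS = {"h1": "lvl1", "h2": "lvl2", "h3": "lvl3"}
    TEXT_TAGS = {"p", "li"}

    def __init__(self):
        super().__init__()
        self.hierarchy = {}   # current heading at each level
        self.records = []     # finished search records
        self._tag = None      # element currently being captured
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.LEVELS or tag in self.TEXT_TAGS:
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buffer).strip()
        if tag in self.LEVELS:
            level = self.LEVELS[tag]
            # A new heading resets its own level and every deeper one.
            self.hierarchy = {k: v for k, v in self.hierarchy.items() if k < level}
            self.hierarchy[level] = text
        elif text:
            self.records.append({**self.hierarchy, "text": text})
        self._tag = None


def extract_records(html):
    parser = HeadingRecords()
    parser.feed(html)
    return parser.records
```

Because each record keeps its heading hierarchy, a search hit can be displayed with its section and linked to the right place on the page, not just to the top of the post.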

Algolia Config For Blog

{
  "index_name": "Gleb Blog Cypress Posts",
  "start_urls": [
    "https://glebbahmutov.com/blog/"
  ],
  "stop_urls": [
    "https://glebbahmutov.com/blog/archives/",
    "https://glebbahmutov.com/blog/tags/"
  ],
  "js_render": false,
  "selectors_exclude": ["header", "footer", "#sidebar"],
  "selectors": {
    "lvl0": {
      "selector": "title",
      "global": true
    },
    "lvl1": ".article .article-inner .article-entry h2",
    "lvl2": ".article .article-inner .article-entry h3",
    "lvl3": ".article .article-inner .article-entry h4",
    "lvl4": ".article .article-inner .article-entry .caption",
    "text": "header.article-header h2, .article .article-inner .article-entry p, .article .article-inner .article-entry figure.highlight .comment"
  }
}

Scraping

# when scraping the site, inject secrets as environment variables
# then pass their values into the Docker container using "-e" syntax
# and inject config.json contents as another variable
- name: scrape the site 🧽
  env:
    APPLICATION_ID: ${{ secrets.APPLICATION_ID }}
    API_KEY: ${{ secrets.API_KEY }}
  run: |
    docker run \
    -e APPLICATION_ID -e API_KEY \
    -e CONFIG="$(cat config.json)" \
    algolia/docsearch-scraper:v1.6.0

use Algolia Docker image

Presentations about documentation search

Scraping The Blog Posts

  • Only a few posts change every week
    • scraping every post is wasteful
  • Incremental scraping
    • keep timestamps of scraped posts
  • Stay tuned for a blog post!
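
The incremental idea above can be sketched in a few lines: compare each post's last-modified time with the time it was last scraped, and only re-scrape the posts that are new or have changed. This is a minimal sketch under stated assumptions, not the author's implementation. The function names and the in-memory dictionaries are hypothetical; a real setup would need to persist the timestamps between runs.

```python
from datetime import datetime, timezone


def posts_to_scrape(posts, scraped_at):
    """Return URLs of posts that are new or modified since their last scrape.

    posts: mapping of URL -> last modified time
    scraped_at: mapping of URL -> time of the last successful scrape
    """
    stale = []
    for url, modified in posts.items():
        last_scrape = scraped_at.get(url)
        if last_scrape is None or modified > last_scrape:
            stale.append(url)
    return sorted(stale)


def mark_scraped(scraped_at, urls, now=None):
    """Return a copy of the timestamps with the given URLs marked as scraped."""
    now = now or datetime.now(timezone.utc)
    updated = dict(scraped_at)
    for url in urls:
        updated[url] = now
    return updated
```

After a successful run, the result of `mark_scraped` becomes the `scraped_at` input of the next run, so an unchanged post is never scraped twice.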

Where Is Your Knowledge?

👍 positive reactions

  • Incremental scraper
  • Take title and description
  • (future) Take the transcript

YouTube
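
Taking the title and description of a video could look roughly like this: each video becomes one flat record that can sit in the same index as the blog posts. The field names other than Algolia's `objectID`, and the function name `video_to_record`, are assumptions for illustration. How the metadata is fetched from YouTube is left out.

```python
def video_to_record(video_id, title, description):
    """Turn one video's metadata into a flat, searchable record."""
    # Collapse whitespace so a multi-line description indexes as one text field.
    text = " ".join(description.split())
    return {
        "objectID": f"youtube-{video_id}",  # stable ID, so re-runs update in place
        "url": f"https://www.youtube.com/watch?v={video_id}",
        "type": "video",
        "title": title.strip(),
        "text": text,
    }
```

A transcript, once available, could be added as extra records that share the same `url`.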

So How Do I Find It?

You Want Your Coworkers to Find Answers to Their Questions by Themselves Using Your Search

💡 Instead of Answering Questions 1 on 1

Update the documentation, or create an example, or write a blog post, or record a video

Then answer with a link to your search page 🔎

Build Up Your Expertise

Learn And Share In Public

Implement A Way To Find Things

Thank you 👏

Gleb Bahmutov

@bahmutov