High Frequency Webpages

Learnings from running a huge website

How not to go insane

Learnings from running a huge website

Ffffffuuuuu......

What now? Help? Anybody?!?

Christian Riesen

he/him

Senior Developer at Liip.ch

Creator of rokka.io

20+ years of experience

PHP, JS, HTML, CSS

Can hit an orange at 30 meters using a recurve bow

How, then what

  • Problems encountered
  • Solutions implemented
  • Effect observed
  • The "what" comes at the very end

How to end up with a really slow site that somewhat works while millions of users visit

Don't panic!

Step 1: Saturate the connection

  • The site runs on a dedicated connection
  • Due to high traffic, the connection was saturated
  • Requests queue up
  • CPU load is high
  • Memory is used up
  • Throwing more hardware at it doesn't work

Get rid of everything

  • Images
  • CSS
  • JavaScript

Use a CDN

  • Servers all over the world
  • More servers where you have more users
  • Point asset URLs directly at those servers (sketch below)
  • All the magic is handled internally
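
A minimal sketch of that URL rewriting, assuming a hypothetical cdn_url() helper and made-up CDN host names; the actual CDN setup isn't covered in the slides:

```php
<?php
// Hypothetical helper: point asset paths at CDN hosts instead of the origin.
function cdn_url(string $path): string
{
    // A few CDN host names; the CDN itself routes each user to a nearby edge.
    $hosts = ['cdn1.example.com', 'cdn2.example.com', 'cdn3.example.com'];

    // Hash the path so the same asset always maps to the same host,
    // which keeps browser and edge caching effective.
    $host = $hosts[abs(crc32($path)) % count($hosts)];

    return 'https://' . $host . '/' . ltrim($path, '/');
}

// In a template:
echo '<img src="' . cdn_url('images/logo.png') . '" alt="logo">';
```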

Did it help?

  • A little
  • More users could use the site now... a win, I guess
  • Site still had slowdowns
  • Hardware still overloaded

Step 2: Dynamically render everything

  • Each request has to do checks while rendering
  • Is this an admin? Is the user logged in?

Split off the admin portion

  • A copy of the site lives under a "secret" URL, with a login
  • The public site has all admin material and checks removed (sketch below)
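
As a sketch, the public entry point might look like the snippet below; file layout and naming are assumptions. The admin copy is a separate install under a non-public URL with its own login:

```php
<?php
// public/index.php (hypothetical): no login form, no "is this an admin?"
// checks, no edit links. It only serves already-built content.
$key = preg_replace('/[^a-z0-9\-]/i', '', basename($_SERVER['REQUEST_URI'] ?? 'home'));
if ($key === '') {
    $key = 'home';
}

$file = __DIR__ . '/prerendered/' . $key . '.html';
if (!is_file($file)) {
    http_response_code(404);
    exit;
}
readfile($file);
```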

Did it help?

  • For the admins, yes, a lot
  • No detectable changes for users

Step 3: Render all data live

  • Access the DB on each request
  • Take the page data (title, description) and render it
  • Re-render the same things over and over and...

Pre-render

  • Render each page into a static template file (sketch below)
  • Still need to replace the CDN location at request time, but that's cheap
  • A background task checks for old pre-renders and rebuilds them
  • On admin changes, rebuild the pre-render
  • Exclude very dynamic pages from pre-rendering (lists, search)
  • Pre-render snippets instead (list entries, search result lines, ESI)
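
A minimal sketch of the idea in PHP, assuming a simple pages table, a file-based pre-render directory, and an hourly rebuild; all names and layout are illustrative, not details from the talk:

```php
<?php
const PRERENDER_DIR = __DIR__ . '/prerendered';
const MAX_AGE       = 3600; // assumed rebuild threshold: one hour

// Render one page from DB data into a static file, once.
function prerender(PDO $db, int $pageId): void
{
    $stmt = $db->prepare('SELECT title, body FROM pages WHERE id = ?');
    $stmt->execute([$pageId]);
    $page = $stmt->fetch(PDO::FETCH_ASSOC);

    // The only per-request work left is swapping in the CDN location,
    // so the placeholder stays in the file.
    $html = '<html><head><title>' . htmlspecialchars($page['title']) . '</title></head>'
          . '<body><img src="{{CDN}}/logo.png">' . $page['body'] . '</body></html>';

    file_put_contents(PRERENDER_DIR . "/page-$pageId.html", $html);
}

// Serve a pre-rendered page: no DB access, just a cheap string replace.
function serve(int $pageId, string $cdnBase): void
{
    $html = file_get_contents(PRERENDER_DIR . "/page-$pageId.html");
    echo str_replace('{{CDN}}', $cdnBase, $html);
}

// Background task: rebuild stale pre-renders so no request ever waits.
// Admin changes would call prerender() directly for the touched page.
function rebuildStale(PDO $db): void
{
    foreach (glob(PRERENDER_DIR . '/page-*.html') ?: [] as $file) {
        if (time() - filemtime($file) > MAX_AGE) {
            prerender($db, (int) preg_replace('/\D/', '', basename($file)));
        }
    }
}
```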

Did it help?

  • Reduced DB load massively
  • Decent load reduction on web server
  • Output went up by about 50%

Step 4: Search

  • Search the database with LIKE queries
  • Don't cache anything
  • Don't track anything
  • In short: if it's a bad idea in this area, do it

Offload to an indexing engine

  • Feed the data into the engine
  • Normalize search inputs
  • Run searches against the engine
  • The engine returns IDs
  • Cache the query together with its result IDs
  • Show pre-rendered snippets for those IDs (sketch below)
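
The slides don't name the engine or the cache, so the sketch below assumes a generic engine interface and APCu as the query cache:

```php
<?php
// Whatever indexing engine is used, the site only needs "query in, IDs out".
interface SearchEngine
{
    /** @return int[] matching content IDs, best match first */
    public function search(string $query): array;
}

// Lowercase, strip punctuation, collapse whitespace: the same search always
// produces the same engine query and the same cache key.
function normalize(string $query): string
{
    $query = mb_strtolower(trim($query));
    $query = preg_replace('/[^\p{L}\p{N}\s]/u', '', $query);
    return preg_replace('/\s+/', ' ', $query);
}

function runSearch(SearchEngine $engine, string $rawQuery): string
{
    $query = normalize($rawQuery);
    $key   = 'search:' . md5($query);

    // Cache the query together with its result IDs, not the rendered HTML.
    $ids = apcu_fetch($key, $found);
    if (!$found) {
        $ids = $engine->search($query);
        apcu_store($key, $ids, 300);
    }

    // Assemble the result page from pre-rendered snippets, one file per ID.
    $html = '';
    foreach ($ids as $id) {
        $html .= file_get_contents(__DIR__ . "/prerendered/snippet-$id.html");
    }
    return $html;
}
```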

Track searches

  • Track which queries are used
  • How many times a query is used
  • How many results the query returned
  • Track per day, month, and year (sketch below)
  • Improve search results by looking at the most-used queries that returned 0 results
  • Track when queries were optimized
  • Add a hidden "keywords" field so content can be found
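
A sketch of the per-day tracking, assuming a MySQL table like the one in the comment; the real schema isn't shown in the talk, and month and year numbers can be aggregated from the daily rows:

```php
<?php
// Assumed schema (not from the talk):
// CREATE TABLE search_stats (
//   day     DATE         NOT NULL,
//   query   VARCHAR(255) NOT NULL,
//   uses    INT          NOT NULL DEFAULT 0,
//   results INT          NOT NULL DEFAULT 0,
//   PRIMARY KEY (day, query)
// );

// Count every search once per day, remembering how many results it returned.
function trackSearch(PDO $db, string $normalizedQuery, int $resultCount): void
{
    $stmt = $db->prepare(
        'INSERT INTO search_stats (day, query, uses, results)
         VALUES (CURDATE(), :q, 1, :r)
         ON DUPLICATE KEY UPDATE uses = uses + 1, results = VALUES(results)'
    );
    $stmt->execute([':q' => $normalizedQuery, ':r' => $resultCount]);
}

// The most-used queries that returned nothing: the list to fix first,
// typically by adding terms to the hidden "keywords" field on the content.
function mostUsedZeroResultQueries(PDO $db): array
{
    return $db->query(
        'SELECT query, SUM(uses) AS uses
           FROM search_stats
          WHERE results = 0
          GROUP BY query
          ORDER BY uses DESC
          LIMIT 50'
    )->fetchAll(PDO::FETCH_ASSOC);
}
```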

Did it help?

  • Lightning fast searches
  • More meaningful hits
  • Better user experience
  • Fewer resources used overall
  • Page got a good speed boost

Step 5: Advertisements

  • Ads are placed by country
  • Tracking of ads, so users don't get too many
  • Reporting of ad displays

If you have to do it...

  • Use the same magic as with the CDN
  • Use existing stats (Analytics) to estimate views (sketch below)
  • Simply don't work with impressions or clicks, but with time
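
One way to read that last point, as a back-of-the-envelope sketch: estimate views from existing analytics and sell the slot for a period of time. The numbers and the slot-share factor below are purely illustrative:

```php
<?php
// Estimate how many views a time-booked ad will roughly get, using average
// daily pageviews from analytics instead of per-request ad tracking.
function estimateAdViews(int $avgDailyPageviews, int $days, float $slotShare): int
{
    // $slotShare: fraction of pageviews on which this slot shows the ad,
    // e.g. 0.5 if the slot rotates between two campaigns.
    return (int) round($avgDailyPageviews * $days * $slotShare);
}

// Example: 2 million pageviews/day, booked for 7 days, rotating 1-of-2.
echo estimateAdViews(2000000, 7, 0.5); // roughly 7,000,000 estimated views
```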

Did it help?

  • Near-zero impact for end users
  • More talking to ad customers
  • More time spent making stats for customers

Step 6: Add customization

  • Custom feeds
  • Favorites
  • Tracking of groupings of interests
  • Comments

Progressive enhancement

  • Load all JS files from external servers again
  • Deliver the same HTML to everyone
  • Use JS magic to add the customized parts (sketch below)
  • Heavy use of caching and pre-rendering
  • Invalidate caches and re-render on content changes
  • Pray nobody figures out what you are doing
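
A sketch of the server side of that enhancement, assuming a session-based login and a hypothetical favorites table: everyone gets the same cached HTML, and a few lines of JS fetch this endpoint to fill in the personal parts.

```php
<?php
// personal.php (hypothetical): the only per-user request, kept tiny.
session_start();
header('Content-Type: application/json');

// Anonymous visitors get an instant "nothing personal to add" answer.
if (empty($_SESSION['user_id'])) {
    echo json_encode(['loggedIn' => false]);
    exit;
}

// Connection details are placeholders.
$db   = new PDO('mysql:host=localhost;dbname=site', 'user', 'secret');
$stmt = $db->prepare('SELECT content_id FROM favorites WHERE user_id = ?');
$stmt->execute([$_SESSION['user_id']]);

echo json_encode([
    'loggedIn'  => true,
    'favorites' => $stmt->fetchAll(PDO::FETCH_COLUMN),
]);

// The cached page contains a few lines of JS that fetch this endpoint and
// mark the favorites; without JS, everyone still gets the full public page.
```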

Did it help?

  • Not really, but at least it didn't hurt... much
  • Database queries spiked; memory caches helped
  • Figuring out what was stale and what wasn't had to be parallelized
  • To the user, fake instant responses are indistinguishable from real ones

Guess the year

What did you guess?

2001

It looked worse in 2001; this is the only complete image I have

ShareReactor

  • Started in the middle of 2001
  • By the end of 2001 the line was already saturated
  • Only had metadata links (ed2k, magnet)
  • Zero downloads

Throw out the rules
What you know is wrong

  • Large scale doesn't play by the rules
  • Your knowledge doesn't work at insane scale

Keep it as simple as possible

  • Pre-render proactively, never on request
  • File caches work just fine, nothing fancy needed
  • Memory caches are good, but they can fail; what then?
  • The DB is slow at searching, use indexing engines
  • Monitor and track all the things
  • Know the internals of your server software
  • A saturated connection is a pain to deal with

Database de-normalization

  • Wisdom says don't do it
  • Millions of requests say otherwise
  • From atomic to eventual consistency (sketch below)
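
As an illustration of the trade-off, assuming hypothetical pages and comments tables: keep a counter column on the page row instead of running COUNT(*) on every view, and accept that it is only eventually consistent.

```php
<?php
// Write path: bump the de-normalized counter when a comment is added.
function addComment(PDO $db, int $pageId, int $userId, string $text): void
{
    $db->beginTransaction();
    $db->prepare('INSERT INTO comments (page_id, user_id, body) VALUES (?, ?, ?)')
       ->execute([$pageId, $userId, $text]);

    // Read millions of times per day, written rarely: worth duplicating.
    $db->prepare('UPDATE pages SET comment_count = comment_count + 1 WHERE id = ?')
       ->execute([$pageId]);
    $db->commit();
}

// Background task: the counter can drift (deletions, failed writes), so
// recount periodically and accept eventual rather than atomic consistency.
function recountComments(PDO $db): void
{
    $db->exec(
        'UPDATE pages p
            SET comment_count =
                (SELECT COUNT(*) FROM comments c WHERE c.page_id = p.id)'
    );
}
```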

CDNs: use them

  • Images (rokka.io)
  • General requests (Cloudflare, Cloudfront)
  • Deliver only dynamic content from server

Progression

A different beast

  • Large sites behave differently
  • Bad ideas suddenly become good solutions
  • Experiment
  • Measure
  • Don't panic!

Thank you for listening

I hope you enjoyed the talk

Questions?
