Stack Overflow and its tags: a true story

Marco Cecconi

         @sklivvz         http://sklivvz.com

Scale

  • 1-100
    You & your friends
  • 100-10,000
    Early Product
  • 10,000-100,000,000
    <A TON OF PROBLEMS>
  • 100,000,000-Infinity
    World domination? I can't even

windows*

  • Custom sorts
  • Get page N of matches
  • Count the matches
  • Find related tags

Our Mission

Find all questions tagged "javascript" but not "jquery"

SET STATISTICS IO ON 
SET STATISTICS TIME ON 
GO

SELECT    TOP 50 Tag1.PostId, LastActivityDate
FROM      (
    SELECT p.Id PostId, LastActivityDate
      FROM Posts p
      JOIN PostTags pt ON p.Id = pt.PostId 
     WHERE pt.TagId = 3
    ) Tag1
LEFT JOIN (
    SELECT p.Id PostId
      FROM Posts p
      JOIN PostTags pt ON p.Id = pt.PostId 
     WHERE pt.TagId = 820
    ) Tag2 ON Tag1.PostId = Tag2.PostId
WHERE     Tag2.PostId IS NULL
ORDER BY  LastActivityDate DESC

SET STATISTICS TIME OFF
SET STATISTICS IO OFF
GO
Table 'PostTags'. Scan count 5, logical reads 6344, [...]
Table 'Posts'. Scan count 5, logical reads 302062, [...]
SQL Server Execution Times: CPU time = 5173 ms, elapsed time = 2194 ms. ​

Related Tags:

Find the top 10 tags in those questions

SET STATISTICS IO ON 
SET STATISTICS TIME ON 
GO

SELECT    TOP 10 PostTags.TagId, COUNT(1) as Number
FROM      (
    SELECT p.Id PostId, LastActivityDate
      FROM Posts p
      JOIN PostTags pt ON p.Id = pt.PostId 
     WHERE pt.TagId = 3
    ) Tag1
LEFT JOIN (
    SELECT p.Id PostId
      FROM Posts p
      JOIN PostTags pt ON p.Id = pt.PostId 
     WHERE pt.TagId = 820
    ) Tag2 ON Tag1.PostId = Tag2.PostId
JOIN      PostTags ON PostTags.PostId = Tag1.PostId
WHERE     Tag2.PostId IS NULL
AND       PostTags.TagId NOT IN (3, 820)
GROUP BY  PostTags.TagId
ORDER BY  2 DESC

SET STATISTICS TIME OFF
SET STATISTICS IO OFF
GO
Table 'PostTags'. Scan count 15, logical reads 169532, [...] 
read-ahead reads 122731 [...]
Table 'Posts'. Scan count 10, logical reads 329038, [...] 

SQL Server Execution Times: 
  CPU time = 18875 ms, elapsed time = 5399 ms. 
SET STATISTICS IO ON 
SET STATISTICS TIME ON 
GO

SELECT   TOP 100 Id, LastActivityDate
FROM     Posts 
WHERE    PostTypeId = 1
AND      CONTAINS(Tags, '"javascript" AND NOT "jquery"')
ORDER BY 2 DESC

SET STATISTICS IO OFF
SET STATISTICS TIME OFF 
GO
Table 'Posts'. Scan count 5, logical reads 179483, [...]

SQL Server Execution Times: 
CPU time = 5202 ms, elapsed time = 2044 ms. ​ 
  • Only "letters" allowed, but tags have hyphens, octathorpes...
  • Replace them with "latin letters" e.g. "cç"...
  • Stop words like "and", "or" are not indexed...
  • Wrap all words in other "latin letters", e.g. "àsqlé"
CONTAINS (tags, 'éûnetà AND écñà')
  • Keep a list of all the questions (Ids and data used for sorting only)
  • Use optimized sort algorithms or sub-indices

The tag engine typically serves out queries at about 1ms per query. Our questions list page runs at an average of about 35ms per page render, over hundreds of thousands of runs a day.

-- Sam Saffron, "In Managed Code We Trust"

  • List<Posts>
    • Dynamically allocates space
    • Keeps a list of pointers only
  • Post[] 
    • Space fixed once and for all at load time
  • struct Post
    • No pointer, direct embedding

Game programming tricks

  • Post.List<Tags>
    • Becomes a pointer
    • Size of "Post" is variable
  • Post.Tags[5] 
    • Space fixed once and for all at load time

Game programming tricks

  • Index = int[]
    • ​C-style "pointers"
  • fixed 
    • Prevents GC from moving memory

Game programming tricks

Just use map and reduce, right?

var results = new int[50];
var count = 0;
var relatedTags = new Dictionary<string, int>();

for (var i = 0; i < Index.length /* 10,000,000 */; i++)
{
    if (matcher.Match(Posts[Index[i]]) {
        count++;
        if (count<51) {
            results[count-1] = Posts[Index[i]]; 
        }
    }
    
    for(int t = 0; t<5; t++) {
        relatedTags[Posts[Index[i]].Tags[t]] = 
            relatedTags[Posts[Index[i]].Tags[t]] + 1;
    }
}

Just go parallel, right?

tag engine

CPU Cores

tag engine

CPU Cores

tag engine

CPU Cores

tag engine

CPU Cores

tag engine

CPU Cores

  • Many concurrent queries, few cores
  • Measure all the things!
  • Optimal parallelism: 2 threads
  • Better, but same order of magnitude
var results = new int[50];
var count = 0;
var relatedTags = new Dictionary<string, int>();

for (var i = 0; i < Index.length /* 10,000,000 */; i++)
{
    if (matcher.Match(Posts[Index[i]]) {
        count++;
        if (count<51) {
            results[count-1] = Posts[Index[i]]; 
        }
    }
    
    for(int t = 0; t<5; t++) {
        relatedTags[Posts[Index[i]].Tags[t]] = 
            relatedTags[Posts[Index[i]].Tags[t]] + 1;
    }
}
public class Matcher {

    public bool NonClosedOnly { get; set; }
    public bool NoAnswerOnly { get; set; }
    public bool BountiesOnly { get; set; }

    /* a few of these */

    public static bool Match(Post) {
        if (NonClosedOnly && Post.DateClosed != null) return false;
        if (NoAnswerOnly && Post.Answers > 0) return false;
        if (BountiesOnly && Post.BountyAmount == 0) return false;

        /** a few of these **/

        return true;
    }
}
public interface IMatcher { public bool Match(Post post) }

public class MatcherGenerator {
    public bool NonClosedOnly { get; set; }
    public bool NoAnswerOnly { get; set; }
    public bool BountiesOnly { get; set; }
    /* a few of these */

    public Func<Post, bool> GenerateMatch() {
        var il = IL.Implement(typeof(IMatcher).GetMethod(nameof(IMatcher.Match)));
        Label exclude = il.DefineLabel();

        if (NonClosedOnly) {
            EmitMemberGet(il, nameof(QuestionValue.IsClosed));
            il.Emit(OpCodes.Brtrue, exclude);
        }
        if (NoAnswersOnly) {
            EmitMemberGet(il, nameof(QuestionValue.AnswerCount));
            il.Emit(OpCodes.Brtrue, exclude);
        }
        if (BountiesOnly) {
            EmitMemberComparison(il, nameof(QuestionValue.BountySize), null);
            il.Emit(OpCodes.Brtrue, exclude);
        }
        /* a few of these */

        il.EmitLabel(exclude); /* etc */
        return il.CreateDelegate();
    }
}

Just approximate, right?

Many known cases where the top N results are not representative of the rest of the dataset

 

e.g. "Questions with bounties sorted by newest"

MongoDB is web scale, right?

MongoDB is web scale, right?

Not fast enough

Preliminary results

Time from 1500ms down to 700 (worst cases), but many hits <100ms

tag engine v2

CPU Cores

tag engine v2

CPU Cores

tag engine v2

CPU Cores

tag engine v2 GPU

CPU Cores

GPU Cores

tag engine v2

CPU Cores

GPU Cores

Simple query

tag engine v2

CPU Cores

GPU Cores

Complex query

tag engine v2

CPU Cores

GPU Cores

Complex query

Simple query

Benchmarks

http://blog.marcgravell.com/2016/05/how-i-found-cuda-or-rewriting-tag.html

Marc Gravell

  • Measure everything
  • Don't assume anything
  • Assume every canned recipe is wrong
  • Keep on evolving

...the moral of the story

Performance tuning Stack Overflow Tags

By Marco Cecconi

Performance tuning Stack Overflow Tags

  • 412