Stack Overflow and its tags: a true story
Marco Cecconi
@sklivvz http://sklivvz.com
Scale
- 1-100
You & your friends - 100-10,000
Early Product - 10,000-100,000,000
<A TON OF PROBLEMS> - 100,000,000-Infinity
World domination? I can't even
windows*
- Custom sorts
- Get page N of matches
- Count the matches
- Find related tags
Our Mission
Find all questions tagged "javascript" but not "jquery"
SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
SELECT TOP 50 Tag1.PostId, LastActivityDate
FROM (
SELECT p.Id PostId, LastActivityDate
FROM Posts p
JOIN PostTags pt ON p.Id = pt.PostId
WHERE pt.TagId = 3
) Tag1
LEFT JOIN (
SELECT p.Id PostId
FROM Posts p
JOIN PostTags pt ON p.Id = pt.PostId
WHERE pt.TagId = 820
) Tag2 ON Tag1.PostId = Tag2.PostId
WHERE Tag2.PostId IS NULL
ORDER BY LastActivityDate DESC
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
GO
Table 'PostTags'. Scan count 5, logical reads 6344, [...]
Table 'Posts'. Scan count 5, logical reads 302062, [...]
SQL Server Execution Times: CPU time = 5173 ms, elapsed time = 2194 ms.
Related Tags:
Find the top 10 tags in those questions
SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
SELECT TOP 10 PostTags.TagId, COUNT(1) as Number
FROM (
SELECT p.Id PostId, LastActivityDate
FROM Posts p
JOIN PostTags pt ON p.Id = pt.PostId
WHERE pt.TagId = 3
) Tag1
LEFT JOIN (
SELECT p.Id PostId
FROM Posts p
JOIN PostTags pt ON p.Id = pt.PostId
WHERE pt.TagId = 820
) Tag2 ON Tag1.PostId = Tag2.PostId
JOIN PostTags ON PostTags.PostId = Tag1.PostId
WHERE Tag2.PostId IS NULL
AND PostTags.TagId NOT IN (3, 820)
GROUP BY PostTags.TagId
ORDER BY 2 DESC
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
GO
Table 'PostTags'. Scan count 15, logical reads 169532, [...]
read-ahead reads 122731 [...]
Table 'Posts'. Scan count 10, logical reads 329038, [...]
SQL Server Execution Times:
CPU time = 18875 ms, elapsed time = 5399 ms.
SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
SELECT TOP 100 Id, LastActivityDate
FROM Posts
WHERE PostTypeId = 1
AND CONTAINS(Tags, '"javascript" AND NOT "jquery"')
ORDER BY 2 DESC
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
GO
Table 'Posts'. Scan count 5, logical reads 179483, [...]
SQL Server Execution Times:
CPU time = 5202 ms, elapsed time = 2044 ms.
- Only "letters" allowed, but tags have hyphens, octathorpes...
- Replace them with "latin letters" e.g. "cç"...
- Stop words like "and", "or" are not indexed...
- Wrap all words in other "latin letters", e.g. "àsqlé"
CONTAINS (tags, 'éûnetà AND écñà')
- Keep a list of all the questions (Ids and data used for sorting only)
- Use optimized sort algorithms or sub-indices
The tag engine typically serves out queries at about 1ms per query. Our questions list page runs at an average of about 35ms per page render, over hundreds of thousands of runs a day.
-- Sam Saffron, "In Managed Code We Trust"
-
List<Posts>
- Dynamically allocates space
- Keeps a list of pointers only
- Dynamically allocates space
-
Post[]
- Space fixed once and for all at load time
-
struct Post
- No pointer, direct embedding
Game programming tricks
-
Post.List<Tags>
- Becomes a pointer
- Size of "Post" is variable
-
Post.Tags[5]
- Space fixed once and for all at load time
Game programming tricks
-
Index = int[]
- C-style "pointers"
-
fixed
- Prevents GC from moving memory
Game programming tricks
Just use map and reduce, right?
var results = new int[50];
var count = 0;
var relatedTags = new Dictionary<string, int>();
for (var i = 0; i < Index.length /* 10,000,000 */; i++)
{
if (matcher.Match(Posts[Index[i]]) {
count++;
if (count<51) {
results[count-1] = Posts[Index[i]];
}
}
for(int t = 0; t<5; t++) {
relatedTags[Posts[Index[i]].Tags[t]] =
relatedTags[Posts[Index[i]].Tags[t]] + 1;
}
}
Just go parallel, right?
tag engine
CPU Cores
tag engine
CPU Cores
tag engine
CPU Cores
tag engine
CPU Cores
tag engine
CPU Cores
- Many concurrent queries, few cores
- Measure all the things!
- Optimal parallelism: 2 threads
- Better, but same order of magnitude
var results = new int[50];
var count = 0;
var relatedTags = new Dictionary<string, int>();
for (var i = 0; i < Index.length /* 10,000,000 */; i++)
{
if (matcher.Match(Posts[Index[i]]) {
count++;
if (count<51) {
results[count-1] = Posts[Index[i]];
}
}
for(int t = 0; t<5; t++) {
relatedTags[Posts[Index[i]].Tags[t]] =
relatedTags[Posts[Index[i]].Tags[t]] + 1;
}
}
public class Matcher {
public bool NonClosedOnly { get; set; }
public bool NoAnswerOnly { get; set; }
public bool BountiesOnly { get; set; }
/* a few of these */
public static bool Match(Post) {
if (NonClosedOnly && Post.DateClosed != null) return false;
if (NoAnswerOnly && Post.Answers > 0) return false;
if (BountiesOnly && Post.BountyAmount == 0) return false;
/** a few of these **/
return true;
}
}
public interface IMatcher { public bool Match(Post post) }
public class MatcherGenerator {
public bool NonClosedOnly { get; set; }
public bool NoAnswerOnly { get; set; }
public bool BountiesOnly { get; set; }
/* a few of these */
public Func<Post, bool> GenerateMatch() {
var il = IL.Implement(typeof(IMatcher).GetMethod(nameof(IMatcher.Match)));
Label exclude = il.DefineLabel();
if (NonClosedOnly) {
EmitMemberGet(il, nameof(QuestionValue.IsClosed));
il.Emit(OpCodes.Brtrue, exclude);
}
if (NoAnswersOnly) {
EmitMemberGet(il, nameof(QuestionValue.AnswerCount));
il.Emit(OpCodes.Brtrue, exclude);
}
if (BountiesOnly) {
EmitMemberComparison(il, nameof(QuestionValue.BountySize), null);
il.Emit(OpCodes.Brtrue, exclude);
}
/* a few of these */
il.EmitLabel(exclude); /* etc */
return il.CreateDelegate();
}
}
Just approximate, right?
Many known cases where the top N results are not representative of the rest of the dataset
e.g. "Questions with bounties sorted by newest"
MongoDB is web scale, right?
MongoDB is web scale, right?
Not fast enough
Preliminary results
Time from 1500ms down to 700 (worst cases), but many hits <100ms
tag engine v2
CPU Cores
tag engine v2
CPU Cores
tag engine v2
CPU Cores
tag engine v2 GPU
CPU Cores
GPU Cores
tag engine v2
CPU Cores
GPU Cores
Simple query
tag engine v2
CPU Cores
GPU Cores
Complex query
tag engine v2
CPU Cores
GPU Cores
Complex query
Simple query
Benchmarks
http://blog.marcgravell.com/2016/05/how-i-found-cuda-or-rewriting-tag.html
Marc Gravell
- Measure everything
- Don't assume anything
- Assume every canned recipe is wrong
- Keep on evolving
...the moral of the story
Performance tuning Stack Overflow Tags
By Marco Cecconi
Performance tuning Stack Overflow Tags
- 412