Marco Cecconi
@sklivvz http://sklivvz.com
windows*
Find all questions tagged "javascript" but not "jquery"
SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
SELECT TOP 50 Tag1.PostId, LastActivityDate
FROM (
SELECT p.Id PostId, LastActivityDate
FROM Posts p
JOIN PostTags pt ON p.Id = pt.PostId
WHERE pt.TagId = 3
) Tag1
LEFT JOIN (
SELECT p.Id PostId
FROM Posts p
JOIN PostTags pt ON p.Id = pt.PostId
WHERE pt.TagId = 820
) Tag2 ON Tag1.PostId = Tag2.PostId
WHERE Tag2.PostId IS NULL
ORDER BY LastActivityDate DESC
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
GOTable 'PostTags'. Scan count 5, logical reads 6344, [...]
Table 'Posts'. Scan count 5, logical reads 302062, [...]
SQL Server Execution Times: CPU time = 5173 ms, elapsed time = 2194 ms. Related Tags:
Find the top 10 tags in those questions
SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
SELECT TOP 10 PostTags.TagId, COUNT(1) as Number
FROM (
SELECT p.Id PostId, LastActivityDate
FROM Posts p
JOIN PostTags pt ON p.Id = pt.PostId
WHERE pt.TagId = 3
) Tag1
LEFT JOIN (
SELECT p.Id PostId
FROM Posts p
JOIN PostTags pt ON p.Id = pt.PostId
WHERE pt.TagId = 820
) Tag2 ON Tag1.PostId = Tag2.PostId
JOIN PostTags ON PostTags.PostId = Tag1.PostId
WHERE Tag2.PostId IS NULL
AND PostTags.TagId NOT IN (3, 820)
GROUP BY PostTags.TagId
ORDER BY 2 DESC
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
GOTable 'PostTags'. Scan count 15, logical reads 169532, [...]
read-ahead reads 122731 [...]
Table 'Posts'. Scan count 10, logical reads 329038, [...]
SQL Server Execution Times:
CPU time = 18875 ms, elapsed time = 5399 ms. SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
SELECT TOP 100 Id, LastActivityDate
FROM Posts
WHERE PostTypeId = 1
AND CONTAINS(Tags, '"javascript" AND NOT "jquery"')
ORDER BY 2 DESC
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
GOTable 'Posts'. Scan count 5, logical reads 179483, [...]
SQL Server Execution Times:
CPU time = 5202 ms, elapsed time = 2044 ms. CONTAINS (tags, 'éûnetà AND écñà')
The tag engine typically serves out queries at about 1ms per query. Our questions list page runs at an average of about 35ms per page render, over hundreds of thousands of runs a day.
-- Sam Saffron, "In Managed Code We Trust"
List<Posts>
Post[]
struct Post
Post.List<Tags>
Post.Tags[5]
Index = int[]
fixed
Just use map and reduce, right?
var results = new int[50];
var count = 0;
var relatedTags = new Dictionary<string, int>();
for (var i = 0; i < Index.length /* 10,000,000 */; i++)
{
if (matcher.Match(Posts[Index[i]]) {
count++;
if (count<51) {
results[count-1] = Posts[Index[i]];
}
}
for(int t = 0; t<5; t++) {
relatedTags[Posts[Index[i]].Tags[t]] =
relatedTags[Posts[Index[i]].Tags[t]] + 1;
}
}Just go parallel, right?
tag engine
CPU Cores
tag engine
CPU Cores
tag engine
CPU Cores
tag engine
CPU Cores
tag engine
CPU Cores
var results = new int[50];
var count = 0;
var relatedTags = new Dictionary<string, int>();
for (var i = 0; i < Index.length /* 10,000,000 */; i++)
{
if (matcher.Match(Posts[Index[i]]) {
count++;
if (count<51) {
results[count-1] = Posts[Index[i]];
}
}
for(int t = 0; t<5; t++) {
relatedTags[Posts[Index[i]].Tags[t]] =
relatedTags[Posts[Index[i]].Tags[t]] + 1;
}
}public class Matcher {
public bool NonClosedOnly { get; set; }
public bool NoAnswerOnly { get; set; }
public bool BountiesOnly { get; set; }
/* a few of these */
public static bool Match(Post) {
if (NonClosedOnly && Post.DateClosed != null) return false;
if (NoAnswerOnly && Post.Answers > 0) return false;
if (BountiesOnly && Post.BountyAmount == 0) return false;
/** a few of these **/
return true;
}
}public interface IMatcher { public bool Match(Post post) }
public class MatcherGenerator {
public bool NonClosedOnly { get; set; }
public bool NoAnswerOnly { get; set; }
public bool BountiesOnly { get; set; }
/* a few of these */
public Func<Post, bool> GenerateMatch() {
var il = IL.Implement(typeof(IMatcher).GetMethod(nameof(IMatcher.Match)));
Label exclude = il.DefineLabel();
if (NonClosedOnly) {
EmitMemberGet(il, nameof(QuestionValue.IsClosed));
il.Emit(OpCodes.Brtrue, exclude);
}
if (NoAnswersOnly) {
EmitMemberGet(il, nameof(QuestionValue.AnswerCount));
il.Emit(OpCodes.Brtrue, exclude);
}
if (BountiesOnly) {
EmitMemberComparison(il, nameof(QuestionValue.BountySize), null);
il.Emit(OpCodes.Brtrue, exclude);
}
/* a few of these */
il.EmitLabel(exclude); /* etc */
return il.CreateDelegate();
}
}Just approximate, right?
Many known cases where the top N results are not representative of the rest of the dataset
e.g. "Questions with bounties sorted by newest"
Preliminary results
Time from 1500ms down to 700 (worst cases), but many hits <100ms
tag engine v2
CPU Cores
tag engine v2
CPU Cores
tag engine v2
CPU Cores
tag engine v2 GPU
CPU Cores
GPU Cores
tag engine v2
CPU Cores
GPU Cores
Simple query
tag engine v2
CPU Cores
GPU Cores
Complex query
tag engine v2
CPU Cores
GPU Cores
Complex query
Simple query
http://blog.marcgravell.com/2016/05/how-i-found-cuda-or-rewriting-tag.html
Marc Gravell