How I started with Tumblr mappings and ended up writing a parser for a custom query language
Tumblr and
Game demand
- Initial catalog: 5546 games
- Scraped from Wiki by the data science team
- Contains game titles and wikibase IDs
Is Tumblr a suitable data source for games?
Tumblr V2 pipeline
Tumblr
S3
Posts + Likes
Dictionary post_id to parrot_id
ElasticSearch Posts
SIT
data
Mappings
First step —
use titles as a search term
- ~400k posts about games per day
- Coverage: 3986 of 5546 (~72%)
Top games by posts per day
+-----------------------+-------------------+
|game_title |posts_cnt_daily_avg|
+-----------------------+-------------------+
|D |10000 |
|Baldur's Gate |10000 |
|Forced |10000 |
|Hearts |10000 |
|Sky |10000 |
|Air |10000 |
|LoveR |10000 |
|Spider |10000 |
|Journey |10000 |
|Blood |10000 |
|Snake |9723 |
|Stray |9499 |
|720° |8781 |
|SiN |8125 |
|Bless |7893 |
|Hatred |7745 |
|The Forest |7119 |
+-----------------------+-------------------+
We need better mappings!
Improving mappings
V2 approach: phrase search
{
"query": {
"multi_match": {
"query": "Resident Evil",
"fields": [
"g", // tags
"t", // title
"h", // caption
"b" // body
],
"type": "phrase"
}
}
}
Improving mappings
V1 approach: tag search
{
"query": {
"term": {
"g.array": {
"value": "Blood game",
"case_insensitive": true
}
}
}
}
Combining queries with "must"
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "Air",
"fields": ["g","t","h","b"],
"type": "phrase"
}
},
{
"multi_match": {
"query": "visual novel",
"fields": ["g","t","h","b"],
"type": "phrase"
}
}
]
}
}
}
Works as AND
Combining queries with "should"
{
"query": {
"bool": {
"should": [
{
"term": {
"g.array": {
"value": "blood game",
"case_insensitive": true
}
}
},
{
"term": {
"g.array": {
"value": "caleb blood",
"case_insensitive": true
}
}
}
],
"minimum_should_match": "1"
}
}
}
Works as OR
Hard cases
Phrase "Obscure game"
and (
Phrase "Josh Carter"
or Phrase "Stan Jones"
or Phrase "Kenny Matthew"
or Phrase "Shannon Matthews"
or Phrase "DreamCatcher Interactive"
or Phrase "Hydravision Entertainment"
or Phrase "MC2 Microids"
)
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "Obscure game",
...
}
},
{
"bool": {
"should": [
{
"multi_match": {
"query": "Josh Carter",
...
}
},
{
"multi_match": {
"query": "Stan Jones",
...
}
},
...
],
"minimum_should_match": "1"
}
}
]
}
}
}
{
"track_total_hits": true,
"size": 100000,
"_source": {
"includes": ["i","b","g","h","r","t"]
},
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "Obscure game",
"fields": ["g","t","h","b"],
"type": "phrase"
}
},
{
"bool": {
"should": [
{
"multi_match": {
"query": "Josh Carter",
"fields": ["g","t","h","b"],
"type": "phrase"
}
},
{
"multi_match": {
"query": "Stan Jones",
"fields": ["g","t","h","b"],
"type": "phrase"
}
},
{
"multi_match": {
"query": "Kenny Matthew",
"fields": ["g","t","h","b"],
"type": "phrase"
}
},
{
"multi_match": {
"query": "Shannon Matthews",
"fields": ["g","t","h","b"],
"type": "phrase"
}
},
{
"multi_match": {
"query": "DreamCatcher Interactive",
"fields": ["g","t","h","b"],
"type": "phrase"
}
},
{
"multi_match": {
"query": "Hydravision Entertainment",
"fields": ["g","t","h","b"],
"type": "phrase"
}
},
{
"multi_match": {
"query": "MC2 Microids",
"fields": ["g","t","h","b"],
"type": "phrase"
}
}
],
"minimum_should_match": "1"
}
}
]
}
}
}
Phrase "Obscure game"
and (
Phrase "Josh Carter"
or Phrase "Stan Jones"
or Phrase "Kenny Matthew"
or Phrase "Shannon Matthews"
or Phrase "DreamCatcher Interactive"
or Phrase "Hydravision Entertainment"
or Phrase "MC2 Microids"
)
1. Phrase
2. Tag
3. And
4. Or
5. ()
What we really need from ElasticSearch
p"value"
t"tag"
t"foo" and p"bar"
t"foo" or p"bar"
t"foo" and (p"bar" or t"baz")
Custom query language!
Working with a custom query language
- We need to convert queries to an actual ElasticSearch queries
- To do that we need to parse query expressions
- For parsing we need an intermediate representation as a data structure: AST (abstract syntax tree)
- Converting from AST to query is relatively simple
- Parsing string to AST — not simple
AST
interface TumblrMappingExpr {
@Data
class Phrase implements TumblrMappingExpr {
String value;
}
@Data
class Tag implements TumblrMappingExpr {
String value;
}
@Data
class And implements TumblrMappingExpr {
List<TumblrMappingExpr> expressions;
}
@Data
class Or implements TumblrMappingExpr {
List<TumblrMappingExpr> expressions;
}
}
sealed interface TumblrMappingExpr {
record Phrase(String value) implements TumblrMappingExpr {}
record Tag(String value) implements TumblrMappingExpr {}
record And(List<TumblrMappingExpr> expressions) implements TumblrMappingExpr {}
record Or(List<TumblrMappingExpr> expressions) implements TumblrMappingExpr {}
}
Java 21
sealed trait TumblrMappingExpr
object TumblrMappingExpr {
case class Phrase(value: String) extends TumblrMappingExpr
case class Tag(value: String) extends TumblrMappingExpr
case class And(expressions: List[TumblrMappingExpr]) extends TumblrMappingExpr
case class Or(expressions: List[TumblrMappingExpr]) extends TumblrMappingExpr
}
Scala 2
enum TumblrMappingExpr {
case Phrase(value: String)
case Tag(value: String)
case And(expressions: List[TumblrMappingExpr])
case Or(expressions: List[TumblrMappingExpr])
}
Scala 3
class Phrase {
value: string;
}
class Tag {
value: string;
}
class And {
expressions: TumblrMappingExpr[];
}
class Or {
expressions: TumblrMappingExpr[];
}
type TumblrMappingExpr = Phrase | Tag | And | Or;
TypeScript
Parsing
Chomsky hierarchy for grammars
Type 0: Unrestricted Grammar — the most wide and complex
Type 1: Context-Sensitive Grammar — complex grammar, parser should have a state
Type 2: Context-Free Grammar — relatively simple grammars
Type 3: Regular Grammar — the simplest case, can be parsed with regexps
Parsing
Searching for solution
Parsing
1. Manually written parser
2. Regexps
Parsing
TumblrMappingExprGrammar
TumblrMappingExprParserTest
Parsing
Mappings examples
p"bless online"
or p"bless unleashed"
or (p"bless" and (p"mmo" or p"rpg" or p"mmorpg"))
(p"Spider-Man" or p"Spider Man")
and (p"video game" or p"game")
and p"2018"
t"Dante's Inferno game" or (
p"Dante's Inferno" and (
p"video game"
or p"Visceral Games"
or p"EA"
or p"Electronic Arts"
or p"xbox"
or p"PlayStation"
or (p"game" and p"2010")
)
)
Mapping to ElasticSearch query
TumblrMappingExprToESquery
Intermediate results with new mappings
- I generated mappings for 4572 games out of 5546 (~82.8%)
- Posts about games per day: ~115k. Was ~420k.
- Coverage: 2480 of 4565 (~54.3%). Was ~72%.
Intermediate results with new mappings
+--------------+------+
|game_title |posts |
+--------------+------+
|D |10000 |
|Baldur's Gate |10000 |
|Forced |10000 |
|Hearts |10000 |
|Sky |10000 |
|Air |10000 |
|LoveR |10000 |
|Spider |10000 |
|Journey |10000 |
|Blood |10000 |
|Snake |9723 |
|Stray |9499 |
|720° |8781 |
|SiN |8125 |
|Bless |7893 |
|Hatred |7745 |
|The Forest |7119 |
+--------------+------+
+-------------------+------+
|game_title |posts |
+-------------------+------+
|Baldur's Gate III |21429 |
|Minecraft |6729 |
|Pikmin |5168 |
|Genshin Impact |5036 |
|Undertale |4472 |
|Splatoon |3936 |
|Disco Elysium |3569 |
|The Sims |3520 |
|Deltarune |3241 |
|Animal Crossing |2990 |
|Elden Ring |2440 |
|Stardew Valley |1943 |
|Braid |1873 |
|The Last of Us |1798 |
|Fire Emblem |1349 |
|The Legend of Zelda|1347 |
|Honkai: Star Rail |1175 |
+-------------------+------+
Conclusions
- I managed to find a way to create mappings, that:
- Covers V1 cases (tag search)
- Covers V2 cases (phrase search)
- Covers even more complex cases
- Mappings itself are quite simple, intuitive, and easy to write and read.
Conclusions
- Not every day you have an opportunity to work on a custom query language.
- It was a cool, interesting, and technically difficult task.
- Thanks to a modern tooling (ChatGPT, Parser Combinators libraries) it was relatively simple.
- Cool that Parrot provides an opportunity to work on a such task.
Thank you!
Tumbr mappings
By Yury Badalyants
Tumbr mappings
- 100