Twitter's recommendation algorithm

  • Twitter published some version of their "For You" homepage algorithm last week
  • No model weights or data
  • Includes names of features, labels, aggregation of labels into scores, and (hard to interpret) ranking logic

Please note we have force-pushed a new initial commit in order to remove some publicly-available Twitter user information. Note that this process may need to be repeated in the future.

Open Sourcing "The Algorithm"

  • Notes from blog post for overall structure
  • Filling in details based on code in repository

Notable commits so far

  • Removing safety labels for Ukraine
  • Removing metrics that tracked Democrats, Republicans, and Elon (and power users)

How is my feed populated?

  1. Candidate generation
  2. Lightweight ranking to narrow pool
  3. "Heavy ranker" to sort
  4. Heuristics and filtering
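The four steps above can be sketched end to end; everything below (function shapes, stand-in scorers, fake tweet ids) is hypothetical, not Twitter's code:

```python
# Hypothetical sketch of the four-stage "For You" pipeline;
# all names and scoring rules are stand-ins, not Twitter's code.

def build_feed(user, candidates, light_rank, heavy_rank, heuristics,
               pool_size=1500, feed_size=50):
    # 1. Candidate generation happens upstream; `candidates` is its output.
    # 2. Lightweight ranking narrows the pool to ~1500 tweets.
    pool = sorted(candidates, key=lambda t: light_rank(user, t),
                  reverse=True)[:pool_size]
    # 3. The "heavy ranker" sorts the surviving candidates.
    ranked = sorted(pool, key=lambda t: heavy_rank(user, t), reverse=True)
    # 4. Heuristics and filtering produce the final feed.
    return heuristics(user, ranked)[:feed_size]

# Toy usage with stand-in scorers over fake tweet ids:
tweets = list(range(100))
feed = build_feed(
    user="me",
    candidates=tweets,
    light_rank=lambda u, t: -t,      # stand-in: lower ids score higher
    heavy_rank=lambda u, t: t % 7,   # arbitrary stand-in score
    heuristics=lambda u, ts: ts,     # identity: no filtering here
    feed_size=10,
)
```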

Overview

Tweet Sources

  • Tweets from users I follow (around 50% of feed)
  • Tweets from interaction network (around 15% of feed)
  • Tweets with similar embedding to me (around 35% of feed)
  • Always filtered to ensure no more than two degrees of separation

Tweet Sources

  • Tweets from users I follow (around 50% of timeline)
  • Tweets from interaction network (around 15% of timeline)
    • From real time graph between users and Tweets
    • Tweets that people I follow engaged with
    • Tweets engaged with by people who engage with similar Tweets as me
  • Tweets with similar embedding to me (around 35% of timeline)
    • e.g. SimClusters, which applies a matrix factorization algorithm to a large binary user/tweet matrix (based on likes?)
    • (not sure how it's used) TwHIN learns embeddings from graph data (follows, likes); an anonymized, smaller version has been released on HuggingFace

SimClusters

  • paper
  • Sparse and interpretable
  • Latent factors are communities, determined by a small number of major influencers
  • Scale: \(10^9\) users, \(10^{11}\) edges between them, \(10^8\) new tweets per day, \(10^9\) user engagements per day.
  • Representation space is \(10^5\): communities of around 100 users
  • Step 1: "known for", a one-hot assignment of top producers to communities
  • Step 2: "interested in", user embeddings obtained from step 1 via the follow graph
  • Step 3: producer embeddings derived from step 2
  • Online tweet embedding is the sum of the engaging users' "interested in" embeddings
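These steps can be illustrated with a toy numpy sketch (2 communities, 3 influencers, 4 users; every matrix here is invented, and the real system operates on sparse matrices at the scale above):

```python
import numpy as np

# Toy SimClusters sketch. All matrices are made-up illustrations.

# Step 1: "known for" is a one-hot assignment of influencers to communities.
known_for = np.array([[1, 0],    # influencer 0 -> community 0
                      [1, 0],    # influencer 1 -> community 0
                      [0, 1]])   # influencer 2 -> community 1

# Follow graph: rows = users, cols = influencers.
follows = np.array([[1, 1, 0],
                    [0, 1, 1],
                    [0, 0, 1],
                    [1, 0, 1]])

# Step 2: "interested in" embeds users by which communities
# the influencers they follow are known for.
interested_in = follows @ known_for

# Step 3: producer embeddings come from step 2 (e.g. aggregating the
# interested-in vectors of a producer's followers).

# Online, a tweet's embedding is the sum of the interested-in
# vectors of the users who engaged with it.
tweet_engagers = [0, 1]          # users who liked a tweet
tweet_embedding = interested_in[tweet_engagers].sum(axis=0)
```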

Lightweight model

  • Logistic regression model
  • Last trained/updated several years ago
    • still has the feature flag "is vine"
  • Largely unpersonalized: most features pertain to the tweet rather than to me
  • Predicts engagement: is_clicked, is_favorited, is_open_linked, is_photo_expanded, is_profile_clicked, is_replied, is_retweeted, is_video_playback_50
  • Score is weighted sum of predictions
  • Top ~1500 scoring Tweets go to next step

First stage

  • Logistic regression predicts relevance of tweets from candidate pool (slightly different features/models for in and out of network tweets)
    • last updated/trained several years ago
  • Real Graph predicts likelihood of engagement between two users
    • features about edge (timeseries interactions) and each user (activity and reputation)
    • unclear if this is used in the light ranker, or as a candidate generation step
  • Results in ~1500 tweets

Logistic regression

  • labels: is_clicked, is_favorited, is_open_linked, is_photo_expanded, is_profile_clicked, is_replied, is_retweeted, is_video_playback_50
  • highlights from features
    • weighted and decayed engagement (likes/replies)
    • text analysis (length, readability, offensiveness - static; toxicity - realtime)
    • encoded_tweet_features.has_vine_flag RIP
    • properties of tweet author, like "tweepcred" (pagerank for users as producers)
    • doesn't seem very personalized to the consumer beyond language/time
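The scoring described above can be sketched as one logistic regression per label, with the probabilities combined by a weighted sum; the label names come from the repo, but every weight and feature value here is invented:

```python
import math

# Label names from the repo; everything else in this sketch is made up.
LABELS = ["is_clicked", "is_favorited", "is_open_linked",
          "is_photo_expanded", "is_profile_clicked", "is_replied",
          "is_retweeted", "is_video_playback_50"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def light_score(features, per_label_weights, label_mix):
    """One logistic regression per label; score = weighted sum of probs.

    per_label_weights maps label -> (bias, weight vector);
    label_mix maps label -> its contribution to the final score.
    """
    score = 0.0
    for label in LABELS:
        bias, w = per_label_weights[label]
        z = bias + sum(wi * xi for wi, xi in zip(w, features))
        score += label_mix[label] * sigmoid(z)
    return score
```

The top ~1500 tweets by this score would move on to the heavy ranker.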

Heavy Ranker

  • Deep network (MaskNet), ~48M parameters
  • Thousands of features:
    • historical engagement between various relevant entities aggregated over various timescales
    • static properties of tweet (text, media)
    • embeddings of user, tweet (generated by link prediction)
  • Output: engagement probabilities

Second stage

  • Rank candidate tweets on basis of score (relevance)
  • ~48M parameter deep network outputs probabilities for ten engagement labels
    • "optimize for positive engagement (e.g. Likes, Retweets, and Replies)"
    • thousands of features

Heavy Ranker Inputs

  • Features aggregate over short and long timescales:
    • author features, tracking how their tweets are engaged with on different timescales
    • author-topic engagement on 50 day scale
    • user-list engagement on short and long scale
    • engagement by user by tweet property over different timescales
    • engagements between user and authors over different timescales
    • interactions between user and other engagers in the same tweets on long scale
    • user-topic engagement for inferred topics 50day
    • user-media interaction 50 day
    • interactions between the user and other tweets mentioning the users mentioned in the tweet in question? 50 day
    • aggregate user engagements over same day of week / hour of day 50d
    • user-topic engagements (by tweet property) on 50 day
    • aggregate engagement on the tweet on multiple timescales

Heavy Ranker Inputs

  • Features that aren't aggregates:
    • which two-hop relationship(s) describe the tweet author and user
    • network properties about author and user from realgraph
    • also about users mentioned in the tweet, in-network engagers, and upstream authors
    • tweet's features (engagements, "has media", "has question"), users device
    • features about the upstream tweet, if it's a reply
    • same features as light ranker
    • more realtime user-author interaction features (less of them?)
    • similarity of tweet to user's recent engaged tweets "sim_clusters"

    • other misc features about time, request context, author health

    • TwHIN embeddings: directional follow embeddings and an engagement embedding, each 200-dimensional

TwHIN Embeddings

  • Learned by link prediction with log likelihood of edge \(r\) between source \(s\) and target \(t\) parametrized as $$f(s,r,t) = (\theta_s+\theta_r)^\top \theta_t$$
  • Paper
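The scoring function above is cheap to evaluate; a toy numpy sketch with random vectors, using the 200 dimensions mentioned above:

```python
import numpy as np

def twhin_score(theta_s, theta_r, theta_t):
    """Log-likelihood score of an edge of type r from source s to target t:
    f(s, r, t) = (theta_s + theta_r) . theta_t"""
    return (theta_s + theta_r) @ theta_t

# Toy usage: random embeddings stand in for learned ones.
rng = np.random.default_rng(0)
d = 200  # embedding dimension, per the slide above
s, r, t = (rng.standard_normal(d) for _ in range(3))
score = twhin_score(s, r, t)
```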

Heavy Ranker Model

Categorical input fields are embedded, and numerical ones compressed, to \(k\) dimensions each

  • \([x_1,x_2,\dots,x_f] \to [e_1,\dots, e_f]\)
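A minimal sketch of that input layer, assuming an embedding table for categorical fields and a linear projection for numerical ones (all dimensions and values here are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 8  # hypothetical shared embedding width

# Categorical field: lookup into a (learned) embedding table.
vocab_size = 100
cat_table = rng.standard_normal((vocab_size, k))

def embed_categorical(value_id):
    return cat_table[value_id]

# Numerical field: compress a raw feature vector to k dims
# with a (learned) linear projection.
raw_dim = 32
proj = rng.standard_normal((raw_dim, k))

def embed_numerical(x):
    return x @ proj

# [x_1, ..., x_f] -> [e_1, ..., e_f]: every field becomes a k-vector.
fields = [embed_categorical(7), embed_numerical(rng.standard_normal(raw_dim))]
```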

Heavy Ranker Outputs

  • Outputs predicted probabilities for binary engagement events, which are weighted into a score and then ranked (later reranked)
    • favorite: 0.5
    • max(click and engage, click and linger 2min): 11
    • negative feedback (block, mute, "show less often"): -74
    • click to profile and engage: 12
    • reply: 27
    • reply with something the author engages with: 75
    • report: -369
    • retweet: 1
    • watch half a video: 0.005
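A sketch of the scoring step using the weights listed above (the key names are my own shorthand, and the probability values in the example are made up):

```python
# Weights as listed on the slide; key names are my shorthand.
WEIGHTS = {
    "favorite": 0.5,
    "good_click": 11.0,          # max(click & engage, click & linger 2min)
    "negative_feedback": -74.0,  # block, mute, "show less often"
    "profile_click_engage": 12.0,
    "reply": 27.0,
    "reply_engaged_by_author": 75.0,
    "report": -369.0,
    "retweet": 1.0,
    "video_playback_50": 0.005,
}

def heavy_score(probs):
    """Score = weighted sum of predicted engagement probabilities."""
    return sum(WEIGHTS[k] * p for k, p in probs.items())

# Made-up example: likely to be faved, tiny chance of being reported.
probs = {k: 0.0 for k in WEIGHTS}
probs["favorite"] = 0.6
probs["report"] = 0.001
```

With these made-up numbers, a 0.1% report probability already outweighs a 60% favorite probability (0.369 vs 0.3), which illustrates how dominant the large negative weights are.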


Heuristics

  • Remove Tweets based on my explicit filters, negative feedback
  • Prevent too many consecutive Tweets from same user
  • Balance in- and out-of-network tweets
  • If Tweet is a Reply, thread together with original Tweet
  • Include ads, Follow recommendations, and onboarding prompts

Heuristics

  • Remove Tweets based on my explicit filters
  • Prevent too many consecutive Tweets from same user
  • Balance in- and out-of-network tweets
    • is the out-of-network mix included because embeddings are more effective?
  • Lower score of tweets which have received negative feedback from me
  • Exclude Tweets without 2nd degree connection (follow or engagement by someone you follow)
  • If Tweet is a Reply, thread together with original Tweet
  • Update Tweets which have been edited
  • Include ads, Follow recommendations, and onboarding prompts
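As an example, the "too many consecutive Tweets from the same user" rule can be sketched as a greedy rerank pass (this particular rule and its details are my invention, not Twitter's actual logic):

```python
def limit_consecutive_authors(tweets, max_run=1):
    """Reorder so at most `max_run` consecutive tweets share an author.

    `tweets` is a list of (author, tweet_id) pairs, already score-ordered.
    Greedy sketch: defer any tweet that would extend a same-author run.
    """
    out, pending = [], list(tweets)
    while pending:
        for i, (author, tid) in enumerate(pending):
            run = out[-max_run:]
            if len(run) < max_run or any(a != author for a, _ in run):
                out.append(pending.pop(i))
                break
        else:
            # Only same-author tweets remain; append them as-is.
            out.extend(pending)
            pending = []
    return out
```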


Finally

  • Show the Tweets in order, mixed with ads, Follow recommendations, and onboarding prompts
  • The pipeline above runs approximately 5 billion times per day and completes in under 1.5 seconds on average. A single pipeline execution requires 220 seconds of CPU time, nearly 150x the latency you perceive on the app.
  • Cory Doctorow: Willing speakers should reach willing listeners.

Facebook, Twitter, TikTok, YouTube and other dominant social media platforms treat the list of people we follow as suggestions, not commands. When you identify a list of people you want to hear from, the platform uses that as training data for suggestions that only incidentally contain the messages from the people you subscribed to.

End to End Principle

Twitter's recommendation algorithm

By Sarah Dean
