Twitter's recommendation algorithm

  • Twitter published some version of their "For You" homepage algorithm last week
  • No model weights or data
  • Includes names of features, labels, aggregation of labels into scores, and (hard to interpret) ranking logic

Please note we have force-pushed a new initial commit in order to remove some publicly-available Twitter user information. Note that this process may need to be repeated in the future.

Open Sourcing "The Algorithm"

  • Notes from blog post for overall structure
  • Filling in details based on code in repository

Notable commits so far

  • Removing safety labels for Ukraine
  • Removing metrics that tracked Democrats, Republicans, and Elon (and power users)

How is my feed populated?

  1. Candidate generation
  2. Lightweight ranking to narrow pool
  3. "Heavy ranker" to sort
  4. Heuristics and filtering
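The four steps above can be sketched end to end; everything below (function shapes, stand-in scorers, fake tweet ids) is hypothetical, not Twitter's code:

```python
# Hypothetical sketch of the four-stage "For You" pipeline;
# all names and scoring rules are stand-ins, not Twitter's code.

def build_feed(user, candidates, light_rank, heavy_rank, heuristics,
               pool_size=1500, feed_size=50):
    # 1. Candidate generation happens upstream; `candidates` is its output.
    # 2. Lightweight ranking narrows the pool to ~1500 tweets.
    pool = sorted(candidates, key=lambda t: light_rank(user, t),
                  reverse=True)[:pool_size]
    # 3. The "heavy ranker" sorts the surviving candidates.
    ranked = sorted(pool, key=lambda t: heavy_rank(user, t), reverse=True)
    # 4. Heuristics and filtering produce the final feed.
    return heuristics(user, ranked)[:feed_size]

# Toy usage with stand-in scorers over fake tweet ids:
tweets = list(range(100))
feed = build_feed(
    user="me",
    candidates=tweets,
    light_rank=lambda u, t: -t,      # stand-in: lower ids score higher
    heavy_rank=lambda u, t: t % 7,   # arbitrary stand-in score
    heuristics=lambda u, ts: ts,     # identity: no filtering here
    feed_size=10,
)
```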

Overview

Tweet Sources

  • Tweets from users I follow (around 50% of feed)
  • Tweets from interaction network (around 15% of feed)
  • Tweets with similar embedding to me (around 35% of feed)
  • Always filtered to ensure no more than two degrees of separation

Tweet Sources

  • Tweets from users I follow (around 50% of timeline)
  • Tweets from interaction network (around 15% of timeline)
    • From real time graph between users and Tweets
    • Tweets that people I follow engaged with
    • Tweets engaged with by people who engage with similar Tweets as me
  • Tweets with similar embedding to me (around 35% of timeline)
    • e.g. SimClusters, which applies a matrix factorization algorithm to a large binary user/tweet matrix (based on likes?)
    • (not sure how it's used) TwHIN learns embeddings from graph data (follows, likes); an anonymized, smaller version has been released on HuggingFace

SimClusters

  • paper
  • Sparse and interpretable
  • Latent factors are communities, determined by a small number of major influencers
  • Scale: \(10^9\) users, \(10^{11}\) edges between them, \(10^8\) new tweets per day, \(10^9\) user engagements per day.
  • Representation space is \(10^5\): communities of around 100 users
  • Step 1: "known for", a one-hot assignment of top producers to communities
  • Step 2: "interested in", user embeddings obtained from step 1 via the follow graph
  • Step 3: producer embeddings derived from step 2
  • Online tweet embedding is the sum of the engaging users' "interested in" embeddings
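These steps can be illustrated with a toy numpy sketch (2 communities, 3 influencers, 4 users; every matrix here is invented, and the real system operates on sparse matrices at the scale above):

```python
import numpy as np

# Toy SimClusters sketch. All matrices are made-up illustrations.

# Step 1: "known for" is a one-hot assignment of influencers to communities.
known_for = np.array([[1, 0],    # influencer 0 -> community 0
                      [1, 0],    # influencer 1 -> community 0
                      [0, 1]])   # influencer 2 -> community 1

# Follow graph: rows = users, cols = influencers.
follows = np.array([[1, 1, 0],
                    [0, 1, 1],
                    [0, 0, 1],
                    [1, 0, 1]])

# Step 2: "interested in" embeds users by which communities
# the influencers they follow are known for.
interested_in = follows @ known_for

# Step 3: producer embeddings come from step 2 (e.g. aggregating the
# interested-in vectors of a producer's followers).

# Online, a tweet's embedding is the sum of the interested-in
# vectors of the users who engaged with it.
tweet_engagers = [0, 1]          # users who liked a tweet
tweet_embedding = interested_in[tweet_engagers].sum(axis=0)
```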

Lightweight model

  • Logistic regression model
  • Last trained/updated several years ago
    • still has the feature flag "is vine"
  • Largely unpersonalized: most features pertain to the tweet rather than to me
  • Predicts engagement: is_clicked, is_favorited, is_open_linked, is_photo_expanded, is_profile_clicked, is_replied, is_retweeted, is_video_playback_50
  • Score is weighted sum of predictions
  • Top ~1500 scoring Tweets go to next step

First stage

  • Logistic regression predicts relevance of tweets from candidate pool (slightly different features/models for in and out of network tweets)
    • last updated/trained several years ago
  • Real Graph predicts likelihood of engagement between two users
    • features about edge (timeseries interactions) and each user (activity and reputation)
    • unclear if this is used in the light ranker, or as a candidate generation step
  • Results in ~1500 tweets

Logistic regression

  • labels: is_clicked, is_favorited, is_open_linked, is_photo_expanded, is_profile_clicked, is_replied, is_retweeted, is_video_playback_50
  • highlights from features
    • weighted and decayed engagement (likes/replies)
    • text analysis (length, readability, offensiveness - static; toxicity - realtime)
    • encoded_tweet_features.has_vine_flag RIP
    • properties of tweet author, like "tweepcred" (pagerank for users as producers)
    • doesn't seem very personalized to the consumer beyond language/time
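The scoring described above can be sketched as one logistic regression per label, with the probabilities combined by a weighted sum; the label names come from the repo, but every weight and feature value here is invented:

```python
import math

# Label names from the repo; everything else in this sketch is made up.
LABELS = ["is_clicked", "is_favorited", "is_open_linked",
          "is_photo_expanded", "is_profile_clicked", "is_replied",
          "is_retweeted", "is_video_playback_50"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def light_score(features, per_label_weights, label_mix):
    """One logistic regression per label; score = weighted sum of probs.

    per_label_weights maps label -> (bias, weight vector);
    label_mix maps label -> its contribution to the final score.
    """
    score = 0.0
    for label in LABELS:
        bias, w = per_label_weights[label]
        z = bias + sum(wi * xi for wi, xi in zip(w, features))
        score += label_mix[label] * sigmoid(z)
    return score
```

The top ~1500 tweets by this score would move on to the heavy ranker.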

Heavy Ranker

  • Deep network (MaskNet), ~48M parameters
  • Thousands of features:
    • historical engagement between various relevant entities aggregated over various timescales
    • static properties of tweet (text, media)
    • embeddings of user, tweet (generated by link prediction)
  • Output: engagement probabilities

Second stage

  • Rank candidate tweets on basis of score (relevance)
  • ~48M parameter deep network outputs probabilities for ten engagement labels
    • "optimize for positive engagement (e.g. Likes, Retweets, and Replies)"
    • thousands of features

Heavy Ranker Inputs

  • Features aggregate over short and long timescales:
    • author features, tracking how their tweets are engaged with on different timescales
    • author-topic engagement on 50 day scale
    • user-list engagement on short and long scale
    • engagement by user by tweet property over different timescales
    • engagements between user and authors over different timescales
    • interactions between user and other engagers in the same tweets on long scale
    • user-topic engagement for inferred topics 50day
    • user-media interaction 50 day
    • interactions between the user and other tweets mentioning the users mentioned in the tweet in question? 50 day
    • aggregate user engagements over same day of week / hour of day 50d
    • user-topic engagements (by tweet property) on 50 day
    • aggregate engagement on the tweet on multiple timescales

Heavy Ranker Inputs

  • Features that aren't aggregates:
    • which two-hop relationship(s) describe the tweet author and user
    • network properties about author and user from realgraph
    • also about users mentioned in the tweet, in-network engagers, and upstream authors
    • tweet's features (engagements, "has media", "has question"), users device
    • features about the upstream tweet, if it's a reply
    • same features as light ranker
    • more realtime user-author interaction features (less of them?)
    • similarity of tweet to user's recent engaged tweets "sim_clusters"

    • other misc features about time, request context, author health

    • TwHIN embeddings: directional follow embeddings and an engagement embedding, each 200-dimensional

TwHIN Embeddings

  • Learned by link prediction with log likelihood of edge \(r\) between source \(s\) and target \(t\) parametrized as $$f(s,r,t) = (\theta_s+\theta_r)^\top \theta_t$$
  • Paper
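The scoring function above is cheap to evaluate; a toy numpy sketch with random vectors, using the 200 dimensions mentioned above:

```python
import numpy as np

def twhin_score(theta_s, theta_r, theta_t):
    """Log-likelihood score of an edge of type r from source s to target t:
    f(s, r, t) = (theta_s + theta_r) . theta_t"""
    return (theta_s + theta_r) @ theta_t

# Toy usage: random embeddings stand in for learned ones.
rng = np.random.default_rng(0)
d = 200  # embedding dimension, per the slide above
s, r, t = (rng.standard_normal(d) for _ in range(3))
score = twhin_score(s, r, t)
```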

Heavy Ranker Model

Categorical input fields are embedded, and numerical ones compressed, to \(k\) dimensions each

  • \([x_1,x_2,\dots,x_f] \to [e_1,\dots, e_f]\)
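A minimal sketch of that input layer, assuming an embedding table for categorical fields and a linear projection for numerical ones (all dimensions and values here are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 8  # hypothetical shared embedding width

# Categorical field: lookup into a (learned) embedding table.
vocab_size = 100
cat_table = rng.standard_normal((vocab_size, k))

def embed_categorical(value_id):
    return cat_table[value_id]

# Numerical field: compress a raw feature vector to k dims
# with a (learned) linear projection.
raw_dim = 32
proj = rng.standard_normal((raw_dim, k))

def embed_numerical(x):
    return x @ proj

# [x_1, ..., x_f] -> [e_1, ..., e_f]: every field becomes a k-vector.
fields = [embed_categorical(7), embed_numerical(rng.standard_normal(raw_dim))]
```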

Heavy Ranker Outputs

  • Outputs predicted probabilities for binary engagement events, which are weighted into a score and then ranked (later reranked)
    • favorite: 0.5
    • max(click and engage, click and linger 2min): 11
    • negative feedback (block, mute, "show less often"): -74
    • click to profile and engage: 12
    • reply: 27
    • reply with something the author engages with: 75
    • report: -369
    • retweet: 1
    • watch half a video: 0.005
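A sketch of the scoring step using the weights listed above (the key names are my own shorthand, and the probability values in the example are made up):

```python
# Weights as listed on the slide; key names are my shorthand.
WEIGHTS = {
    "favorite": 0.5,
    "good_click": 11.0,          # max(click & engage, click & linger 2min)
    "negative_feedback": -74.0,  # block, mute, "show less often"
    "profile_click_engage": 12.0,
    "reply": 27.0,
    "reply_engaged_by_author": 75.0,
    "report": -369.0,
    "retweet": 1.0,
    "video_playback_50": 0.005,
}

def heavy_score(probs):
    """Score = weighted sum of predicted engagement probabilities."""
    return sum(WEIGHTS[k] * p for k, p in probs.items())

# Made-up example: likely to be faved, tiny chance of being reported.
probs = {k: 0.0 for k in WEIGHTS}
probs["favorite"] = 0.6
probs["report"] = 0.001
```

With these made-up numbers, a 0.1% report probability already outweighs a 60% favorite probability (0.369 vs 0.3), which illustrates how dominant the large negative weights are.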


Heuristics

  • Remove Tweets based on my explicit filters, negative feedback
  • Prevent too many consecutive Tweets from same user
  • Balance in- and out-of-network tweets
  • If Tweet is a Reply, thread together with original Tweet
  • Include ads, Follow recommendations, and onboarding prompts

Heuristics

  • Remove Tweets based on my explicit filters
  • Prevent too many consecutive Tweets from same user
  • Balance in- and out-of-network tweets
    • is the out-of-network mix included because embeddings are more effective?
  • Lower score of tweets which have received negative feedback from me
  • Exclude Tweets without 2nd degree connection (follow or engagement by someone you follow)
  • If Tweet is a Reply, thread together with original Tweet
  • Update Tweets which have been edited
  • Include ads, Follow recommendations, and onboarding prompts
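As an example, the "too many consecutive Tweets from the same user" rule can be sketched as a greedy rerank pass (this particular rule and its details are my invention, not Twitter's actual logic):

```python
def limit_consecutive_authors(tweets, max_run=1):
    """Reorder so at most `max_run` consecutive tweets share an author.

    `tweets` is a list of (author, tweet_id) pairs, already score-ordered.
    Greedy sketch: defer any tweet that would extend a same-author run.
    """
    out, pending = [], list(tweets)
    while pending:
        for i, (author, tid) in enumerate(pending):
            run = out[-max_run:]
            if len(run) < max_run or any(a != author for a, _ in run):
                out.append(pending.pop(i))
                break
        else:
            # Only same-author tweets remain; append them as-is.
            out.extend(pending)
            pending = []
    return out
```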


Finally

  • Show the Tweets in order, mixed with ads, Follow recommendations, and onboarding prompts
  • The pipeline above runs approximately 5 billion times per day and completes in under 1.5 seconds on average. A single pipeline execution requires 220 seconds of CPU time, nearly 150x the latency you perceive on the app.
  • Cory Doctorow: Willing speakers should reach willing listeners.

Facebook, Twitter, TikTok, YouTube and other dominant social media platforms treat the list of people we follow as suggestions, not commands. When you identify a list of people you want to hear from, the platform uses that as training data for suggestions that only incidentally contain the messages from the people you subscribed to.

End to End Principle

Twitter's recommendation algorithm

By Sarah Dean
