How We Matched You to Your CodeLabs Team

Julie Cover, SRND

1. The Problem, and Others Like It 

2. My Solution + Algorithm Design 

3. Lessons from Another Iteration

The Problem, and Others Like It

Phase 1: Priorities and Recommendations

Phase 2: Placement

from copy import deepcopy

# Pass 1: projects whose open seats exactly equal their first-choice votes.
for project_id, project in all_project_data.items():
    if project["proj_size_remaining"] == project["num_first_choice"]:
        all_project_data, student_placements = place_students_of_choice(
            all_project_data, student_placements,
            project_id, 1, project["proj_size_remaining"])

# Pass 2: projects with at least as many open seats as first-choice votes.
for project_id, project in all_project_data.items():
    if project["proj_size_remaining"] >= project["num_first_choice"]:
        all_project_data, student_placements = place_students_of_choice(
            all_project_data, student_placements,
            project_id, 1, project["proj_size_remaining"])

# Pass 3: balanced placement with the [2, 15] choice argument.
# Iterate over a snapshot so the live data can change during the loop.
_all_project_data = deepcopy(all_project_data)
for project_id, project in _all_project_data.items():
    if project["proj_size_remaining"] >= project["num_first_choice"]:
        all_project_data, student_placements = place_students_of_choice_balanced(
            all_project_data, student_placements,
            project_id, [2, 15], project["proj_size_remaining"])

# Pass 4: every project, first and second choices.
_all_project_data = deepcopy(all_project_data)
for project_id, project in _all_project_data.items():
    all_project_data, student_placements = place_students_of_choice_balanced(
        all_project_data, student_placements,
        project_id, [1, 2], project["proj_size_remaining"])

# Pass 5: remaining choices, four ranks at a time (3-6, 7-10, 11-14, 15-18).
for i in range(3, 16, 4):
    _all_project_data = deepcopy(all_project_data)
    for project_id, project in _all_project_data.items():
        if project["proj_size_remaining"] >= project["num_first_choice"]:
            all_project_data, student_placements = place_students_of_choice_balanced(
                all_project_data, student_placements,
                project_id, [i, i + 1, i + 2, i + 3],
                project["proj_size_remaining"])

Stable Roommates Problem

  • Does not require preferences on all members
  • Only one group

Stable Marriage Problem

Solved by the Gale–Shapley algorithm
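For reference, here is a minimal sketch of Gale–Shapley for the two-sided case. It is not the matcher used for CodeLabs; the function name and the sample students and projects are made up for illustration, and it assumes both sides have the same size and rank everyone on the other side.

```python
def gale_shapley(proposer_prefs, receiver_prefs):
    """Return a stable matching as {receiver: proposer}.

    Assumes equal-sized sides and complete preference lists (illustrative only).
    """
    # rank[r][p] is the position of proposer p in receiver r's list (lower is better).
    rank = {
        r: {p: i for i, p in enumerate(prefs)}
        for r, prefs in receiver_prefs.items()
    }
    next_choice = {p: 0 for p in proposer_prefs}  # next list index to propose to
    engaged = {}                                  # receiver -> proposer
    free = list(proposer_prefs)

    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged.get(r)
        if current is None:
            engaged[r] = p                 # receiver was unmatched
        elif rank[r][p] < rank[r][current]:
            engaged[r] = p                 # receiver trades up
            free.append(current)
        else:
            free.append(p)                 # rejected; p tries the next choice
    return engaged


students = {"ana": ["web", "ml"], "ben": ["web", "ml"]}
projects = {"web": ["ben", "ana"], "ml": ["ana", "ben"]}
matching = gale_shapley(students, projects)
```

Both students want "web", and "web" prefers ben, so ana ends up on "ml". Real CodeLabs projects take several students and students do not rank every project, which is part of why the textbook algorithm does not apply directly.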

My Solution

(with a side of algorithmic design tips)

0. Data Collection

1. Elastic and Suggestions

2. APIs n' Stuff

3. Placement Algorithm

4. Manual Verification

Phase 0: Data

const mentorSchema = {
    mentor_id: "",
    name: "",
    company: "",
    bio: "",
    backgroundRural: true,
    preferStudentUnderRep: 2, // 0-2
    okExtended: true,
    timezone: -7,
    preferToolExistingKnowledge: true,
    proj_id: "",
    proj_description: "",
    proj_tags: [""],
    studentsSelected: 2,
};
const studentSchema = {
    id: "",
    name: "",
    rural: false,
    underrepresented: false,
    requireExtended: true,
    timezone: -3,
    interestCompanies: [""],
    interestTags: [""],
    beginner: true,
};

Phase 1: Matching w/ Elastic

Elastic is great, but also the worst

  1. Docs are from another era
  2. Python tooling is hit or miss
    • Elasticsearch-DSL
    • Q() Syntax

from elasticsearch_dsl import Q, SF

# Lines 56-74
company_q = None
for company in student["interestCompanies"]:
    # One fuzzy company match, scored at a flat company_score.
    company_match = Q(
        "function_score",
        query=Q("fuzzy", company=company),
        weight=company_score,
        boost_mode="replace",
    )
    # OR the per-company queries together.
    company_q = company_match if company_q is None else company_q | company_match

# 111-113
combined_query = Q(
    "function_score",
    query=combined_query,
    # functions takes a list of score functions, even when there is only one.
    functions=[
        SF(
            "gauss",
            numStudentsSelected={"origin": 0, "scale": 3, "offset": 3, "decay": 0.50},
        )
    ],
)
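To see what the gauss function does to a project's score, here is a small sketch of the decay formula as I understand it from the Elasticsearch function_score documentation. The function name is hypothetical; the defaults are the parameters from the query above.

```python
import math


def gauss_decay(value, origin=0.0, scale=3.0, offset=3.0, decay=0.5):
    """Approximate Elasticsearch's gauss decay multiplier for one field value."""
    # Values within `offset` of the origin are not penalized at all.
    distance = max(0.0, abs(value - origin) - offset)
    # sigma^2 is chosen so the multiplier equals `decay` at distance `scale`.
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))


# Multiplier for projects already picked by 0..8 students.
multipliers = {n: round(gauss_decay(n), 3) for n in range(9)}
```

With these parameters a project picked by up to 3 students keeps its full score, and one picked by 6 students is scored at half, which nudges recommendations toward less popular projects.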

    combined_query = Q(
        "function_score",
        query=combined_query,
        functions=[
            SF(
                "script_score",
                script={
                    "source": """
    int student_tz = params.student_tz;
    int mentor_tz = 0;
    // Null check. Even though timezone is required, somehow some null rows snuck in and bamboozled me
    if (doc['timezone'].size() == 0) {
        mentor_tz = 0;
    } else {
        mentor_tz = (int)doc['timezone'].value;
    }
    int diff = (int)Math.abs(student_tz - mentor_tz);

    boolean mentor_ok_tz_diff = false;
    if (doc['okTimezoneDifference'].size() == 0) {
        mentor_ok_tz_diff = false;
    } else {
        mentor_ok_tz_diff = doc['okTimezoneDifference'].value;
    }

    if (mentor_ok_tz_diff == true) {
        if (student_tz > 0) {
            // Mentor is OK with the time difference and student has a large time difference
            return 1;
        } else {
            // Mentor is ok with time difference and student has a normal time
            return 0.75;
        }
    } else {
        if (diff <= 2) {
            // Mentor is not ok with time difference and student has normal time
            return 1;
        } else if (diff == 3) {
            return 0.75;
        } else {
            // Mentor is not ok with time difference and student has weird time
            return 0;
        }
    }
    """,
                    "params": {"student_tz": student["timezone"]},
                },
            )
        ],
        boost_mode="multiply",
        score_mode="sum",
    )
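One way to take some of the pain out of debugging a script like this is to mirror it in plain Python and test that copy. The sketch below is a line-by-line port of the Painless logic above, including the `student_tz > 0` check exactly as written; the function name is hypothetical and it is not part of the original project.

```python
def timezone_score(student_tz, mentor_tz=None, mentor_ok_tz_diff=None):
    """Python mirror of the Painless timezone script, for testing only."""
    # Missing mentor fields fall back to 0 / False, like the null checks.
    mentor_tz = 0 if mentor_tz is None else int(mentor_tz)
    ok_with_difference = bool(mentor_ok_tz_diff)
    diff = abs(student_tz - mentor_tz)

    if ok_with_difference:
        # Same condition as the script: it looks at student_tz, not diff.
        return 1.0 if student_tz > 0 else 0.75
    if diff <= 2:
        return 1.0
    if diff == 3:
        return 0.75
    return 0.0


# Student at UTC-4, mentor at UTC-7 who is not OK with a time difference.
example = timezone_score(-4, mentor_tz=-7, mentor_ok_tz_diff=False)
```

A copy like this can be unit-tested quickly, and its results compared against the scores Elasticsearch returns for a handful of known mentors.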

  1. Painless Scripts
  2. Code in strings
  3. Lack of Readability
  4. Debugging is a disaster
  5. Breaks the explain function


Elastic - What's Your Point?

  • Elastic is incredibly powerful
    • Solves many unique problems
    • Queries themselves are almost language-agnostic
  • Complexity is inevitable
  • Bad docs and language APIs aren't
  • ... probably just buy the book

Phase 2: APIs n' Stuff

It's super easy!

Great for APIs!

# Excerpt 1: the route. Imports are inferred from the names used below
# (PyJWT, Werkzeug); app, evaluate_score, and elastic_host live elsewhere.
import json

from flask import current_app
from jwt import decode, exceptions
from werkzeug.exceptions import Unauthorized


@app.route("/matches/<student_data>", methods=["GET"])
def matches(student_data):
    try:
        data = decode(student_data, current_app.jwt_key, algorithms=["HS256"])
    except exceptions.DecodeError:
        raise Unauthorized("Something is wrong with your JWT Encoding.")
    ela_resp = evaluate_score(data, current_app.elasticsearch, 25)
    resp = [
        {"score": hit._score, "project": hit._source.to_dict()}
        for hit in ela_resp.hits.hits
    ]
    return json.dumps(resp)


# Excerpt 2: this is needed to allow the other libraries to import database,
# as Python doesn't check in the parent directory otherwise.
import inspect
import os
import sys

currentdir = os.path.dirname(os.path.abspath(
    inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0, parentdir)

# Excerpt 3: store any object on the app object!
app.elasticsearch = Elasticsearch(elastic_host)
app.jwt_key = os.getenv("JWT_KEY")

# Excerpt 4: ...and read it back from any other module.
from flask import current_app
print(current_app.jwt_key)

# Sample GET request input
import uuid

request_data = {
    "id": str(uuid.uuid4()),
    "name": "John Peter",
    "rural": True,
    "underrepresented": False,
    "timezone": -4,
    "interestCompanies": ["Microsoft", "Google"],
    "interestTags": ["Backend", "Data", "python", "php"],
    "requireExtended": False,
    "track": "Advanced",
}
# A payload like this, signed as an HS256 JWT, is what goes in the URL:
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpZCI6IjE5NWY2ZDI5LTYyMjYtNDUyNC05ODJmLTc5M2ZhMThlN2ViOSIsIm5hbWUiOiJKb2huIFBldGVyIiwicnVyYWwiOnRydWUsInVuZGVycmVwcmVzZW50ZWQiOmZhbHNlLCJ0aW1lem9uZSI6LTQsImludGVyZXN0Q29tcGFuaWVzIjpbIk1pY3Jvc29mdCIsIkdvb2dsZSJdLCJpbnRlcmVzdFRhZ3MiOlsiQmFja2VuZCIsIkRhdGEiLCJweXRob24iLCJwaHAiXSwicmVxdWlyZUV4dGVuZGVkIjpmYWxzZSwidHJhY2siOiJBZHZhbmNlZCJ9.kZ_LsGQrno3kL5Gm_M9WD1ttFsHz4BO32aAZvAYt5n0
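The token above is just the JSON payload, base64url-encoded and signed. The route itself uses PyJWT's `decode`; the standard-library sketch below only illustrates what HS256 signing and verification involve. The helper names and the key are made up for the example.

```python
import base64
import hashlib
import hmac
import json


def _b64url(raw: bytes) -> str:
    # JWTs use URL-safe base64 with the "=" padding stripped.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the stripped padding before decoding.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def encode_hs256(payload: dict, key: str) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(
        key.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{_b64url(signature)}"


def decode_hs256(token: str, key: str) -> dict:
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(
        key.encode("utf-8"), signing_input.encode("utf-8"), hashlib.sha256
    ).digest()
    # Constant-time comparison of the expected and presented signatures.
    if not hmac.compare_digest(_b64url(expected).encode(), signature.encode()):
        raise ValueError("signature does not match")
    return json.loads(_b64url_decode(signing_input.split(".")[1]))


student = {"name": "John Peter", "timezone": -4, "track": "Advanced"}
token = encode_hs256(student, "not-the-real-key")
round_trip = decode_hs256(token, "not-the-real-key")
```

The resulting string is what would be appended to `/matches/` in the GET request. In practice a maintained library such as PyJWT is the safer choice, since it also handles expiry claims and algorithm checks.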

Phase 3: Placement Algorithm

1. Start with projects that already have exactly the right number of first-choice votes. Assign those students by adding their information to the saved project dictionary. Remove those students' votes from all other projects to avoid duplicates, and remove the filled projects.

2. Then assign first-choice students to projects that have fewer first-choice votes than they need. Also remove those students' votes from other projects, and decrement `proj_size_remaining`.

3. Then assign second-, third-, and later-choice votes as needed until `proj_size_remaining` = 0, then remove the project. Do this in order (all second-choice votes, then all third-choice votes, and so on) so that each student gets their lowest-numbered (most preferred) choice possible. If multiple students are tied, assign based on which student has the fewest votes left in other projects. Also remember to remove the student from all other projects when their vote is saved.

4. Once all projects with fewer first-choice votes than needed are dealt with, only projects that started with more than enough first-choice votes remain. Because of how students have been removed, these should have exactly the correct number of first-choice votes left. Assign these students, and complain loudly if something is wrong.
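The tie-breaking rule in step 3 can be sketched as follows. This is an illustration only, not the project's `place_students_of_choice_balanced`; the function name and the `votes` layout (`{project_id: {student_id: choice_rank}}`) are assumptions made for the example.

```python
def place_one_round(votes, project_id, choice, seats):
    """Place up to `seats` students who ranked `project_id` as `choice`.

    `votes` is mutated: placed students are removed from every project.
    """
    candidates = [
        student for student, rank in votes[project_id].items() if rank == choice
    ]

    def votes_elsewhere(student):
        # How many other projects this student could still be placed on.
        return sum(
            student in ballot
            for pid, ballot in votes.items()
            if pid != project_id
        )

    # Fewest remaining options first; the name breaks exact ties deterministically.
    chosen = sorted(candidates, key=lambda s: (votes_elsewhere(s), s))[:seats]

    for student in chosen:
        for ballot in votes.values():
            ballot.pop(student, None)
    return chosen


votes = {
    "web": {"ana": 2, "ben": 2},
    "ml": {"ana": 1, "ben": 3},
    "game": {"ana": 3},
}
placed = place_one_round(votes, "web", choice=2, seats=1)
```

Here ben gets the seat because he has only one other project left, while ana has two. The caller would still need to decrement `proj_size_remaining` by `len(placed)` and record the placement.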

  1. Multiple Steps - as many as needed
  2. Verbose, clear, and actionable descriptions
  3. Build in checks and metrics

  1. Planning
  2. Revising / Breaking
  3. goto 1
  4. Implementation
  5. Patching
  6. Tuning
  7. Being Done

Phase 3: Placement Algorithm

Or: Jake's Guide to Algorithm Design

"For every n minutes of planning, you save 10n minutes implementing"

- Sun Tzu

Phase 4: Manual Cleanup - Sometimes

Not always possible!

Try to work with others

Speed Round

Thanks for Watching!

CodeDay!

(it's pretty great)

www.codeday.org
