Exploring Corporate Digital Footprints: NLP Insights from WebSweep's High-Speed Web Scraping

Peter Gerbrands

Javier Garcia-Bernardo

Big picture

  • KvK: Annual Reports (PDF)
  • FBB: Extract Financial Info
  • KvK: Company Information
  • SIDN: ".nl" Registrations
  • WebSweep: Scrape websites
  • FBB: Site text and extra info
  • LISA: Employment
  • FBB: Linked Data "Backbone"

WebSweep

A User-Friendly and High-Speed Web Scraping Library

Modular:

  • Crawler: For each base URL, clicks and downloads up to 100 pages in parallel (1 second wait for each request per domain)
    • Input: List of base URLs
    • Output: HTML pages of each base URL
  • Extract: For each page (HTML), extracts key information (clean text, identifiers, etc.)
  • Consolidate: Consolidates, for each base URL, the information of different pages
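The crawling policy described for WebSweep (bounded parallel downloads plus a one-second wait per domain) can be sketched with the standard library. This is an illustration only, not WebSweep's implementation: the `PoliteScheduler` class, the `crawl` function, and the `fake_fetch` stand-in (which performs no network access) are hypothetical names introduced here.

```python
import asyncio
import time
from urllib.parse import urlsplit

MAX_PARALLEL = 100   # upper bound on simultaneous downloads (figure from the slide)
DOMAIN_DELAY = 1.0   # seconds between requests to the same domain (figure from the slide)


class PoliteScheduler:
    """Hypothetical helper: bounds concurrency and spaces out requests per domain."""

    def __init__(self, max_parallel=MAX_PARALLEL, delay=DOMAIN_DELAY):
        self._semaphore = asyncio.Semaphore(max_parallel)
        self._delay = delay
        self._next_slot = {}  # domain -> earliest allowed time of the next request

    async def run(self, url, fetch):
        domain = urlsplit(url).netloc
        now = asyncio.get_running_loop().time()
        # Reserve the next free time slot for this domain before waiting.
        slot = max(now, self._next_slot.get(domain, now))
        self._next_slot[domain] = slot + self._delay
        await asyncio.sleep(slot - now)
        async with self._semaphore:
            return await fetch(url)


async def crawl(urls, fetch, **kwargs):
    scheduler = PoliteScheduler(**kwargs)
    return await asyncio.gather(*(scheduler.run(u, fetch) for u in urls))


async def fake_fetch(url):
    # Stand-in for a real HTTP download.
    return f"<html>{url}</html>"


urls = ["https://a.example/", "https://a.example/contact", "https://b.example/"]
start = time.perf_counter()
# A short delay is used here only so the demo finishes quickly.
pages = asyncio.run(crawl(urls, fake_fetch, delay=0.05))
elapsed = time.perf_counter() - start
```

The two `a.example` requests are spaced by the delay, while the `b.example` request is not held back by them.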

Example code (as a library)

Example code (as CLI)

Other features:

  • Decide which pages to download based on the URL
  • Retry failed downloads in different ways
  • Extract and crawl at the same time (avoid saving HTML files)
  • Respects robots.txt (legal)
  • Can add the IP address and host of each domain (but this is too slow)
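Two of these features, respecting robots.txt and choosing pages by URL, can be sketched with Python's standard `urllib.robotparser`. This is a minimal sketch and not WebSweep's code: the `should_download` function, the `example-bot` user agent, and the `skip_words` filter are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_download(url, parser, user_agent="example-bot", skip_words=("login", "cart")):
    """Hypothetical filter: obey robots.txt, then skip URLs containing unwanted words."""
    if not parser.can_fetch(user_agent, url):
        return False
    return not any(word in url.lower() for word in skip_words)


parser = build_parser(ROBOTS_TXT)
allowed = should_download("https://example.nl/contact", parser)
blocked_by_robots = should_download("https://example.nl/private/data", parser)
blocked_by_filter = should_download("https://example.nl/login", parser)
```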

Limitations:

  • Performance declines with 50,000+ domains (~1M pages)
  • Only HTTPS, no JavaScript, no PDFs (yet)

Scraping goals of FIRMBACKBONE

1. Scrape corporate websites at regular intervals:

  • ~2,000,000 .nl domains (from SIDN), each with dozens of pages
  • Find and scrape URLs of Dutch websites outside of the .nl domain

2. Extract useful information

3. Link it to KvK data

Today:

  • Performance/statistics on sample: 80,000 domains
  • Three potential applications:
    • Query database through keywords
    • Query database through Large Language Models
    • Clustering companies using Large Language Models

 

Performance

(on a sample)

Statistics: errors

80,000 domains (364,426 pages) downloaded in ~18 hours

~20,000 pages/hour ≈ 5.6 pages/second

Domain level

  • 3% retried and corrected (automatically)
  • 40% broken domains (or not secure)
  • 20% problems with scraper (could be complemented)

Page level (of successful domains)

  • 6% retried and corrected (automatically)
  • 1% broken links (or not secure)
  • 4.5% problems with scraper (could be complemented)

Statistics: performance

Personal computer (limited by Internet)

Server (disk may not be fast enough)

Statistics: matching to companies

Data:

Out of the 28,671 domains:

  • 25,582 had more than 100 characters. Of those:
    • 6,257 had a KvK number --> We could match directly to KvK data
      • 5,361 matched on primary postcode
    • An additional 12,089 had some address --> Could be matched
    • 7,236 had no or little additional data
      • Many sites do not have any contact info (apart from email/form)
      • Many sites are down and have a standard error message
      • Some important pages were not downloaded by the scraper

Statistics: matching to companies

SIDN comparison:

Out of the 28,671 domains downloaded, 5,831 were linked to a KvK number in the original data (SIDN). Of those:

  • Found matching KvK number by WebSweep: 3,898 domains
  • Different KvK number found: 100 domains
  • No KvK number found: 1,833 domains. From hand-labeling a small sample:
    • 1/3 of domains did not show the KvK number
    • 1/3 wrote the KvK number only in PDFs
    • 1/3 had other problems (unusual technologies, zero-width spaces)

KvK number only in the scraped data: 2,359 domains
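Extracting the identifiers used for matching can be sketched with regular expressions. This is a minimal illustration, not WebSweep's extraction code: it assumes KvK numbers are written as 8 digits shortly after a "KvK" label and that Dutch postcodes consist of four digits followed by two capital letters. The function name `extract_identifiers` and the sample text are hypothetical. Zero-width characters are stripped first, since they were among the problems found in the hand-labeled sample.

```python
import re

# Zero-width characters that can hide inside otherwise plain digit sequences.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Assumption: an 8-digit number within 20 non-digit characters after "KvK".
KVK_RE = re.compile(r"\bkvk\b\D{0,20}?(\d{8})\b", re.IGNORECASE)
# Dutch postcode: four digits (not starting with 0) and two capital letters.
POSTCODE_RE = re.compile(r"\b([1-9]\d{3})\s?([A-Z]{2})\b")


def extract_identifiers(text):
    """Return KvK numbers and postcodes found in the cleaned page text."""
    clean = text.translate(ZERO_WIDTH)  # mapping to None deletes the character
    kvk = sorted({m.group(1) for m in KVK_RE.finditer(clean)})
    postcodes = sorted({digits + letters for digits, letters in POSTCODE_RE.findall(clean)})
    return {"kvk": kvk, "postcodes": postcodes}


# Made-up sample with a zero-width space inside the KvK number.
sample = "KvK-nummer: 1234\u200b5678, Voorbeeldstraat 1, 1234 AB Amsterdam"
found = extract_identifiers(sample)
```

Requiring the "KvK" label trades recall for precision: bare 8-digit numbers (phone numbers, for instance) are ignored, at the cost of missing unlabeled KvK numbers.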

Example Analyses

Querying FIRMBACKBONE database

Imagine you are interested in companies dealing with horses

You want to test whether climate change is having an impact on companies that raise horses

 

You could check KvK data by sectors, but:

  • Sectors can be coarse:
    • Agriculture, code 01.49: "Other animal production"
    • Recreation, code 93.19: "Other sports activities"
  • Companies are often involved in several sectors, and owned by financial companies

 

Solution: We could find horse companies by their website, and link to KvK data for analysis

Approach 1: Querying using keywords

Convert each website to a vector:

  • Elements in the vector = Words
  • Weight of the words = TF-IDF
    • TF: Term frequency: how often the word appears
    • IDF: Inverse document frequency: how widespread the word is

Keywords: Words with the highest weight
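The keyword weighting above can be sketched in plain Python. This is a minimal version that assumes the unsmoothed form TF × log(N / DF); libraries typically use smoothed variants, and the slides do not say which one was used. The function names and the toy texts are illustrative only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase runs of letters; digits and punctuation are dropped.
    return re.findall(r"[^\W\d_]+", text.lower())


def tfidf_keywords(documents, top_n=3):
    """Return the top_n highest-weighted words for each document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for c in counts for word in c)
    keywords = []
    for c in counts:
        total = sum(c.values())
        weights = {
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in c.items()
        }
        # Sort by weight (descending), breaking ties alphabetically.
        ranked = sorted(weights, key=lambda w: (-weights[w], w))
        keywords.append(ranked[:top_n])
    return keywords


websites = [
    "horse stable horse training and horse boarding",
    "sushi restaurant with fresh sushi and sashimi",
    "bike shop and bike repair",
]
top_words = tfidf_keywords(websites, top_n=1)
```

A word such as "and" that appears on every site gets weight zero, which is why it never surfaces as a keyword.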

 

Horse domains:

Approach 2: Querying using LLM

Convert each website to a vector:

  • Vector = Embedding created by a large language model (in this case S-BERT multilingual)

 

Similar vectors = similar meaning in the text
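One common way to compare embedding vectors is cosine similarity; the slides do not state which distance was used, so this is an assumption. The sketch below uses made-up 3-dimensional vectors in place of real model embeddings, and the site names and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query_vec, site_vecs, n=2):
    """Return the n site names whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in site_vecs.items()]
    return [name for _, name in sorted(scored, reverse=True)[:n]]


# Toy vectors; real embeddings would come from the language model.
site_vecs = {
    "sushi-bar.example": [0.9, 0.1, 0.0],
    "horse-stable.example": [0.0, 0.2, 0.9],
    "noodle-shop.example": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.0, 0.0]
closest = nearest(query_vec, site_vecs, n=2)
```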

Approach 2: Querying using LLM

Instead of querying by the presence of keywords, we can query by meaning

query = "Websites die sushi verkopen bieden een scala aan Japanse delicatessen, waaronder diverse soorten sushi, zoals nigiri, maki en sashimi."

We can:

  • Convert query to vector using the same model
  • Find nearest neighbors to that vector
  • Retrieve the n closest to the description

Clustering websites

We have vectors for each website --> We can use them for clustering (maybe eventually to create our own sectors).

2D projection using PaCMAP and clusters

Keywords extracted using cTF-IDF
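Class-based TF-IDF treats all texts in a cluster as one document and weights words by how specific they are to that cluster. The sketch below assumes one common formulation, tf × log(1 + A / f), where A is the average number of words per cluster and f is the word's total frequency; the exact variant behind the slide is not stated. Cluster labels are taken as given, and the function name and toy texts are hypothetical.

```python
import math
from collections import Counter


def ctfidf_keywords(texts, labels, top_n=2):
    """Return the top_n most cluster-specific words for each cluster label."""
    # Merge the word counts of all texts that share a cluster label.
    per_cluster = {}
    for text, label in zip(texts, labels):
        per_cluster.setdefault(label, Counter()).update(text.lower().split())
    # Total frequency of each word across all clusters.
    total_freq = Counter()
    for counts in per_cluster.values():
        total_freq.update(counts)
    avg_words = sum(total_freq.values()) / len(per_cluster)
    keywords = {}
    for label, counts in per_cluster.items():
        size = sum(counts.values())
        weights = {
            word: (n / size) * math.log(1 + avg_words / total_freq[word])
            for word, n in counts.items()
        }
        # Sort by weight (descending), breaking ties alphabetically.
        keywords[label] = sorted(weights, key=lambda w: (-weights[w], w))[:top_n]
    return keywords


texts = [
    "horse stable and riding",
    "horse riding lessons",
    "sushi and sashimi",
    "fresh sushi delivery",
]
labels = [0, 0, 1, 1]
cluster_words = ctfidf_keywords(texts, labels, top_n=1)
```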

Conclusions

WebSweep needs some adjustments:

  • Tweak concurrency to avoid errors
  • Websites are very diverse

Interested in using/contributing to WebSweep? Send an email: javier.gbe@pm.me

Scraped data connected to financial data holds great potential:

  • Find companies to analyze:
    • Sectors that are not well-defined
  • Analyze specific companies (e.g. large companies in energy owned by states)
    • How corporations display themselves to the public: framing
  • Track topics over time
  • Test the impact of policy
  • Detecting innovation
  • ...

Potential discussion points

  • Finding URL of companies:
    • Which source works best?
    • Small test with 1,287 URLs:
      • 566 same URL in Orbis, DDG and SIDN
      • 239 different URL in all sources
      • 222 same in Orbis and DDG, different in SIDN
      • 135 same in Orbis and SIDN, different in DDG
      • 125 same in SIDN and DDG, different in Orbis
  • We then scraped all websites to see which source shows the highest agreement between the reported KvK number and the KvK number on the website.
    • DDG matches KvK in 54% --> only using name!
    • Orbis matches KvK in 61% --> KvK/website reported by company
    • SIDN matches KvK in 70% --> KvK/website reported by company

General: Which source to trust? e.g., 1/3 of websites have different postcodes than those in Orbis/KvK

  • Redirects and URL changes:

    • Follow the redirect and link to previous?
    • How to detect if it's the same company?
  • What features are useful for research?
    • What would be useful for FIRMBACKBONE
    • Keep HTML, cleaned text, or just features?
  • Representation: How do we make sure we got all the websites linked to horses/sushi, etc

 
