Peter Gerbrands
Javier Garcia-Bernardo
BIG PICTURE
Diagram: data sources feeding the FBB Linked Data "Backbone"
KvK: Annual Reports (PDF) → FBB: Extract Financial Info
KvK: Company Information
SIDN: ".nl" Registrations → WebSweep: Scrape websites → FBB: Site text and extra info
LISA: Employment
Crawler
Input: List of base URLs
For each base URL, clicks and downloads up to 100 pages in parallel (1 second wait for each request per domain)
Output: HTML pages of each base URL
Extract
For each page (HTML), extracts key information (clean text, identifiers, etc.)
Consolidate
Consolidates, for each base URL, the information from the different pages
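As an illustration of the Crawler step, a minimal sketch of parallel, rate-limited fetching (assuming aiohttp; this is not WebSweep's actual code):

    import asyncio
    import aiohttp

    MAX_PAGES = 100          # per base URL
    PER_DOMAIN_DELAY = 1.0   # seconds between requests to the same domain

    async def fetch_domain(session, page_urls):
        # Download up to MAX_PAGES pages of one domain, sequentially and politely
        pages = []
        for url in page_urls[:MAX_PAGES]:
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
                    pages.append(await resp.text())
            except (aiohttp.ClientError, asyncio.TimeoutError):
                pass  # broken link; the real pipeline retries and logs these
            await asyncio.sleep(PER_DOMAIN_DELAY)
        return pages

    async def crawl(domains):
        # domains: {base_url: [page URLs]}; domains run in parallel,
        # while requests within one domain stay 1 second apart
        async with aiohttp.ClientSession() as session:
            results = await asyncio.gather(
                *(fetch_domain(session, urls) for urls in domains.values())
            )
        return dict(zip(domains, results))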
Modular
Example code (as a library)
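A hypothetical sketch of library usage; the module and function names are assumptions, not WebSweep's documented API:

    from websweep import crawl, extract, consolidate  # hypothetical names

    pages = crawl(["https://voorbeeld.nl"], max_pages=100, delay=1.0)    # Crawler
    records = [extract(html) for html in pages["https://voorbeeld.nl"]]  # Extract
    site = consolidate("https://voorbeeld.nl", records)                  # Consolidate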
Example code (as CLI)
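A hypothetical CLI invocation; the subcommands and flags are illustrative, not WebSweep's documented interface:

    websweep crawl --input domains.txt --max-pages 100 --delay 1 --output pages/
    websweep extract --input pages/ --output extracted.jsonl
    websweep consolidate --input extracted.jsonl --output sites.parquet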
Other features:
Limitations:
1. Scrape corporate websites at regular intervals
2. Extract useful information
3. Link it to KvK data
Today:
80,000 domains (364,426 pages) downloaded in ~18 hours
~20,000 pages/hour ≈ 5.6 pages/second
Domain level:
3% retried and corrected (automatically)
40% broken domains (or not secure)
20% problems with scraper (could be complemented)
Page level (of successful domains):
6% retried and corrected (automatically)
1% broken links (or not secure)
4.5% problems with scraper (could be complemented)
Personal computer (limited by Internet speed)
Server (disk may not be fast enough)
Out of the 28,671 domains downloaded, 5,831 were linked to a KvK number in the original data (SIDN). Of those:
KvK number only in the scraped data: 2,359 domains
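For illustration, one way candidate KvK numbers could be pulled from scraped page text; the regex is a simplified assumption, not WebSweep's actual extractor (KvK numbers are 8 digits):

    import re

    # Simplified sketch: 8-digit numbers preceded by a "KvK" label
    KVK_RE = re.compile(r"\bkvk(?:[-\s]?(?:nummer|nr\.?))?\s*:?\s*(\d{8})\b", re.IGNORECASE)

    def find_kvk_numbers(text: str) -> set[str]:
        return set(KVK_RE.findall(text))

    find_kvk_numbers("KvK-nummer: 12345678")  # -> {"12345678"}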
Imagine you are interested in companies dealing with horses
You want to test whether climate change is having an impact on companies raising horses
You could check KvK data by sector, but:
Solution: we could find horse companies by their websites, and link them to KvK data for analysis
Convert each website to a vector:
Keywords: words with the highest weight (see the sketch below)
Horse domains:
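A minimal sketch of this keyword approach, assuming scikit-learn's TfidfVectorizer (the example texts are made up):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = [
        "paarden fokkerij stalling ruitersport manege",  # made-up horse site
        "sushi nigiri maki sashimi japans restaurant",   # made-up sushi site
    ]
    vec = TfidfVectorizer()
    X = vec.fit_transform(texts)          # one sparse vector per website
    terms = vec.get_feature_names_out()
    weights = X.toarray()[0]
    print([terms[i] for i in np.argsort(weights)[::-1][:5]])  # highest-weight words = keywords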
Convert each website to a vector (an embedding):
Similar vectors = similar meaning in the text
Instead of querying by the presence of keywords, we can query by meaning
query = "Websites die sushi verkopen bieden een scala aan Japanse delicatessen, waaronder diverse soorten sushi, zoals nigiri, maki en sashimi."
We can:
- Retrieve the n websites closest to the description
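A minimal sketch of such a semantic query, assuming sentence-transformers with a multilingual model (the model choice and example texts are assumptions):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # handles Dutch

    sites = {
        "voorbeeld-sushi.nl": "Wij verkopen verse sushi: nigiri, maki en sashimi.",
        "voorbeeld-paarden.nl": "Paardenfokkerij met stalling en ruitersportartikelen.",
    }
    urls = list(sites)
    vectors = model.encode(list(sites.values()), normalize_embeddings=True)

    query = "Websites die sushi verkopen bieden een scala aan Japanse delicatessen."
    q = model.encode([query], normalize_embeddings=True)[0]

    scores = vectors @ q                # cosine similarity (vectors are normalized)
    for i in np.argsort(scores)[::-1]:  # n closest websites to the description
        print(urls[i], round(float(scores[i]), 3))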
We have vectors for each website --> We can use them for clustering (maybe eventually to create our own sectors).
2D projection using PaCMAP and clusters
Keywords extracted using cTF-IDF
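A sketch of this clustering step (PaCMAP for the 2D projection, as on the slide; KMeans stands in for whichever clusterer was actually used, and the vectors are random stand-ins):

    import numpy as np
    import pacmap
    from sklearn.cluster import KMeans

    vectors = np.random.rand(500, 384).astype(np.float32)  # stand-in for website embeddings

    xy = pacmap.PaCMAP(n_components=2).fit_transform(vectors)      # 2D projection for plotting
    labels = KMeans(n_clusters=8, n_init=10).fit_predict(vectors)  # candidate "sectors"
    # Per-cluster keywords could then be extracted with cTF-IDF (as in BERTopic)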
WebSweep needs some adjustments:
Interested in using/contributing to WebSweep? Send an email: javier.gbe@pm.me
Scraped data connected to financial data holds great potential:
General: Which source to trust? e.g., 1/3 of websites list postcodes that differ from those in Orbis/KvK
Redirects and URL changes: