Crawls a website looking for a token with a breadth-first algorithm. Does not cross domains.
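As a rough illustration of the approach described above, here is a minimal sketch of a breadth-first, same-domain crawl. It is not the code in crawl.js: the function name `crawlBreadthFirst` and the injected `fetchPage` callback (which resolves to a page body and its links) are assumptions made so the example can run without network access.

```javascript
// Breadth-first crawl restricted to the start page's host.
// `fetchPage` is a hypothetical injected function: url -> Promise<{ body, links }>.
async function crawlBreadthFirst(startUrl, searchTerm, fetchPage) {
  const start = new URL(startUrl);
  const visited = new Set([start.href]);
  const matches = [];
  let frontier = [start.href]; // every page at the current depth

  while (frontier.length > 0) {
    const next = [];
    for (const url of frontier) {
      const { body, links } = await fetchPage(url);
      if (body.includes(searchTerm)) matches.push(url);

      for (const link of links) {
        const absolute = new URL(link, url); // resolve relative links
        absolute.hash = ''; // treat /page and /page#section as the same page
        // Skip other domains and pages that are already queued or visited.
        if (absolute.host !== start.host || visited.has(absolute.href)) continue;
        visited.add(absolute.href);
        next.push(absolute.href);
      }
    }
    frontier = next; // move one level deeper only once the whole level is done
  }
  return matches;
}
```

Processing one level at a time is what makes the traversal breadth-first. This sketch fetches pages sequentially for clarity; a real crawler would issue several requests per level concurrently.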
Run the following commands:
npm i
npm link
npm start - this will by default crawl www.bbc.co.uk looking for the term Renault on all crawled pages.
OR
npx babel-node crawl.js {startPage} {searchTerm} {parallelism} - searches a domain from startPage, with a custom search term searchTerm. parallelism defines the maximum number of concurrent requests the crawler can make to the site at any given time.
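To make the three positional arguments concrete, here is a hedged sketch of how they might be read and validated. The function name `parseCrawlArgs`, the `https://` scheme on the default start page, and the default parallelism of 5 are all assumptions for illustration; only the default site and search term come from this README.

```javascript
// Defaults: the site and term are the ones `npm start` uses per this README.
// The https:// scheme and the parallelism of 5 are assumptions.
const DEFAULTS = {
  startPage: 'https://www.bbc.co.uk',
  searchTerm: 'Renault',
  parallelism: 5,
};

// `argv` is the list of arguments after the script name,
// e.g. process.argv.slice(2).
function parseCrawlArgs(argv) {
  const [
    startPage = DEFAULTS.startPage,
    searchTerm = DEFAULTS.searchTerm,
    rawParallelism,
  ] = argv;

  const parallelism = rawParallelism === undefined
    ? DEFAULTS.parallelism
    : Number.parseInt(rawParallelism, 10);

  // Reject zero, negative, and non-numeric values early.
  if (!Number.isInteger(parallelism) || parallelism < 1) {
    throw new Error(`parallelism must be a positive integer, got "${rawParallelism}"`);
  }
  return { startPage, searchTerm, parallelism };
}
```

For example, `parseCrawlArgs(['https://example.com', 'sale', '3'])` yields a start page of `https://example.com`, the search term `sale`, and at most 3 concurrent requests.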
These are some of the things I've not yet had time to address within the time budget:
- Static user agent
- Does not take robots.txt into account
- Application does not refill HTTP request slots immediately when they resolve - could be more efficient
- Needs Babel compilation
- Ideally needs some configurable limitations - e.g. max depth of URLs to crawl, max time to crawl for, etc
- CrawlerQueue could probably do with decomposition
- CrawlerQueue needs unit tests!
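On the point about request slots not being refilled immediately, one possible improvement is a worker-style pool in which each slot picks up the next URL as soon as its current request settles. The sketch below is an assumption about how that could look, not a description of CrawlerQueue; `runWithLimit` and `task` are hypothetical names, and the list of items is fixed up front, whereas a real crawl queue grows as links are discovered.

```javascript
// Runs `task(item)` for every item with at most `limit` tasks in flight.
// Each worker takes the next item the moment its current task settles,
// so a free slot never waits for the rest of a batch to finish.
async function runWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let nextIndex = 0;

  async function worker() {
    while (nextIndex < items.length) {
      // Claiming the index is synchronous, so two workers never take the same item.
      const index = nextIndex++;
      results[index] = await task(items[index]);
    }
  }

  // Start up to `limit` workers; a rejected task rejects the whole run.
  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With `limit` set to the crawler's parallelism value, a slow page occupies only one slot while the remaining slots keep working through the queue.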