mheadd/s70-api-scraper


Schedule 70 Data Scraper

A data scraper for getting information from the GSA eLibrary site.

Background on Schedule 70 Data

GSA eLibrary schedules and contracts data covers far more than just Schedule 70, and it does not identify which contractors are authorized to work with state and local governments. It's also not clear how current this data is, or how often it is updated.

The GSA eLibrary site does allow for data to be downloaded, ostensibly in Excel format. However, the downloaded file appears to be HTML, not .xls. Also, system errors are encountered periodically when attempting to download data in this way.
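If you want to check a downloaded file yourself, something like the following can help. This is a minimal sketch and not part of this repo: `detect_format` is a hypothetical helper name, and the only facts it relies on are that a genuine legacy .xls file starts with the 8-byte OLE2 signature and that an HTML export will contain an HTML tag near the top.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in this repo): report what a ".xls" download really is.
# Usage: detect_format <path-to-downloaded-file>
detect_format() {
  local path="$1"
  local magic
  # Read the first 8 bytes as lowercase hex with no separators.
  magic="$(head -c 8 "$path" | od -An -tx1 | tr -d ' \n')"
  if [ "$magic" = "d0cf11e0a1b11ae1" ]; then
    # OLE2 compound file signature used by legacy Excel files.
    echo "xls"
  elif head -c 1024 "$path" | grep -qiE '<(html|table|!doctype)'; then
    # An HTML tag near the top of the file means it is really a web page.
    echo "html"
  else
    echo "unknown"
  fi
}
```

Running `detect_format` on the file you saved from eLibrary should print `html` if the download has the problem described above.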

For now, scraping the data from the eLibrary site gets around these limitations. It also provides useful data elements that are not available in the downloadable version (even if it were working properly), like the URL that links to a specific contractor's profile information.

Usage

This is admittedly ugly and inefficient, but it works. Suggestions for improvement are welcome. You'll need Node.js, csvkit, and MySQL installed to run this script.

  • Clone this repo and install dependencies: npm install
  • Create a new MySQL database and user:
mysql> CREATE DATABASE schedule70;
mysql> CREATE USER 'user'@'localhost' IDENTIFIED BY 'password';
mysql> GRANT CREATE, INSERT ON schedule70.* TO 'user'@'localhost';
  • Make the scraper script executable: chmod +x scraper.sh
  • Run the scraper script and pass in your MySQL credentials:
~$ ./scraper.sh "user" "password" "localhost" "schedule70"

Note: you can change which categories get scraped by editing the array in the scraper script.
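For readers unfamiliar with bash arrays, a category list driving a loop generally looks like the sketch below. The variable name, the placeholder values, and the `scrape_category` function are all assumptions made for illustration; check `scraper.sh` for the real names and category codes.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: names and values are placeholders, not taken from scraper.sh.
categories=("CATEGORY_A" "CATEGORY_B")

# Stand-in for whatever the real script does for each category.
scrape_category() {
  echo "scraping: $1"
}

# Quote the expansion so values containing spaces stay intact.
for category in "${categories[@]}"; do
  scrape_category "$category"
done
```

Adding or removing an entry between the parentheses changes what the loop visits; nothing else needs to change.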

Once the data is scraped, you can export it from MySQL into a CSV file like this:

~$ mysql -u user -p < sql/export.sql
