These steps only need to be done once.
This project uses parameters which should be kept private, such as API keys. To
keep them secret, they are loaded as environment variables from a `.env` file.
To create your own `.env` file, first copy the example:

```shell
cp .env.example .env
```

The instructions in the following sections will walk you through configuring
your `.env` file.
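Once fully configured, your `.env` file will contain entries like the following. The values shown here are placeholders, not working credentials; the variable names are the ones referenced in the sections below.

```
OPENAI_API_KEY=your-openai-api-key
GOOGLE_SERVICE_ACCOUNT_KEY='{"type": "service_account", ...}'
GOOGLE_SPREADSHEET_ID=your-spreadsheet-id
GOOGLE_SHEET_NAME=Sheet1
```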
Optionally create a virtual environment and activate it using these instructions.
Install the Python dependencies by running

```shell
pip install -r requirements.txt
```

This project uses OpenAI's API for AI-powered web scraping.
- Create an OpenAI account and API key.
- Save the API key to the `.env` file you created above next to
  `OPENAI_API_KEY`.
Note that OpenAI imposes rate limits on newly-created accounts.
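If you do hit a rate limit, retrying with exponential backoff usually resolves it. Below is a minimal, illustrative sketch of that pattern; it is not part of this project, and `flaky` is a hypothetical stand-in for a rate-limited API call.

```python
import time


def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` with exponential backoff when it raises an exception."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Wait base_delay, 2*base_delay, 4*base_delay, ... before retrying.
            time.sleep(base_delay * 2 ** attempt)


# Example: a stand-in call that fails twice, then succeeds.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_backoff(flaky, base_delay=0.01))  # prints "ok"
```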
These steps are required if you want to publish the results to a Google spreadsheet. They can safely be done at a later time if you'd prefer to only store the results locally for now.
- Go to https://console.cloud.google.com/
- Select the project you wish to use or create a new one.
- Go to https://console.cloud.google.com/iam-admin/serviceaccounts
- Click "Create Service Account" near the top of the page
- Enter an ID and optionally a name and description.
- Click "Done" near the bottom. We can skip the optional steps.
- Under "Actions", click the three dots and select "Manage keys".
- Click "Add Key" and then "Create new key". Pick JSON for the key type.
- Your browser will download a JSON file containing the key. Open the file
  and copy the entire contents into your `.env` file under
  `GOOGLE_SERVICE_ACCOUNT_KEY`. Be sure to keep the single quotes and curly
  braces in place.
- Return to the Service Accounts overview page.
- Copy the email address for the Service Account. It should have the form
  `service-account-id@project-123456.iam.gserviceaccount.com`.
- Open the spreadsheet where you'd like results to be published.
- Click the "Share" button in the top left.
- Share the sheet with the service account by pasting in its email address. Be sure to make it an Editor.
- Copy the ID for the spreadsheet from its URL. The ID is a long string of
  letters and numbers.

  ```
  https://docs.google.com/spreadsheets/d/vIsx_YJu2tDVscuvLIe73-HGIZ_HDqf-k7DkTzrOEErr/edit?gid=0#gid=0
  #                                      ^------- this is the spreadsheet ID -------^
  ```

- Paste the ID into your `.env` file next to `GOOGLE_SPREADSHEET_ID`.
- Copy the name of the sheet (tab) where you want the results published. By
  default, it will have the name `Sheet1`.
- Paste the sheet name into your `.env` file next to `GOOGLE_SHEET_NAME`.
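If you'd rather not pick the ID out of the URL by eye, it can be extracted programmatically. This is a small illustrative sketch, not part of the project's code:

```python
import re


def spreadsheet_id(url):
    """Extract the spreadsheet ID from a Google Sheets URL.

    The ID is the path segment immediately after "/spreadsheets/d/".
    """
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", url)
    if match is None:
        raise ValueError(f"not a Google Sheets URL: {url}")
    return match.group(1)


url = "https://docs.google.com/spreadsheets/d/vIsx_YJu2tDVscuvLIe73-HGIZ_HDqf-k7DkTzrOEErr/edit?gid=0#gid=0"
print(spreadsheet_id(url))
```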
Run `main.py`. The only required argument is the file where you'd like to save
the results.

```shell
python -m scraper output.csv
```

For more information, run

```shell
python -m scraper --help
```

This project can publish the scrape results to a Google spreadsheet to make
them easy to share. Before proceeding, make sure you've followed the setup
instructions above.
Like the main module, there's one required argument: the CSV file containing
the results.

```shell
python -m scraper.common.exporters.google_sheets output.csv
```

For more information, run

```shell
python -m scraper.common.exporters.google_sheets --help
```

The script will append new rows to the bottom of the sheet. It checks the rows
already in the sheet to avoid duplicates.
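The duplicate check can be thought of as filtering the candidate rows against the rows already in the sheet. A simplified sketch of that idea follows; the project's actual exporter may differ in detail, and the sample rows are made up for illustration.

```python
def new_rows(existing, candidates):
    """Return only the candidate rows not already present in the sheet.

    Rows are compared as whole tuples, so two rows count as duplicates
    only when every cell matches.
    """
    seen = {tuple(row) for row in existing}
    return [row for row in candidates if tuple(row) not in seen]


existing = [["Acme", "https://acme.example"], ["Globex", "https://globex.example"]]
candidates = [["Acme", "https://acme.example"], ["Initech", "https://initech.example"]]
print(new_rows(existing, candidates))  # only the Initech row is new
```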
See here for maintenance information about this repository, including the GitHub Actions workflow which can automatically run the steps above.