Commit 66adb2f

Merge branch 'main' into internetarchive

2 parents: 03e4f8f + 925e721

20 files changed (+843 / -1955 lines)

Pipfile

Lines changed: 5 additions & 19 deletions

```diff
@@ -5,38 +5,24 @@ name = "pypi"
 
 [packages]
 babel = "*"
-flickrapi = "*"
+cachetools = "*" # Required by google-api-python-client
+feedparser = "*"
 GitPython = "*"
 google-api-python-client = "*"
-h11 = ">=0.16.0" # Ensure dependency is secure
 internetarchive = ">=5.5.1"
-python-iso639 = "*"
-jupyterlab = ">=3.6.7"
 matplotlib = "*"
-numpy = "*"
 pandas = "*"
-plotly = "*"
-pillow = ">=11.3.0" # Ensure dependency is secure
-Pyarrow = "*"
 Pygments = "*"
-python-dotenv = "*"
+python-iso639 = "*"
+PyYAML = "*"
 requests = ">=2.31.0"
-seaborn = "*"
-urllib3 = ">=2.5.0"
-wordcloud = "*"
+urllib3 = ">=2.6.3" # Ensure dependency is secure
 
 [dev-packages]
 black = "*"
-"black[jupyter]" = "*"
 flake8 = "*"
 isort = "*"
 pre-commit = "*"
 
 [requires]
 python_version = "3.11"
-
-[scripts]
-gcs_fetched = "./scripts/1-fetch/gcs_fetched.py"
-flickr_fetched = "./scripts/1-fetch/flickr_fetched.py"
-gcs_processed = "./scripts/2-process/gcs_processed.py"
-gcs_reports = "./scripts/3-report/gcs_reports.py"
```
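The tightened urllib3 pin above (`>=2.6.3`) can be sanity-checked against an installed version. A minimal sketch, not part of the commit, that compares simple dotted version strings (pre-release tags are not handled):

```python
# Illustrative only: check that a version satisfies a Pipfile-style ">=" floor,
# like the urllib3 pin in this commit. Naive numeric comparison; no pre-releases.
def satisfies_min(version: str, floor: str) -> bool:
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(version) >= as_tuple(floor)

print(satisfies_min("2.6.3", "2.6.3"))  # True: meets the new floor
print(satisfies_min("2.5.0", "2.6.3"))  # False: the old pin is now too low
```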

Pipfile.lock

Lines changed: 388 additions & 1826 deletions

Some generated files are not rendered by default.

README.md

Lines changed: 96 additions & 23 deletions

```diff
@@ -1,29 +1,16 @@
-# quantifying
+# Quantifying
 
-Quantifying the Commons
+Quantifying the Commons: measuring the size and diversity of the commons, the
+collection of works that are openly licensed or in the public domain
 
 
 ## Overview
 
-This project seeks to quantify the size and diversity of the commons--the
-collection of works that are openly licensed or in the public domain.
-
-
-### Meaningful
-
-The reports generated by this project (and the data fetched and processed to
-support it) seeks to be meaningful. We hope this project will provide data and
-analysis that helps inform discussions about the commons--the collection of
-works that are openly licensed or in the public domain.
-
-The goal of this project is to help answer questions like:
-- How has the world's use of the commons changed over time?
-- How is the knowledge and culture of the commons distributed?
-- Who has access (and how much) to the commons?
-- What significant trends can be observed in the commons?
-- Which public domain dedication or licenses are the most popular?
-- What are the correlations between public domain dedication or licenses and
-  region, language, domain/endeavor, etc.?
+This project seeks to quantify the size and diversity of the Creative Commons
+legal tools. We aim to track the collection of works (articles, images,
+publications, etc.) that are openly licensed or in the public domain. The
+project automates data collection from multiple data sources, processes the
+data, and generates meaningful reports.
 
 
 ## Code of conduct
```
```diff
@@ -47,6 +34,93 @@ See [`CONTRIBUTING.md`][org-contrib].
 [org-contrib]: https://github.com/creativecommons/.github/blob/main/CONTRIBUTING.md
 
 
+### The three phases of generating a report
+
+1. **Fetch**: This phase collects data from a particular source using its
+   API. Before writing any code, we plan the analyses we want to perform by
+   asking meaningful questions about the data. We also consider API
+   limitations (such as query limits) and design a query strategy that works
+   within them. Then we write a Python script that fetches the data. It is
+   important to follow the format of the existing scripts in the project and
+   to reuse their modules and functions where applicable; this keeps the
+   scripts consistent and makes it easier to debug any issues that arise.
+   - **Meaningful questions**
+     - The reports generated by this project (and the data fetched and
+       processed to support it) seek to be meaningful. We hope this project
+       will provide data and analysis that help inform discussions about the
+       commons. The goal of this project is to help answer questions like:
+       - How has the world's use of the commons changed over time?
+       - How is the knowledge and culture of the commons distributed?
+       - Who has access (and how much) to the commons?
+       - What significant trends can be observed in the commons?
+       - Which public domain dedication or licenses are the most popular?
+       - What are the correlations between public domain dedication or
+         licenses and region, language, domain/endeavor, etc.?
+   - **Limitations of an API**
+     - Some data sources provide APIs with query limits (daily or hourly,
+       depending on the documentation). This restricts how many requests can
+       be made in the specified period of time. It is important to plan a
+       query strategy and schedule fetch jobs to stay within the allowed
+       limits.
+   - **Headings of data in 1-fetch**
+     - [Tool identifier][tool-identifier]: A unique identifier used to
+       distinguish each Creative Commons legal tool within the dataset. This
+       helps ensure consistency when tracking tools across different data
+       sources.
+     - [SPDX identifier][spdx-identifier]: A standardized identifier
+       maintained by the Software Package Data Exchange (SPDX) project. It
+       provides a consistent way to reference licenses in applications.
+2. **Process**: In this phase, the fetched data is transformed into a
+   structured and standardized format for analysis. The data is then analyzed
+   and categorized based on defined criteria to extract insights that answer
+   the meaningful questions identified during the 1-fetch phase.
+3. **Report**: This phase focuses on presenting the results of the analysis.
+   We generate graphs and summaries that clearly show trends, patterns, and
+   distributions in the data. These reports help communicate key insights
+   about the size, diversity, and characteristics of openly licensed and
+   public domain works.
+
+[tool-identifier]: https://creativecommons.org/share-your-work/cclicenses/
+[spdx-identifier]: https://spdx.org/licenses/
```
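The query-limit planning described in the Fetch phase can be sketched as a client-side throttle. This helper is hypothetical, not code from the repository:

```python
import time

def throttled(items, per_minute=10):
    """Yield items no faster than an assumed per-minute API quota.
    The caller performs the actual API request for each yielded item."""
    interval = 60.0 / per_minute
    for item in items:
        started = time.monotonic()
        yield item  # request happens in the caller here
        elapsed = time.monotonic() - started
        if elapsed < interval:
            time.sleep(interval - elapsed)

# With a generous quota the throttle adds no noticeable delay:
queries = list(throttled(["page-1", "page-2", "page-3"], per_minute=60000))
```

Scheduling the fetch jobs themselves (see the quarterly schedule below) handles daily quotas; a throttle like this only smooths out per-minute bursts.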
```diff
+
+
+### Automation phases
+
+For automating these phases, the project uses Python scripts to fetch,
+process, and report data. GitHub Actions automatically runs these scripts on
+a defined schedule and on code updates. It handles script execution, manages
+dependencies, and ensures the workflow runs consistently.
+- **Script assumptions**
+  - Execution schedule for each quarter:
+    - 1-Fetch: first month, 1st half of second month
+    - 2-Process: 2nd half of second month
+    - 3-Report: third month
+- **Script requirements**
+  - *Must be safe*
+    - Scripts must not make any changes with default options
+    - The easiest way to run a script should also be the safest
+    - Have options spelled out
+  - *Must be timely*
+    - Scripts should complete within a maximum of 45 minutes
+    - Scripts shouldn't take longer than 3 minutes with default options
+      - That way there's a quick way to see what is happening when a script
+        is running (see execution, without errors, etc.); later, in
+        production, it can be run with longer options
+  - *Must be idempotent*
+    - [Idempotence - Wikipedia](https://en.wikipedia.org/wiki/Idempotence)
+    - This applies to both the data fetched and the data stored. If the data
+      changes randomly, we can't draw meaningful conclusions.
+  - *Balanced use of third-party libraries*
+    - Third-party libraries should be leveraged when they are API specific
+      (google-api-python-client, internetarchive, etc.)
+- **File formats**
+  - CSV: the format is well supported (rendered on GitHub, etc.), easy to
+    use, and the data used by the project is simple enough to avoid its
+    shortcomings.
+  - YAML: prioritizes human readability, which addresses the primary costs
+    and risks associated with configuration files.
+
+
```
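The idempotence requirement above can be illustrated with a deterministic CSV writer: sorted rows and fixed quoting mean that re-running a script over unchanged data produces a byte-identical file. A sketch, not code from the repository:

```python
import csv
import io

def render_counts(counts):
    """Render a {tool_identifier: count} mapping as CSV text.
    Rows are sorted so repeated runs over the same data are byte-identical,
    which keeps the stored data idempotent and diffs meaningful."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    writer.writerow(["TOOL_IDENTIFIER", "COUNT"])
    for tool, count in sorted(counts.items()):
        writer.writerow([tool, count])
    return buf.getvalue()
```

Dictionary insertion order varies with fetch order, so without the `sorted()` call two otherwise-identical runs could emit different files.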
```diff
 ### Project structure
 
 Please note that in the directory tree below, all instances of `fetch`,
```

````diff
@@ -91,8 +165,7 @@ Quantifying/
 ```
 
 
-## Development
-
+## How to set up
 
 ### Prerequisites
 
````
(filename not rendered)

Lines changed: 1 addition & 0 deletions

```diff
@@ -0,0 +1 @@
+"PLAN_INDEX","TOOL_IDENTIFIER","COUNT"
```

(filename not rendered)

Lines changed: 1 addition & 0 deletions

```diff
@@ -0,0 +1 @@
+"PLAN_INDEX","TOOL_IDENTIFIER","LANGUAGE","COUNT"
```

(filename not rendered)

Lines changed: 1 addition & 0 deletions

```diff
@@ -0,0 +1 @@
+"PLAN_INDEX","TOOL_IDENTIFIER","COUNTRY","COUNT"
```

(filename not rendered)

Lines changed: 8 additions & 0 deletions

```diff
@@ -0,0 +1,8 @@
+"TOOL_IDENTIFIER","SPDX_IDENTIFIER","COUNT"
+"BSD Zero Clause License","0BSD","64052"
+"CC0 1.0","CC0-1.0","350419"
+"CC BY 4.0","CC-BY-4.0","102675"
+"CC BY-SA 4.0","CC-BY-SA-4.0","30783"
+"MIT No Attribution","MIT-0","35103"
+"Unlicense","Unlicense","406459"
+"Total public repositories","N/A","289935546"
```
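The fully-quoted CSV above pairs each tool identifier with its SPDX identifier and a count. A sketch of reading it with the standard library (the sample string below copies a few rows from the new file; the filtering logic is ours):

```python
import csv
import io

# A few rows from the new data file shown above, including the totals row.
sample = '''"TOOL_IDENTIFIER","SPDX_IDENTIFIER","COUNT"
"CC0 1.0","CC0-1.0","350419"
"CC BY 4.0","CC-BY-4.0","102675"
"Total public repositories","N/A","289935546"
'''

rows = list(csv.DictReader(io.StringIO(sample)))
# Keep only rows that name a real license (the totals row has SPDX "N/A").
licensed = {row["TOOL_IDENTIFIER"]: int(row["COUNT"])
            for row in rows if row["SPDX_IDENTIFIER"] != "N/A"}
print(licensed["CC0 1.0"])  # 350419
```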

env.example

Lines changed: 7 additions & 0 deletions

```diff
@@ -37,3 +37,10 @@
 # https://docs.github.com/en/rest/authentication/authenticating-to-the-rest-api
 
 # GH_TOKEN =
+
+
+# Smithsonian
+
+# https://edan.si.edu/openaccess/apidocs/
+
+# DATA_GOV_API_KEY =
```
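A fetch script would read the new variable from the environment. The variable name `DATA_GOV_API_KEY` comes from the diff above; the helper itself is hypothetical:

```python
import os

def require_key(name="DATA_GOV_API_KEY", env=os.environ):
    """Return the named API key, or fail with a pointer to env.example.
    Hypothetical helper; the variable name is the one added in this commit."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; copy env.example to .env")
    return value

print(require_key(env={"DATA_GOV_API_KEY": "example-token"}))  # example-token
```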

scripts/1-fetch/europeana_fetch.py

Lines changed: 0 additions & 2 deletions

```diff
@@ -19,7 +19,6 @@
 
 # Third-party
 import requests
-from dotenv import load_dotenv
 from pygments import highlight
 from pygments.formatters import TerminalFormatter
 from pygments.lexers import PythonTracebackLexer
@@ -31,7 +30,6 @@
 
 # Setup
 LOGGER, PATHS = shared.setup(__file__)
-load_dotenv(PATHS["dotenv"])
 
 # Constants
 EUROPEANA_API_KEY = os.getenv("EUROPEANA_API_KEY")
```

scripts/1-fetch/gcs_fetch.py

Lines changed: 0 additions & 4 deletions

```diff
@@ -16,7 +16,6 @@
 
 # Third-party
 import googleapiclient.discovery
-from dotenv import load_dotenv
 from googleapiclient.errors import HttpError
 from pygments import highlight
 from pygments.formatters import TerminalFormatter
@@ -31,9 +30,6 @@
 # Setup
 LOGGER, PATHS = shared.setup(__file__)
 
-# Load environment variables
-load_dotenv(PATHS["dotenv"])
-
 # Constants
 BASE_URL = "https://www.googleapis.com/customsearch/v1"
 FILE1_COUNT = shared.path_join(PATHS["data_phase"], "gcs_1_count.csv")
```
