Commit 66adb2f

Merge branch 'main' into internetarchive

2 parents: 03e4f8f + 925e721

20 files changed (+843 / -1955 lines)

Pipfile

Lines changed: 5 additions & 19 deletions

```diff
@@ -5,38 +5,24 @@ name = "pypi"
 
 [packages]
 babel = "*"
-flickrapi = "*"
+cachetools = "*" # Required by google-api-python-client
+feedparser = "*"
 GitPython = "*"
 google-api-python-client = "*"
-h11 = ">=0.16.0" # Ensure dependency is secure
 internetarchive = ">=5.5.1"
-python-iso639 = "*"
-jupyterlab = ">=3.6.7"
 matplotlib = "*"
-numpy = "*"
 pandas = "*"
-plotly = "*"
-pillow = ">=11.3.0" # Ensure dependency is secure
-Pyarrow = "*"
 Pygments = "*"
-python-dotenv = "*"
+python-iso639 = "*"
+PyYAML = "*"
 requests = ">=2.31.0"
-seaborn = "*"
-urllib3 = ">=2.5.0"
-wordcloud = "*"
+urllib3 = ">=2.6.3" # Ensure dependency is secure
 
 [dev-packages]
 black = "*"
-"black[jupyter]" = "*"
 flake8 = "*"
 isort = "*"
 pre-commit = "*"
 
 [requires]
 python_version = "3.11"
-
-[scripts]
-gcs_fetched = "./scripts/1-fetch/gcs_fetched.py"
-flickr_fetched = "./scripts/1-fetch/flickr_fetched.py"
-gcs_processed = "./scripts/2-process/gcs_processed.py"
-gcs_reports = "./scripts/3-report/gcs_reports.py"
```
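The tightened urllib3 pin above (`>=2.6.3`) can be sanity-checked against an installed version. A minimal sketch, not part of the commit, that compares simple dotted version strings (pre-release tags are not handled):

```python
# Illustrative only: check that a version satisfies a Pipfile-style ">=" floor,
# like the urllib3 pin in this commit. Naive numeric comparison; no pre-releases.
def satisfies_min(version: str, floor: str) -> bool:
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(version) >= as_tuple(floor)

print(satisfies_min("2.6.3", "2.6.3"))  # True: meets the new floor
print(satisfies_min("2.5.0", "2.6.3"))  # False: the old pin is now too low
```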

Pipfile.lock

Lines changed: 388 additions & 1826 deletions

Some generated files are not rendered by default.

README.md

Lines changed: 96 additions & 23 deletions

```diff
@@ -1,29 +1,16 @@
-# quantifying
+# Quantifying
 
-Quantifying the Commons
+Quantifying the Commons: measuring the size and diversity of the commons, the
+collection of works that are openly licensed or in the public domain
 
 
 ## Overview
 
-This project seeks to quantify the size and diversity of the commons--the
-collection of works that are openly licensed or in the public domain.
-
-
-### Meaningful
-
-The reports generated by this project (and the data fetched and processed to
-support it) seeks to be meaningful. We hope this project will provide data and
-analysis that helps inform discussions about the commons--the collection of
-works that are openly licensed or in the public domain.
-
-The goal of this project is to help answer questions like:
-- How has the world's use of the commons changed over time?
-- How is the knowledge and culture of the commons distributed?
-- Who has access (and how much) to the commons?
-- What significant trends can be observed in the commons?
-- Which public domain dedication or licenses are the most popular?
-- What are the correlations between public domain dedication or licenses and
-  region, language, domain/endeavor, etc.?
+This project seeks to quantify the size and diversity of the Creative Commons
+legal tools. We aim to track the collection of works (articles, images,
+publications, etc.) that are openly licensed or in the public domain. The
+project automates data collection from multiple data sources, processes the
+data, and generates meaningful reports.
 
 
 ## Code of conduct
```
```diff
@@ -47,6 +34,93 @@ See [`CONTRIBUTING.md`][org-contrib].
 [org-contrib]: https://github.com/creativecommons/.github/blob/main/CONTRIBUTING.md
 
 
+### The three phases of generating a report
+
+1. **Fetch**: This phase collects data from a particular source using its
+   API. Before writing any code, we plan the analyses we want to perform by
+   asking meaningful questions about the data. We also consider API
+   limitations (such as query limits) and design a query strategy that works
+   within them. Then we write a Python script that fetches the data. It is
+   important to follow the format of the existing scripts in the project and
+   to reuse their modules and functions where applicable; this keeps the
+   scripts consistent and makes it easier to debug any issues that arise.
+   - **Meaningful questions**
+     - The reports generated by this project (and the data fetched and
+       processed to support it) seek to be meaningful. We hope this project
+       will provide data and analysis that help inform discussions about the
+       commons. The goal of this project is to help answer questions like:
+       - How has the world's use of the commons changed over time?
+       - How is the knowledge and culture of the commons distributed?
+       - Who has access (and how much) to the commons?
+       - What significant trends can be observed in the commons?
+       - Which public domain dedication or licenses are the most popular?
+       - What are the correlations between public domain dedication or
+         licenses and region, language, domain/endeavor, etc.?
+   - **Limitations of an API**
+     - Some data sources provide APIs with query limits (daily or hourly,
+       depending on the documentation). This restricts how many requests can
+       be made in the specified period of time. It is important to plan a
+       query strategy and schedule fetch jobs to stay within the allowed
+       limits.
+   - **Headings of data in 1-fetch**
+     - [Tool identifier][tool-identifier]: A unique identifier used to
+       distinguish each Creative Commons legal tool within the dataset. This
+       helps ensure consistency when tracking tools across different data
+       sources.
+     - [SPDX identifier][spdx-identifier]: A standardized identifier
+       maintained by the Software Package Data Exchange (SPDX) project. It
+       provides a consistent way to reference licenses in applications.
+2. **Process**: In this phase, the fetched data is transformed into a
+   structured and standardized format for analysis. The data is then analyzed
+   and categorized based on defined criteria to extract insights that answer
+   the meaningful questions identified during the 1-fetch phase.
+3. **Report**: This phase focuses on presenting the results of the analysis.
+   We generate graphs and summaries that clearly show trends, patterns, and
+   distributions in the data. These reports help communicate key insights
+   about the size, diversity, and characteristics of openly licensed and
+   public domain works.
+
+[tool-identifier]: https://creativecommons.org/share-your-work/cclicenses/
+[spdx-identifier]: https://spdx.org/licenses/
```
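The query-limit planning described in the Fetch phase can be sketched as a client-side throttle. This helper is hypothetical, not code from the repository:

```python
import time

def throttled(items, per_minute=10):
    """Yield items no faster than an assumed per-minute API quota.
    The caller performs the actual API request for each yielded item."""
    interval = 60.0 / per_minute
    for item in items:
        started = time.monotonic()
        yield item  # request happens in the caller here
        elapsed = time.monotonic() - started
        if elapsed < interval:
            time.sleep(interval - elapsed)

# With a generous quota the throttle adds no noticeable delay:
queries = list(throttled(["page-1", "page-2", "page-3"], per_minute=60000))
```

Scheduling the fetch jobs themselves (see the quarterly schedule below) handles daily quotas; a throttle like this only smooths out per-minute bursts.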
```diff
+
+
+### Automation phases
+
+For automating these phases, the project uses Python scripts to fetch,
+process, and report data. GitHub Actions automatically runs these scripts on
+a defined schedule and on code updates. It handles script execution, manages
+dependencies, and ensures the workflow runs consistently.
+- **Script assumptions**
+  - Execution schedule for each quarter:
+    - 1-Fetch: first month, 1st half of second month
+    - 2-Process: 2nd half of second month
+    - 3-Report: third month
+- **Script requirements**
+  - *Must be safe*
+    - Scripts must not make any changes with default options
+    - The easiest way to run a script should also be the safest
+    - Have options spelled out
+  - *Must be timely*
+    - Scripts should complete within a maximum of 45 minutes
+    - Scripts shouldn't take longer than 3 minutes with default options
+      - That way there's a quick way to see what is happening when a script
+        is running (see execution, without errors, etc.); later, in
+        production, it can be run with longer options
+  - *Must be idempotent*
+    - [Idempotence - Wikipedia](https://en.wikipedia.org/wiki/Idempotence)
+    - This applies to both the data fetched and the data stored. If the data
+      changes randomly, we can't draw meaningful conclusions.
+  - *Balanced use of third-party libraries*
+    - Third-party libraries should be leveraged when they are API specific
+      (google-api-python-client, internetarchive, etc.)
+- **File formats**
+  - CSV: the format is well supported (rendered on GitHub, etc.), easy to
+    use, and the data used by the project is simple enough to avoid its
+    shortcomings.
+  - YAML: prioritizes human readability, which addresses the primary costs
+    and risks associated with configuration files.
+
+
```
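The idempotence requirement above can be illustrated with a deterministic CSV writer: sorted rows and fixed quoting mean that re-running a script over unchanged data produces a byte-identical file. A sketch, not code from the repository:

```python
import csv
import io

def render_counts(counts):
    """Render a {tool_identifier: count} mapping as CSV text.
    Rows are sorted so repeated runs over the same data are byte-identical,
    which keeps the stored data idempotent and diffs meaningful."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    writer.writerow(["TOOL_IDENTIFIER", "COUNT"])
    for tool, count in sorted(counts.items()):
        writer.writerow([tool, count])
    return buf.getvalue()
```

Dictionary insertion order varies with fetch order, so without the `sorted()` call two otherwise-identical runs could emit different files.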
```diff
 ### Project structure
 
 Please note that in the directory tree below, all instances of `fetch`,
```

````diff
@@ -91,8 +165,7 @@ Quantifying/
 ```
 
 
-## Development
-
+## How to set up
 
 ### Prerequisites
 
````
(filename not rendered)

Lines changed: 1 addition & 0 deletions

```diff
@@ -0,0 +1 @@
+"PLAN_INDEX","TOOL_IDENTIFIER","COUNT"
```

(filename not rendered)

Lines changed: 1 addition & 0 deletions

```diff
@@ -0,0 +1 @@
+"PLAN_INDEX","TOOL_IDENTIFIER","LANGUAGE","COUNT"
```

(filename not rendered)

Lines changed: 1 addition & 0 deletions

```diff
@@ -0,0 +1 @@
+"PLAN_INDEX","TOOL_IDENTIFIER","COUNTRY","COUNT"
```

(filename not rendered)

Lines changed: 8 additions & 0 deletions

```diff
@@ -0,0 +1,8 @@
+"TOOL_IDENTIFIER","SPDX_IDENTIFIER","COUNT"
+"BSD Zero Clause License","0BSD","64052"
+"CC0 1.0","CC0-1.0","350419"
+"CC BY 4.0","CC-BY-4.0","102675"
+"CC BY-SA 4.0","CC-BY-SA-4.0","30783"
+"MIT No Attribution","MIT-0","35103"
+"Unlicense","Unlicense","406459"
+"Total public repositories","N/A","289935546"
```
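The fully-quoted CSV above pairs each tool identifier with its SPDX identifier and a count. A sketch of reading it with the standard library (the sample string below copies a few rows from the new file; the filtering logic is ours):

```python
import csv
import io

# A few rows from the new data file shown above, including the totals row.
sample = '''"TOOL_IDENTIFIER","SPDX_IDENTIFIER","COUNT"
"CC0 1.0","CC0-1.0","350419"
"CC BY 4.0","CC-BY-4.0","102675"
"Total public repositories","N/A","289935546"
'''

rows = list(csv.DictReader(io.StringIO(sample)))
# Keep only rows that name a real license (the totals row has SPDX "N/A").
licensed = {row["TOOL_IDENTIFIER"]: int(row["COUNT"])
            for row in rows if row["SPDX_IDENTIFIER"] != "N/A"}
print(licensed["CC0 1.0"])  # 350419
```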

env.example

Lines changed: 7 additions & 0 deletions

```diff
@@ -37,3 +37,10 @@
 # https://docs.github.com/en/rest/authentication/authenticating-to-the-rest-api
 
 # GH_TOKEN =
+
+
+# Smithsonian
+
+# https://edan.si.edu/openaccess/apidocs/
+
+# DATA_GOV_API_KEY =
```
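A fetch script would read the new variable from the environment. The variable name `DATA_GOV_API_KEY` comes from the diff above; the helper itself is hypothetical:

```python
import os

def require_key(name="DATA_GOV_API_KEY", env=os.environ):
    """Return the named API key, or fail with a pointer to env.example.
    Hypothetical helper; the variable name is the one added in this commit."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; copy env.example to .env")
    return value

print(require_key(env={"DATA_GOV_API_KEY": "example-token"}))  # example-token
```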

scripts/1-fetch/europeana_fetch.py

Lines changed: 0 additions & 2 deletions

```diff
@@ -19,7 +19,6 @@
 
 # Third-party
 import requests
-from dotenv import load_dotenv
 from pygments import highlight
 from pygments.formatters import TerminalFormatter
 from pygments.lexers import PythonTracebackLexer
@@ -31,7 +30,6 @@
 
 # Setup
 LOGGER, PATHS = shared.setup(__file__)
-load_dotenv(PATHS["dotenv"])
 
 # Constants
 EUROPEANA_API_KEY = os.getenv("EUROPEANA_API_KEY")
```

scripts/1-fetch/gcs_fetch.py

Lines changed: 0 additions & 4 deletions

```diff
@@ -16,7 +16,6 @@
 
 # Third-party
 import googleapiclient.discovery
-from dotenv import load_dotenv
 from googleapiclient.errors import HttpError
 from pygments import highlight
 from pygments.formatters import TerminalFormatter
@@ -31,9 +30,6 @@
 # Setup
 LOGGER, PATHS = shared.setup(__file__)
 
-# Load environment variables
-load_dotenv(PATHS["dotenv"])
-
 # Constants
 BASE_URL = "https://www.googleapis.com/customsearch/v1"
 FILE1_COUNT = shared.path_join(PATHS["data_phase"], "gcs_1_count.csv")
```
