# Quantifying the Commons

Measure the size and diversity of the commons: the collection of works that are openly licensed or in the public domain.
This project seeks to quantify the size and diversity of the Creative Commons legal tools. We aim to track the collection of works (articles, images, publications, etc.) that are openly licensed or in the public domain. The project automates data collection from multiple data sources, processes the data, and generates meaningful reports.
The Creative Commons team is committed to fostering a welcoming community. This project and all other Creative Commons open source projects are governed by our Code of Conduct. Please report unacceptable behavior to conduct@creativecommons.org per our reporting guidelines.
See CONTRIBUTING.md.
- Fetch: This phase involves collecting data from a particular source
using its API. Before writing any code, we plan the analyses we want to
perform by asking meaningful questions about the data. We also consider API
limitations (such as query limits) and design a query strategy to work
within these limitations. Then we write a Python script that gets the data.
It is important to follow the format of the existing scripts in the
project and to use the shared modules and functions where applicable; this
ensures consistency across scripts and makes it easier to debug any issues
that might arise.
- Meaningful questions
- The reports generated by this project (and the data fetched and
processed to support them) seek to be meaningful. We hope this project
will provide data and analysis that help inform discussions about the
commons. The goal of this project is to help answer questions like:
- How has the world’s use of the commons changed over time?
- How is the knowledge and culture of the commons distributed?
- Who has access (and how much) to the commons?
- What significant trends can be observed in the commons?
- Which public domain dedication or licenses are the most popular?
- What are the correlations between public domain dedication or licenses and region, language, domain/endeavor, etc.?
- Limitations of an API
- Some data sources impose API query limits (daily or hourly, depending on the documentation). This restricts how many requests can be made in a given period of time. It is important to plan a query strategy and schedule fetch jobs to stay within the allowed limits (see the fetch sketch after this list).
- Headings of data in 1-fetch
- Tool identifier: A unique identifier used to distinguish each Creative Commons legal tool within the dataset. This helps ensure consistency when tracking tools across different data sources.
- SPDX identifier: A standardized identifier maintained by the Software Package Data Exchange (SPDX) project. It provides a consistent way to reference licenses in applications.
- Process: In this phase, the fetched data is transformed into a structured and standardized format for analysis. The data is then analyzed and categorized based on defined criteria to extract insights that answer the meaningful questions identified during the 1-fetch phase.
- Report: This phase focuses on presenting the results of the analysis. We generate graphs and summaries that clearly show trends, patterns, and distributions in the data. These reports help communicate key insights about the size, diversity, and characteristics of openly licensed and public domain works.
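To illustrate the fetch phase and the query-limit planning described above, the following is a minimal sketch of a rate-limit-aware fetch loop that writes its results to CSV. The endpoint, parameters, and field names are hypothetical; the actual scripts in scripts/1-fetch/ follow the project's shared conventions.

```python
# Minimal sketch of a rate-limit-aware fetch loop (hypothetical endpoint
# and field names; real scripts follow the conventions in scripts/1-fetch/).
import csv
import time

import requests

API_URL = "https://api.example.com/v1/works"  # hypothetical endpoint
MAX_REQUESTS = 100  # stay well within the source's documented query limit
DELAY_SECONDS = 1  # simple pacing between requests


def fetch_pages():
    rows = []
    for page in range(1, MAX_REQUESTS + 1):
        response = requests.get(API_URL, params={"page": page}, timeout=30)
        response.raise_for_status()
        data = response.json()
        if not data["results"]:
            break  # no more pages to fetch
        for item in data["results"]:
            rows.append([item["tool_identifier"], item["spdx_identifier"]])
        time.sleep(DELAY_SECONDS)  # respect the source's query limit
    return rows


def save_csv(rows):
    # CSV keeps the data simple, diffable, and rendered nicely on GitHub
    with open("fetched_data.csv", "w", newline="") as file_obj:
        writer = csv.writer(file_obj)
        writer.writerow(["TOOL_IDENTIFIER", "SPDX_IDENTIFIER"])
        writer.writerows(rows)


if __name__ == "__main__":
    save_csv(fetch_pages())
```

Pacing requests and capping their number keeps a run within a source's documented limits and within the default-run time budget described below.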
For automating these phases, the project uses Python scripts to fetch, process, and report data. GitHub Actions is used to automatically run these scripts on a defined schedule and on code updates. It handles script execution, manages dependencies, and ensures the workflow runs consistently.
- Script assumptions
- Execution schedule for each quarter:
- 1-Fetch: first month, 1st half of second month
- 2-Process: 2nd half of second month
- 3-Report: third month
- Script requirements
- Must be safe (see the sketch after this list)
- Scripts must not make any changes with default options
- Easiest way to run script should also be the safest
- Have options spelled out
- Must be timely
- Scripts should complete within a maximum of 45 minutes
- Scripts shouldn’t take longer than 3 minutes with default options
- This provides a quick way to verify what is happening when a script runs (to see execution complete without errors, etc.). Later, in production, it can be run with options that take longer.
- Must be idempotent
- Idempotence – Wikipedia
- This applies to both the data fetched and the data stored. If the data changes randomly, we can’t draw meaningful conclusions.
- Balanced use of third-party libraries
- Third-party libraries should be leveraged when they are:
- API specific (google-api-python-client, internetarchive, etc.)
- File formats
- CSV: the format is well supported (rendered on GitHub, etc.), easy to use, and the data used by the project is simple enough to avoid any shortcomings.
- YAML: prioritizes human readability, which addresses the primary costs and risks associated with configuration files.
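The safety and timeliness requirements above can be sketched with argparse: by default the script makes no changes and performs only a small, fast run, and anything slower or state-changing must be requested explicitly. The flag names here are hypothetical, not the project's actual options.

```python
# Sketch of safe-by-default options (hypothetical flag names; the real
# scripts define their own arguments).
import argparse


def parse_arguments():
    parser = argparse.ArgumentParser(description="Fetch data from a source")
    # Safe: the default run makes no changes
    parser.add_argument(
        "--enable-save",
        action="store_true",
        help="write fetched data to the data directory (default: no writes)",
    )
    # Timely: a small default limit keeps a default run well under 3 minutes
    parser.add_argument(
        "--limit",
        type=int,
        default=10,
        help="maximum number of API queries to perform (default: 10)",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_arguments()
    print(f"save={args.enable_save} limit={args.limit}")
```

With this shape, the easiest invocation (no options) is also the safest, and every behavior beyond it is spelled out on the command line.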
Please note that in the directory tree below, all instances of fetch,
process, and report refer to the three phases of data gathering,
processing, and report generation.
```
Quantifying/
├── .github/
│ ├── workflows/
│ │ ├── fetch.yml
│ │ ├── process.yml
│ │ ├── report.yml
│ │ └── static_analysis.yml
├── data/ # Data generated by script runs
│ ├── 20XXQX/
│ │ ├── 1-fetch/
│ │ ├── 2-process/
│ │ ├── 3-report/
│ │ │ └── README.md # All generated reports are displayed in the README
│ └── ...
├── dev/
├── pre-automation/ # All Quantifying work prior to adding automation system
├── scripts/ # Run scripts for all phases
│ ├── 1-fetch/
│ ├── 2-process/
│ ├── 3-report/
│ ├── plot.py # Data visualizations with matplotlib
│ └── shared.py
├── .cc-metadata.yml
├── .flake8 # Python tool configuration
├── .gitignore
├── .pre-commit-config.yaml # Static analysis configuration
├── LICENSE
├── Pipfile # Specifies the project's dependencies and Python version
├── Pipfile.lock
├── README.md
├── env.example
├── history.md
├── pyproject.toml # Python tools configuration
└── sources.md
```
For information on learning and installing the prerequisite technologies for this project, please see Foundational technologies — Creative Commons Open Source.
This repository uses pipenv to manage the required Python modules:
- Install pipenv:
  - Linux: Installing Pipenv
  - macOS:
    - Install Homebrew
    - Install pipenv: `brew install pipenv`
  - Windows: Installing Pipenv
- Create the Python virtual environment and install prerequisites using pipenv: `pipenv sync --dev`
Client credentials should be stored in an environment file:
- Copy the contents of the `env.example` file in the script's directory to `.env`: `cp env.example .env`
- Uncomment the variables in the `.env` file and assign values as needed. See `sources.md` for how to get credentials:
  ```
  GCS_DEVELOPER_KEY = your_api_key
  GCS_CX = your_pse_id
  ```
- Save the changes to the `.env` file.
You should now be able to run scripts that require client credentials without
any issues. The .env file is ignored by git to help ensure sensitive data is
not distributed.
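As a sketch of how a script might read these credentials (assuming the python-dotenv package; the project's scripts may load them differently):

```python
# Sketch: loading client credentials from .env (assumes python-dotenv;
# the project's scripts may load credentials differently).
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the environment

developer_key = os.getenv("GCS_DEVELOPER_KEY")
pse_id = os.getenv("GCS_CX")
if developer_key is None or pse_id is None:
    raise SystemExit("Missing credentials: see sources.md and env.example")
```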
All of the scripts should be run from the root of the repository using pipenv. For example:

```
pipenv run ./scripts/1-fetch/github_fetch.py -h
```

When run this way, the shared library (`scripts/shared.py`) provides easy access
to all of the necessary paths, and all of the modules managed by pipenv are
available.
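For illustration only, the kind of path helper such a shared library might provide could look like this (a hypothetical sketch, not the actual contents of scripts/shared.py):

```python
# Hypothetical sketch of a shared path helper (not the actual contents of
# scripts/shared.py).
import os.path


def path_join(*paths):
    # Resolve paths relative to the repository root (the parent of the
    # scripts/ directory), so scripts behave the same regardless of the
    # current working directory
    repo_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    return os.path.join(repo_root, *paths)
```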
Static analysis tools ensure the codebase adheres to consistent formatting and style guidelines, enhancing readability and maintainability. Also see GitHub Actions, below.
Using pre-commit
Pre-commit allows for static analysis tools (black, flake8, isort, etc.)
to be run manually or with every commit:
- (Pre-commit is installed by completing Create the Python virtual environment and install prerequisites, above)
- Install or run manually:
  - Install the git hook scripts to enable automatic execution on every commit: `pipenv run pre-commit install`
  - Run manually using the helper dev script (if no file(s) are specified, it runs against all files): `./dev/check.sh [FILE]` or `./dev/check.sh`
- (Optional) review the configuration file: `.pre-commit-config.yaml`
- Python Guidelines — Creative Commons Open Source
- Black: the uncompromising Python code formatter
- flake8: a Python tool that glues together pep8, pyflakes, mccabe, and third-party plugins to check the style and quality of Python code.
- isort: A Python utility / library to sort imports
- (It doesn’t import any libraries, it only sorts and formats them.)
- pypa/pipenv: Python Development Workflow for Humans.
- pre-commit: A framework for managing and maintaining multi-language pre-commit hooks.
- Logging: Utilize the built-in Python logging module to implement a flexible logging system from a shared module.
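A minimal sketch of that pattern, assuming a hypothetical helper name in the shared module:

```python
# Sketch: a flexible logger configured once in a shared module (hypothetical
# helper name; the project's shared module may differ).
import logging


def setup_logger(name="quantifying", level=logging.INFO):
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Usage in a script:
# LOGGER = setup_logger()
# LOGGER.info("Fetch complete")
```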
The .github/workflows/static_analysis.yml
GitHub Actions workflow performs static analysis (black, flake8, and
isort) on committed changes. The workflow is triggered automatically when you
push changes to the main branch or open a pull request.
For information on the data sources used by this project, see sources.md.
For information on past efforts, see history.md.
- LICENSE: the code within this repository is licensed under the Expat/MIT license.
- The data within this repository is dedicated to the public domain under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.
- The documentation within the project is licensed under a Creative Commons Attribution 4.0 International License.