Contributions
Easily Accessible
Unrestricted access in two formats: SQL database and compressed .csv files
Centralized
Integration of three digital libraries: Domínio Público, Projecto Adamastor and BLPL
Enriched Metadata
Enriched metadata for works, authors and online reviews extracted from the Goodreads API
82,313 Public Domain Works
2,388 Goodreads Works
966 Goodreads Authors
80 Goodreads Genres
4,240 Goodreads Reviews
1,430 Goodreads Readers
Abstract
Combining human expertise with book-consumer data may provide what is needed to keep pace with the constant changes in the book publishing market. Building and making available datasets that fully cover the essential elements of the book industry ecosystem is therefore of great importance. However, little has been done in this context for non-English languages, such as Portuguese. Hence, in this work, we introduce PPORTAL, a Public domain PORTuguese-lAnguage Literature dataset composed of book-related metadata. Besides a high-level overview of the building process and dataset content, we provide a brief exploratory data analysis to summarize its main characteristics. Moreover, we highlight possible applications, showing how PPORTAL can be used as a resource within different research domains.
PPORTAL
Public domain Portuguese-language literature Dataset
All collected and enriched data are available in three separate versions (Preliminary, Goodreads, and Full). For each version, we generate a dump file containing the database structure and content, which can be imported into any MySQL server. As the dataset is structured in tabular format, we also make all three versions available in .csv format, which makes them easy to process in notebooks.
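As a rough illustration of the notebook workflow, the sketch below reads one compressed .csv table using only the Python standard library. It assumes gzip compression, and the file name `works.csv.gz` and the `title` and `author` columns are hypothetical stand-ins rather than the actual PPORTAL schema. To stay runnable without the download, it first writes a tiny sample file.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def load_rows(path):
    """Read a gzip-compressed .csv file into a list of dicts (one per row)."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


def write_sample(path):
    """Write a tiny stand-in file; these columns are hypothetical, not the real schema."""
    with gzip.open(path, mode="wt", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["title", "author"])
        writer.writerow(["Dom Casmurro", "Machado de Assis"])
        writer.writerow(["Os Lusíadas", "Luís de Camões"])


with tempfile.TemporaryDirectory() as tmp:
    # Assumed file name; replace with the path of a downloaded table.
    sample_path = Path(tmp) / "works.csv.gz"
    write_sample(sample_path)
    rows = load_rows(sample_path)

print(len(rows), "rows; columns:", list(rows[0].keys()))
```

With the real files, only the path passed to `load_rows` should need to change, provided the archive is gzip-compressed. If a different compression format is used, the matching standard-library module (for example `zipfile` or `bz2`) would take the place of `gzip`.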
Acknowledgments
This work is supported by CNPq, Brazil.