PPORTAL :: Public Domain Portuguese-language Literature Dataset

Contributions

Easily Accessible

Unrestricted access in two formats: SQL database and compressed .csv files

Centralized

Integration of three digital libraries: Domínio Público, Projecto Adamastor and BLPL

Enriched Metadata

Enriched metadata for works, authors and online reviews, extracted from the Goodreads API

82,313 Public Domain Works

2,388 Goodreads Works

966 Goodreads Authors

80 Goodreads Genres

4,240 Goodreads Reviews

1,430 Goodreads Readers

Abstract

Combining human expertise with book-consumer data may provide what is needed to keep up with the constant changes in the book publishing market. Building and releasing datasets that cover the essential elements of the book industry ecosystem is therefore of great importance. However, little has been done in this direction for non-English languages such as Portuguese. Hence, in this work, we introduce PPORTAL, a Public domain PORTuguese-lAnguage Literature dataset composed of book-related metadata. Besides a high-level overview of the building process and dataset content, we provide a brief exploratory data analysis to summarize its main characteristics. Moreover, we highlight possible applications, showing how PPORTAL can be used as a resource within different research domains.

Downloads

All collected and enriched data are available in three separate versions (Preliminary, Goodreads, and Full). For each version, we generate a dump file containing the database structure and content, which can be imported into any MySQL server. As the dataset is structured in tabular format, we also make all three versions available in .csv format, which makes them easy to process in notebooks.
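As a rough illustration of the notebook workflow, the sketch below reads one compressed .csv file into a list of row dictionaries using only the Python standard library. It assumes the file is gzip-compressed and UTF-8 encoded, which this page does not specify; the helper name `load_compressed_csv` and the file name in the usage comment are hypothetical, and the actual file and column names should be taken from the downloaded archive.

```python
import csv
import gzip
from pathlib import Path


def load_compressed_csv(path):
    """Read a gzip-compressed .csv file into a list of dicts, one per row.

    Assumption: the file is gzip-compressed and UTF-8 encoded. If the
    download uses another compression format (e.g. .zip), open it with
    the matching standard-library module instead.
    """
    with gzip.open(Path(path), mode="rt", encoding="utf-8", newline="") as handle:
        # DictReader uses the header row as keys for every following row.
        return list(csv.DictReader(handle))


# Hypothetical usage; replace the path with a file from the download:
# rows = load_compressed_csv("pportal_full/works.csv.gz")
# print(len(rows), "rows;", "columns:", list(rows[0].keys()))
```

Loading everything into a list is convenient for exploration; for the larger tables, iterating over the `csv.DictReader` directly would keep memory use lower.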

Schema


Acknowledgments

This work is supported by CNPq, Brazil.