Digital Library

License

Dataset Description

This dataset is based on the public domain dataset "Biblioteca Digital. Documentos en dominio público".
It reflects the source as of 11 April 2023; the source may have changed since then.

Metadata Description

Field	Description
title	Title of the document
Author	Author or authors of the document
date	Date of the document
origin_country	Country of the document
language	Language of the document
subject	Subject of the document
genre	Genre of the document
digital_version	Digital version of the document
ocr	OCR version of the document
words	Number of words of the document
book_id	Identifier of the book
number_of_volumes	Number of volumes of the same book
entropy	Entropy of the document

Entropy values can be compared with those in the file 'mean_entropy.csv', which contains the mean entropy of 5 Spanish texts that have been properly converted to digital format.

The procedure for computing these values, and the choice of the values of n used in the entropy calculation, were taken from the analysis in the following paper, whose authors study the entropy of Spanish texts. One of their findings is that n should range from 1 to 18 to get the best results.
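As a rough illustration of computing an entropy value for each n from 1 to 18, here is a minimal sketch. It assumes that n refers to character n-grams and that the measure is plain Shannon entropy in bits; neither assumption is confirmed by this README, and the function names `ngram_entropy` and `entropy_profile` are hypothetical rather than taken from entropy.py.

```python
from collections import Counter
from math import log2
from typing import Dict, Iterable


def ngram_entropy(text: str, n: int) -> float:
    """Shannon entropy (in bits) of the character n-gram distribution of text.

    Assumption: n-grams are overlapping character windows of length n.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        # The text is shorter than n, so there is nothing to measure.
        return 0.0
    counts = Counter(grams)
    total = len(grams)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def entropy_profile(text: str, n_values: Iterable[int] = range(1, 19)) -> Dict[int, float]:
    """Entropy for every n in n_values (1 to 18 by default)."""
    return {n: ngram_entropy(text, n) for n in n_values}


if __name__ == "__main__":
    sample = "en un lugar de la mancha de cuyo nombre no quiero acordarme"
    for n, value in entropy_profile(sample).items():
        print(n, round(value, 3))
```

A profile computed this way could then be set side by side with the reference values in 'mean_entropy.csv', provided both use the same definition of entropy.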

File Description

Each file name is made up of 2 parts: the first is the id of the book and the second is the volume identifier, which goes from 0 to number_of_volumes - 1.
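The naming scheme can be sketched as follows. The separator between the two parts is not stated in this README, so the underscore used here is an assumption, as are the helper names `volume_file_names` and `split_file_name` and the example id `book123`.

```python
from typing import List, Tuple


def volume_file_names(book_id: str, number_of_volumes: int, sep: str = "_") -> List[str]:
    """Names for every volume of a book, with volume ids 0 .. number_of_volumes - 1.

    Assumption: the two parts are joined by sep (underscore by default).
    """
    return [f"{book_id}{sep}{volume}" for volume in range(number_of_volumes)]


def split_file_name(name: str, sep: str = "_") -> Tuple[str, int]:
    """Split a name back into (book_id, volume identifier)."""
    # Split on the last separator so a book id containing sep still works.
    book_id, volume = name.rsplit(sep, 1)
    return book_id, int(volume)


if __name__ == "__main__":
    print(volume_file_names("book123", 3))
    print(split_file_name("book123_2"))
```

Reading book_id and number_of_volumes from a metadata record is enough to enumerate all the files that belong to one book.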

Using the library

First steps

If this is the first time you use this library, download this file here and unzip it in the folder web_scraping/dominiopublico.
Then run main.py. This will create a file containing all the documents that are manuscripts and the ones that are books. This will take a while.
! If you have already run the program, skip the download step and just run main.py; it will do everything for you.
You will then be asked whether to continue. If you do, enter the directory where the files will be saved; if none is given, they will be saved in the directory books.

