This Python project showcases Natural Language Processing (NLP) capabilities by enabling users to query Google Drive documents within a specified folder using the transformers library. The code emphasizes good code quality, adherence to SOLID principles, and a well-structured codebase. Additionally, it demonstrates how to authenticate with Google Drive using OAuth 2.0 credentials dynamically.
The primary motivation behind this project is to illustrate the power of NLP in querying online documents. By leveraging the transformers library, users can extract valuable information from a collection of Google Drive documents, making it a versatile tool for various applications such as information retrieval, data analysis, and more.
To give a quick taste of the result, here is a recording of the code running against a Google Drive folder with 3 documents (all of them are provided here in the /examples folder).
In this example, we read two Word documents with information about the CERRADO and AMAZON biomes and one PDF book about the allegorical representation of elephants in literature. We then run some queries; each one retrieves the answer from the relevant documents along with the cosine similarity score between the query and the document.
My.Movie.1.mp4
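For readers unfamiliar with the cosine similarity score reported for each query, here is a minimal sketch of how such a score can be computed. It is only an illustration: the function name `cosine_similarity` and the toy vectors are assumptions, and the project itself may compute embeddings and scores differently (for example through `transformers` and a tensor library).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (illustrative values only).
query_vec = [1.0, 0.0, 1.0]
doc_vec = [1.0, 0.0, 1.0]
other_vec = [0.0, 1.0, 0.0]

same = cosine_similarity(query_vec, doc_vec)          # vectors point the same way
orthogonal = cosine_similarity(query_vec, other_vec)  # vectors share no component
```

A score near 1 means the query and the document point in nearly the same direction in embedding space, while a score near 0 means they have little in common.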
- NLP-based querying of Google Drive documents.
- OAuth 2.0 authentication for secure access.
- Demonstrates good code quality and coding principles.
- Makefile commands to improve Developer Experience (DX).
- CI/CD pipeline using GitHub Actions, running on every push to the main branch.
- Code coverage report using pytest-cov and codecov to ensure code quality.
- Code style enforcement using flake8, isort, and black to ensure code quality.
- pre-commit hooks to ensure code quality and security.
The project structure is organized as follows:
- hudson_utils/authentication.py: Handles OAuth 2.0 authentication with Google Drive.
- hudson_utils/google_drive.py: Provides methods to interact with Google Drive, including fetching documents from a specific folder.
- hudson_utils/text_processing.py: Defines the TextProcessor class, which extracts text from documents, combines them, and processes NLP queries using transformers.
- hudson_utils/args.py: Utility to retrieve command-line argument values.
- hudson_utils/main.py: Main entry point of the application, where everything is orchestrated.
- config/: Contains the JSON file with the OAuth 2.0 credentials for Hudson Dias's development account and temporarily holds the token.pickle file generated by the authentication process.
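The token.pickle file mentioned for config/ suggests a common pattern: run the browser-based OAuth flow once, cache the resulting token on disk, and reuse it on later runs. The sketch below illustrates that pattern with the standard library only. The names `load_or_create_token` and `fake_auth_flow` are hypothetical, the real flow is replaced by a stand-in function, and the actual authentication.py may work differently (for instance, it may also refresh expired credentials).

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, List


def load_or_create_token(token_path: Path, run_auth_flow: Callable[[], Any]) -> Any:
    """Return a cached token if one exists, otherwise run the flow and cache it."""
    if token_path.exists():
        # Only unpickle files you created yourself; pickle is unsafe on untrusted data.
        with token_path.open("rb") as fh:
            return pickle.load(fh)
    token = run_auth_flow()
    token_path.parent.mkdir(parents=True, exist_ok=True)
    with token_path.open("wb") as fh:
        pickle.dump(token, fh)
    return token


calls: List[str] = []


def fake_auth_flow() -> dict:
    """Hypothetical stand-in for the real browser-based OAuth 2.0 flow."""
    calls.append("auth")
    return {"access_token": "dummy-value"}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config" / "token.pickle"
    first = load_or_create_token(path, fake_auth_flow)   # runs the flow, writes the file
    second = load_or_create_token(path, fake_auth_flow)  # reads the cached file instead
```

With this design, the flow runs only on the first call; deleting the cached file forces a fresh authentication.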
To use this project, follow these steps:
To run this without setting up your own project in the Google Cloud Console, you need a hudson-dias-google-drive-crd.json file in the config folder. This file is not committed to the repo for security reasons, but you can get it from me through the following channels:
- WhatsApp - +55 61 999 378 984
- Email - diogo@dhdtech.io
- iMessage - +55 61 999 378 984 or diogo.hudson@gmail.com
Also, send me the Gmail account where the .doc files are stored, so I can add you as a test user of the Google Cloud project.
Install the required dependencies, including transformers, PyTorch, and the Google OAuth libraries. This project has a Makefile to help with the development process, so you can run the following command to install all the dependencies:

    make configure_devel

If you prefer to install the dependencies manually, you can run the following command (don't forget to create and activate your virtual environment first):

    pip install -r requirements.txt

- Activate your virtual environment. If it was created by the configure_devel command, you can run the following command:

    source venv/bin/activate

- Run the code by executing main.py and passing the desired values for the threshold and folder_name parameters.
  - threshold is the minimum cosine similarity score between the query and the document for it to be considered a match. If not specified, the code uses a default value of 0.5.
  - folder_name is the name of the folder in Google Drive to search for documents. If not specified, the code searches the entire drive.

    python main.py --threshold=0.5 --folder_name=folder_with_documents

The first time you run the code, your default browser will open to authenticate with Google Drive. After that, the code continues to run and you will see the results in the terminal.
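As a rough illustration of how the threshold and folder_name parameters could be parsed and applied, here is a minimal sketch using the standard argparse module. The function names `parse_args` and `filter_matches`, the file names, and the scores are all made up for the example; the project's own argument utility and matching logic may differ.

```python
import argparse
from typing import List, Optional, Tuple


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the two documented options, using the documented defaults."""
    parser = argparse.ArgumentParser(description="Query Google Drive documents")
    parser.add_argument(
        "--threshold",
        type=float,
        default=0.5,
        help="minimum cosine similarity for a document to count as a match",
    )
    parser.add_argument(
        "--folder_name",
        default=None,
        help="Google Drive folder to search; omit to search the entire drive",
    )
    return parser.parse_args(argv)


def filter_matches(
    scored: List[Tuple[str, float]], threshold: float
) -> List[Tuple[str, float]]:
    """Keep only the documents whose score reaches the threshold."""
    return [(name, score) for name, score in scored if score >= threshold]


args = parse_args(["--threshold=0.7", "--folder_name=folder_with_documents"])
defaults = parse_args([])

# Made-up document names and scores, purely for illustration.
scored = [("cerrado.docx", 0.82), ("amazon.docx", 0.64), ("elephants.pdf", 0.31)]
matches = filter_matches(scored, args.threshold)
```

Raising the threshold returns fewer but more relevant documents, while lowering it returns more documents at the cost of weaker matches.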
If you plan to use this repo and want to contribute to it, you will find some make commands that help with the development process and keep code quality, style, and security in check.
Run make to see all the available commands:

    make

The result will be something like this, and the commands are self-explanatory:
############################################### Hudson Dias Makefile ################################################
help Show this help message
lint_and_format Runs flake8, isort and black against the codebase
configure_devel Cleans up the environment and installs the development dependencies
#####################################################################################################################
Code Features:
- Handle documents other than doc files (e.g. pdf, txt, etc.)
- Add an online NLP service to process the queries (e.g. Google Cloud Natural Language API or even OpenAI's GPT-3)
CI/CD Features:
- Add missing unit tests
- Add caching on GitHub Actions to speed up the CI/CD process
- Create/Update Coverage badge on README.md after each CI/CD run
- Add minimum code coverage threshold to the CI/CD process. (As of now, it is only checking if the tests are passing)