mmrezaee/ArXivQuest

ArXivQuest: Ask Questions From ArXiv Papers!

ArXiv now offers HTML versions of recent papers. This project provides a web-based interface for loading, viewing, and interactively querying these HTML documents. It uses Milvus to retrieve the document sections most relevant to a user's query, highlights them, and automatically scrolls to the highlights.

Problem Statement

Finding the passage that answers a specific question in a long document requires efficient text search. This project addresses the problem with semantic search over vector embeddings.
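
To make the idea of semantic search over embeddings concrete, here is a minimal sketch of similarity ranking. It assumes cosine similarity as the metric and uses made-up 3-dimensional vectors in place of real sentence embeddings; the function names `cosine_similarity` and `top_k` are illustrative and are not taken from this repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k(query_vec, sentence_vecs, k=2):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(sentence_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for sentence embeddings (assumption for illustration).
sentence_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.7, 0.3, 0.1],
]
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, sentence_vecs, k=2)
print([index for index, _ in results])
```

A vector database performs the same kind of ranking, but uses an index so that it does not have to compare the query against every stored vector.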

Features

This application is built on Flask and is designed to run locally. Start the server by running:

    python app.py

  • HTML Document Viewer: Loads and displays HTML content from specified URLs.
  • Interactive Query Interface: Enables users to input questions and receive contextually relevant answers.
  • Highlighting Relevant Content: Automatically highlights and scrolls to sections of the document relevant to the user's query.
  • Milvus Integration: Utilizes Milvus for efficient retrieval of document sections based on vector similarity.
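
The highlighting feature listed above could be implemented in several ways. The following is only a sketch of one option, assuming the server wraps the retrieved sentence in a `<mark>` element with a known `id`; the helper name `highlight_sentence` and the id `hit-0` are hypothetical and do not come from this repository.

```python
import html


def highlight_sentence(text, sentence, element_id="hit-0"):
    """Wrap the first occurrence of `sentence` in a <mark> element.

    `text` is treated as plain text, so everything is HTML-escaped.
    If the sentence is not found, the escaped text is returned unchanged.
    """
    start = text.find(sentence)
    if start == -1:
        return html.escape(text)
    end = start + len(sentence)
    return (
        html.escape(text[:start])
        + f'<mark id="{html.escape(element_id, quote=True)}">'
        + html.escape(sentence)
        + "</mark>"
        + html.escape(text[end:])
    )


paragraph = "We train on 1M samples. Accuracy is 92% & stable."
marked = highlight_sentence(paragraph, "Accuracy is 92% & stable.")
print(marked)
```

With markup like this, the browser side can scroll to the match with the standard DOM call `document.getElementById("hit-0").scrollIntoView()`.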

Approach

The project follows these steps:

  1. HTML Preprocessing: Parses HTML using BeautifulSoup to extract sentences and tables.
  2. Sentence Embedding: Uses the SentenceTransformer library to transform extracted sentences into vector embeddings, crucial for efficient similarity searches.
  3. Milvus Collection and Indexing: Establishes a Milvus Collection to store sentence embeddings and creates an index for fast and accurate retrieval.
  4. Query Processing: Converts sample questions into embeddings using the same SentenceTransformer model to ensure consistency.
  5. Exploring Different Configurations: Searches for texts similar to the query embedding under different search configurations, testing different combinations of search parameters.
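
Step 1 of the list above (HTML preprocessing) can be sketched as follows. To keep the example dependency-free it uses the standard library's `html.parser` in place of BeautifulSoup, handles only `<p>` elements (not tables), and splits sentences with a naive regular expression; the names `ParagraphExtractor`, `split_sentences`, and `extract_sentences` are illustrative assumptions, not the contents of `utils.py`.

```python
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._depth = 0      # > 0 while inside a <p> element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join the collected fragments and collapse runs of whitespace.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.paragraphs.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def split_sentences(paragraph):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def extract_sentences(html_text):
    """Return the sentences found inside <p> elements, in document order."""
    parser = ParagraphExtractor()
    parser.feed(html_text)
    parser.close()
    sentences = []
    for paragraph in parser.paragraphs:
        sentences.extend(split_sentences(paragraph))
    return sentences


sample = (
    "<html><body><h1>Title</h1>"
    "<p>We propose a model. It is <b>fast</b>!</p>"
    "<p>Results follow.</p></body></html>"
)
sentences = extract_sentences(sample)
print(sentences)
```

The regular expression splits incorrectly on abbreviations such as "e.g." or "Fig. 2", so a real pipeline would likely need a more careful sentence splitter. Each extracted sentence would then be embedded and stored, as described in steps 2 and 3.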

Project Structure

  • app.py: Main Flask application file.
  • llm_qa.py: Converts arXiv papers and questions to sentence transformer embeddings and retrieves the top K answers.
  • utils.py: Provides preprocessing tools for extracting sentences and tables from papers.
  • templates/: Contains HTML files for the web interface.
  • static/: Stores CSS and JavaScript files.
  • requirements.txt: Lists all Python libraries required by the project.

Installation

About

ArXivQuest: A simple web-based application that facilitates dynamic question-answering on arXiv's HTML papers, automatically ranking responses and highlighting relevant sections to streamline information retrieval.
