Search Engine README

Overview

This project is a web-based search engine built using Flask, PyTerrier, and various NLP techniques. It processes, indexes, and retrieves documents from the CISI dataset. Additionally, it uses ELMo embeddings for query expansion and calculates various information retrieval metrics.

Features

Document Processing: Cleans and preprocesses text data by removing special characters, stop words, and applying stemming.
Query Expansion: Expands user queries using ELMo embeddings to improve search results.
Information Retrieval: Uses PyTerrier's TF_IDF model for retrieving relevant documents.
Evaluation Metrics: Calculates precision, recall, and mean average precision (MAP) for the search results.

Requirements

Python 3.x
Flask
pandas
pyterrier
nltk
scikit-learn
tensorflow
tensorflow_hub

Setup

Clone the repository:

git clone https://github.com/manohosny/search-engine.git
cd search-engine

Install dependencies:
```
pip install -r requirements.txt
```

Download NLTK data:

import nltk
nltk.download('stopwords')
nltk.download('punkt')

Download ELMo embeddings:

import tensorflow_hub as hub
elmo = hub.load("https://tfhub.dev/google/elmo/3")

Prepare the dataset: Ensure the CISI dataset (cisi.zip) is in the project directory. The dataset will be extracted and processed automatically.

Running the Application

Start the Flask server:
```
python app.py
```
Access the web interface: Open your browser and go to http://127.0.0.1:5000/.

Endpoints

`POST /search`

Description: Searches for documents based on the provided query.
Request: JSON object with a query field.
```
{
    "query": "sample search query"
}
```

Response: JSON array of search results with document ID, text, score, and retrieval time.

[
    {
        "doc_id": "1",
        "text": "Document content here...",
        "score": 0.85,
        "retrieval_time": 0.123
    },
    ...
]

`GET /`

Description: Renders the home page.
Response: HTML page.

Code Structure

app.py: Main Flask application file containing routes and search logic.
templates/: Directory for HTML templates.
static/: Directory for static files (CSS, JS, images).
cisi_dataset/: Directory where the CISI dataset will be extracted.

Key Functions

Text Processing

clean(text): Removes special characters, tabs, line jumps, and extra white spaces from the text.
remove_stopwords(text): Removes stop words from the text.
Stem_text(text): Applies stemming to the text.
preprocess(sentence): Combines cleaning, stop words removal, and stemming.

Dataset Handling

load_cisi_dataset(data_dir): Loads the CISI dataset.
read_documents(documents_path): Reads documents from the CISI.ALL file.
read_queries(queries_path): Reads queries from the CISI.QRY file.
read_qrels(qrels_path): Reads relevance judgments from the CISI.REL file.

Query Expansion

expand_query(input_query): Expands the input query using ELMo embeddings.

Evaluation Metrics

precision_at_k(actual, predicted, k): Calculates precision at K.
recall_at_k(actual, predicted, k): Calculates recall at K.
average_precision(actual, predicted): Calculates average precision.
mean_average_precision(actuals, predicteds): Calculates mean average precision.

Notes

Ensure TensorFlow and TensorFlow Hub are correctly installed to load ELMo embeddings.
Modify cisi.zip path if it is located in a different directory.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
DatasetIndex		DatasetIndex
__pycache__		__pycache__
cisi_dataset		cisi_dataset
templates		templates
terrier-prf		terrier-prf
README.md		README.md
app.py		app.py
cisi.zip		cisi.zip

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Search Engine README

Overview

Features

Requirements

Setup

Running the Application

Endpoints

`POST /search`

`GET /`

Code Structure

Key Functions

Text Processing

Dataset Handling

Query Expansion

Evaluation Metrics

Notes

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Search Engine README

Overview

Features

Requirements

Setup

Running the Application

Endpoints

POST /search

GET /

Code Structure

Key Functions

Text Processing

Dataset Handling

Query Expansion

Evaluation Metrics

Notes

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

`POST /search`

`GET /`

Packages