Enhance access to the central collections of the Staatsarchiv Zürich with an intelligent hybrid search application.
This repository provides the production-ready code for our search app, which is available online here.
To set up the app:
- Install uv for environment management.
- Clone this repository.
- Change into the project directory:
cd ai-search_staatsarchiv/
- Install the required packages:
uv venv and then uv sync
- Run the notebooks. You can open them in an IDE such as Visual Studio Code, or use Jupyter Notebook or JupyterLab.
- Use the final notebook to create the Weaviate search index. By default, the data is stored in .local/share/weaviate/. If you deploy the app on a remote machine, either copy the index data to the same path on that machine or change the path in the app:
client = weaviate.connect_to_embedded(persistence_data_path="/your_data_path_on_your_vm/")
- Start the app:
uv run streamlit run _streamlit_app/hybrid_search_stazh.py
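For the index path mentioned in the setup steps above, one way to avoid editing the code on each machine is to read the path from an environment variable. The following is a minimal sketch, not the app's actual code: the variable name WEAVIATE_DATA_PATH and the helper functions are hypothetical, and the default is assumed to sit under the user's home directory. Only the call weaviate.connect_to_embedded(persistence_data_path=...) is taken from the step above.

```python
import os
from pathlib import Path

# Hypothetical environment variable; the app itself does not define it.
DATA_PATH_ENV = "WEAVIATE_DATA_PATH"


def resolve_data_path() -> Path:
    """Return the index directory: the override if set, otherwise the default."""
    override = os.environ.get(DATA_PATH_ENV)
    if override:
        return Path(override).expanduser()
    # Assumption: the default .local/share/weaviate/ lives under the home directory.
    return Path.home() / ".local" / "share" / "weaviate"


def connect():
    """Open an embedded Weaviate client on the resolved data path."""
    # Imported lazily so the path logic can be used without the client installed.
    import weaviate

    return weaviate.connect_to_embedded(
        persistence_data_path=str(resolve_data_path())
    )
```

With this in place, the app would call client = connect() at startup and client.close() on shutdown, and a remote machine would only need the environment variable set.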
Note
The app logs user interactions locally to a file named app.log. If you prefer not to collect analytics, simply comment out the relevant function call in the code.
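As an alternative to commenting out code, the logging call could be guarded by a switch. This is a minimal sketch using Python's standard logging module. The names ENABLE_ANALYTICS, setup_logging and log_interaction, the logger name, and the log format are all hypothetical and do not reflect the app's actual code; only the file name app.log comes from the note above.

```python
import logging

# Hypothetical switch: set to False to stop collecting analytics.
ENABLE_ANALYTICS = True

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("search_app")


def setup_logging(path: str = "app.log") -> None:
    """Attach a file handler once, so repeated calls do not duplicate entries."""
    if not logger.handlers:
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)


def log_interaction(
    query: str, search_type: str, enabled: bool = ENABLE_ANALYTICS
) -> bool:
    """Write one log line per search; return whether anything was logged."""
    if not enabled:
        return False
    logger.info("query=%r search_type=%s", query, search_type)
    return True
```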
For the embeddings, we use Jina AI's model jina-embeddings-v2-base-de, a bilingual German/English text embedding model that supports sequences of up to 8,192 tokens.
According to the model card, it is designed for «high performance in mono-lingual & cross-lingual applications and trained … specifically to support mixed German-English input without bias». Technical report here.
Note that we chunk all text on a sentence basis to a maximum of 500 tokens with a 100-token overlap.
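The chunking strategy can be illustrated with a short sketch. It is not the code used in the notebooks: the regex sentence splitter and the whitespace token count are simplifying assumptions (a real pipeline would more likely use a proper sentence splitter and the embedding model's tokenizer), and all function names are hypothetical. Sentences are packed greedily into chunks of at most max_tokens, and each new chunk starts with the trailing sentences of the previous one, up to overlap tokens.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    # It will mis-split abbreviations and ordinals such as "19. Jahrhundert".
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_tokens(sentence: str) -> int:
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(sentence.split())


def chunk_sentences(text: str, max_tokens: int = 500, overlap: int = 100) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []  # sentences of the chunk being built
    current_len = 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry over whole trailing sentences, up to `overlap` tokens.
            carried: list[str] = []
            carried_len = 0
            for prev in reversed(current):
                m = count_tokens(prev)
                if carried_len + m > overlap:
                    break
                carried.insert(0, prev)
                carried_len += m
            # Drop carried sentences if the new sentence would not fit.
            while carried and carried_len + n > max_tokens:
                carried_len -= count_tokens(carried.pop(0))
            current, current_len = carried, carried_len
        # A single sentence longer than max_tokens becomes its own oversize chunk.
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because the overlap is made of whole sentences, the actual overlap can be shorter than the configured limit, but no chunk starts or ends in the middle of a sentence.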
The Staatsarchiv Zürich manages and catalogs the «Zentralen Serien des Kantons Zürich 19. und 20. Jahrhundert», which include important historical documents such as minutes from the Cantonal Council, Government Council resolutions, collections of laws, and the Official Gazette. These records span from 1803 to the present, making them linguistically and thematically diverse.
We (the Staatsarchiv and the Statistical Office) developed an intelligent search application that enhances access to these extensive archives.
For more information, see the following article in the magazine ABI Technik: Mit Künstlicher Intelligenz zu besserer Nutzbarkeit: Die Zentralen Serien des Kantons Zürich (19. und 20. Jahrhundert) neu zugänglich gemacht
This app allows users to search through these extensive archives using both lexical and semantic search methods. Unlike traditional lexical search, which looks for exact keywords, semantic search identifies words, sentences, or paragraphs with similar meanings, even if they don't exactly match the search term. For example, a search for «technology» might return documents containing related concepts like «digitalization», «artificial intelligence», «software development», or «computer science» even if «technology» isn't mentioned directly.
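To illustrate how lexical and semantic results can be combined into one ranking, here is a minimal sketch of weighted score fusion. It is an assumption about the general technique, not a description of what the app or Weaviate computes internally: each score set is min-max normalized and then blended with a weight alpha, where alpha = 1 means purely semantic and alpha = 0 purely lexical. Weaviate exposes a comparable weighting through the alpha parameter of its hybrid queries. Function names are hypothetical.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the range [0, 1] so both result sets are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every hit as a full match.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(
    lexical: dict[str, float], semantic: dict[str, float], alpha: float = 0.5
) -> list[tuple[str, float]]:
    """Blend normalized scores; documents missing from one set score 0 there."""
    lex = min_max_normalize(lexical)
    sem = min_max_normalize(semantic)
    fused = {
        doc: alpha * sem.get(doc, 0.0) + (1 - alpha) * lex.get(doc, 0.0)
        for doc in set(lex) | set(sem)
    }
    # Highest fused score first; ties are broken by document id.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

A document found by only one of the two methods still appears in the fused list, which is why hybrid search can surface matches that a keyword search alone would miss.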
Additionally, semantic search can retrieve documents related to a reference text. For instance, entering a document reference like RRB 1804/1 will return documents with similar themes.
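Retrieval by reference document can be pictured as a nearest-neighbor lookup over embedding vectors. The sketch below is illustrative only: the three-dimensional vectors are invented toy values rather than real embeddings, the ids doc_x and doc_y are placeholders, and the app delegates this lookup to Weaviate rather than computing it in plain Python.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    reference_id: str, vectors: dict[str, list[float]], top_k: int = 3
) -> list[tuple[str, float]]:
    """Rank all other documents by similarity to the reference document."""
    reference = vectors[reference_id]
    scored = [
        (doc_id, cosine_similarity(reference, vec))
        for doc_id, vec in vectors.items()
        if doc_id != reference_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy vectors, invented for illustration only.
toy_vectors = {
    "RRB 1804/1": [1.0, 0.0, 0.0],
    "doc_x": [0.9, 0.1, 0.0],
    "doc_y": [0.0, 1.0, 0.0],
}
ranking = most_similar("RRB 1804/1", toy_vectors)
```

Here doc_x ranks ahead of doc_y because its vector points in nearly the same direction as the reference vector.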
Semantic search relies on statistical methods and machine learning: models are trained on large text corpora to learn which words and sentences are similar, which enables more nuanced document retrieval. Despite these benefits, results can sometimes be incomplete or include irrelevant matches.
- Hybrid search significantly improves search results compared to traditional lexical search, especially for complex or fuzzy queries and large corpora spanning over two centuries.
- The embedding models we tested (and the one we use in the app) are astonishingly agnostic to the historical language used in the documents. Based on our observations, these models can capture the semantic meaning of very old texts as well.
- Weaviate has proven to be a reliable and efficient tool for semantic search. It is easy to use and integrates well with Python.
- The app is inexpensive to run and maintain. It can be deployed on a local machine or a virtual machine with moderate resources. At the moment, we use a VM with 8 CPUs and 32 GB RAM.
Rebekka Plüss (Staatsarchiv) and Patrick Arnecke (Statistisches Amt, Team Data). A big thanks goes to Sarah Murer and Dominik Frefel as well.
We welcome your feedback. Please share your thoughts and let us know how you use the app in your institution. You can email us or contribute by opening an issue or a pull request.
Please note that we use Ruff for linting and code formatting with default settings.
The software in this project is licensed under the MIT License. See the LICENSE file for details.
This software (the Software) incorporates open-source models (the Models) from providers like Huggingface. The app has been developed according to and with the intent to be used under Swiss law. Please be aware that the EU Artificial Intelligence Act (EU AI Act) may, under certain circumstances, be applicable to your use of the Software. You are solely responsible for ensuring that your use of the Software as well as of the underlying Models complies with all applicable local, national and international laws and regulations. By using this Software, you acknowledge and agree (a) that it is your responsibility to assess which laws and regulations, in particular regarding the use of AI technologies, are applicable to your intended use and to comply therewith, and (b) that you will hold us harmless from any action, claims, liability or loss in respect of your use of the Software.
