
LLM prompts need to be updated for modern LLMs #83

@remlaps

LLM prompt calibration issues across current provider/model mix

Since Thoth's inception, the available LLMs and their behaviors have changed substantially. The current curation prompts no longer perform consistently across the models used by the test account.

Problem

The same prompt produces significantly different screening behavior depending on the active provider/model combination:

  • ArliAI (Qwen-3.5-27B-Derestricted, free tier)

    • Extremely slow response times (this is by design for the free tier, but I think it would be tolerable if acceptance rates were high enough)
    • Rejects nearly all posts
    • Currently not practical for Thoth screening
  • Google (Gemma-4-31b-it, free/paid)

    • More permissive than desired
    • Allows too many marginal posts through screening
  • Google (gemini-3.1-flash-lite-preview, free/paid)

    • Rejects nearly all posts
    • Currently unsuitable for Thoth screening

Taken together, these issues leave very little multi-model redundancy to fall back on when an API or backend has problems.

Goals

Improve prompt reliability and evaluation consistency across supported models.

Proposed work

  1. Rework the screening prompt to better normalize behavior across models

    • Increase strictness for Gemma-4-31b-it
    • Increase tolerance/permissiveness for the more rejection-prone models
    • Attempt to maintain a single shared prompt if feasible
  2. If a shared prompt is not viable, reimplement model-specific prompt templates

    • Automatically switch templates during provider/model failover within the same run
    • Treat this as a fallback approach rather than the preferred architecture
    • This could constrain the operator's ability to deploy their own custom prompt.
  3. Benchmark all three models using free-tier access

    • Measure against these targets:

      • A complete Thoth run (curating 5 posts) must finish within 2-4 hours, regardless of the date/time of the blockchain content.
      • The rejection rate for every supported model must be low enough to make this possible.
      • If possible, Thoth should be able to complete 2 free-tier runs per day (5 posts each) using either Google or ArliAI as the LLM provider.
      • Posts that are accepted and published by the Thoth test account (thoth.test) are all, or nearly all, of reasonable quality for curation.
  4. Validate API parameters

    • The current API parameters (temperature, repetition_penalty, frequency_penalty, etc.) were established for a model that is no longer used.
    • Evaluate whether any of them need to be changed.
  5. Add a standalone evaluation/testing script

    • Allow operators to test a specific author or post directly
    • Return prompt evaluation results immediately
    • Eliminate the need for Thoth to search the blockchain for candidate posts just to test the prompts.
    • FYI: It is currently possible to test the prompts on individual posts with custom skills in Brave Leo.
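
To make the fallback in item 2 more concrete, here is a minimal sketch of per-model template selection. Everything in it is hypothetical: `SHARED_PROMPT`, `MODEL_PROMPTS`, `select_prompt`, `build_prompt`, the `{post_body}` placeholder, and the prompt text are illustrations, not names or content from the Thoth codebase. It also assumes one possible answer to the operator concern raised in item 2: an operator-supplied prompt always takes precedence over a model-specific template.

```python
# Hypothetical sketch: none of these names or prompt texts come from Thoth.

# Shared prompt used whenever no model-specific template exists.
SHARED_PROMPT = "Evaluate the following post for curation:\n{post_body}"

# Optional overrides keyed by (provider, model), both lowercased.
MODEL_PROMPTS = {
    ("google", "gemma-4-31b-it"): (
        "Apply strict standards. Evaluate the following post for curation:\n"
        "{post_body}"
    ),
}


def select_prompt(provider, model, operator_prompt=None):
    """Return the template for the active provider/model.

    Precedence (an assumption, not settled design):
    operator prompt, then model-specific template, then shared prompt.
    """
    if operator_prompt is not None:
        return operator_prompt
    key = (provider.lower(), model.lower())
    return MODEL_PROMPTS.get(key, SHARED_PROMPT)


def build_prompt(provider, model, post_body, operator_prompt=None):
    """Fill the selected template with the post text.

    str.replace is used instead of str.format so that braces elsewhere
    in a custom template do not raise an error.
    """
    template = select_prompt(provider, model, operator_prompt)
    return template.replace("{post_body}", post_body)
```

Calling `build_prompt` again with the new provider/model after a failover would switch templates within the same run, and a model without an override simply gets the shared prompt.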

Possible future enhancements

  • Build a regression suite using known “accept” and “reject” examples
  • Update curation statistics at the end of the run to include LLM rejection rates and save them in CSV format to enable comparison over time.
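
A rough sketch of what the rejection-rate statistics could look like, assuming each screening decision is available as a `(provider, model, accepted)` tuple. The function names, the CSV column order, and that tuple format are all hypothetical and not taken from Thoth. The last helper ties the rejection rate back to the run-time target in item 3, under the simplifying assumption that every candidate post is rejected independently with the same probability.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone


def rejection_rates(decisions):
    """Map (provider, model) to (total, rejected, rejection_rate).

    decisions: iterable of (provider, model, accepted) tuples (hypothetical format).
    """
    totals, rejects = Counter(), Counter()
    for provider, model, accepted in decisions:
        totals[(provider, model)] += 1
        if not accepted:
            rejects[(provider, model)] += 1
    return {k: (totals[k], rejects[k], rejects[k] / totals[k]) for k in totals}


def append_stats_csv(stream, decisions, run_time=None):
    """Write one CSV row per provider/model to an open text stream."""
    run_time = run_time or datetime.now(timezone.utc)
    writer = csv.writer(stream)
    for (provider, model), (total, rejected, rate) in sorted(
        rejection_rates(decisions).items()
    ):
        writer.writerow(
            [run_time.isoformat(), provider, model, total, rejected, f"{rate:.3f}"]
        )


def expected_screenings(posts_needed, rejection_rate):
    """Expected LLM screenings needed to accept posts_needed posts.

    Assumes independent rejections at a constant rate (a simplification).
    """
    if rejection_rate >= 1:
        return float("inf")
    return posts_needed / (1 - rejection_rate)
```

For example, `expected_screenings(5, 0.5)` returns `10.0`: at a 50% rejection rate, a 5-post run needs about 10 screenings on average, so per-request latency multiplied by that count gives a first estimate of whether a model fits the 2-4 hour window.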
