
WIP: Update outlines (and outlines prompts) and transformers versions#256

Draft
dmjoy wants to merge 19 commits into main from dev/update-outlines-and-transformers

Conversation

Contributor

@dmjoy dmjoy commented Nov 6, 2025

@eveenhuis @jadie1 @barry-ravichandran FYSA

This is a WIP, but wanted to draw attention to some of the changes that need to be made for the latest outlines version. Specifically:

  • The outlines.prompt decorator has been removed; the guidance is now to use the outlines.Template.from_string function.
    • I copied this decorator in from the older outlines version for near-term "compatibility" rather than require changes to all of the old templates (switching to outlines.Template.from_string would also mean changing how these templates are invoked wherever they're used, and that's scattered across several algorithms; not worth it). Future templates should strongly consider the new approach, though.
  • The notion of a sampler has been removed from outlines; inference control parameters are now passed straight through to whatever backend is loading / serving the model (so for deterministic sampling with transformers as the model provider, we would set do_sample=False in the generation_kwargs)
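For reference, a dependency-free sketch of what the copied-in compatibility decorator does conceptually (illustrative only: the real outlines.prompt rendered the docstring as a Jinja2 template through outlines' machinery, and scenario_prompt below is a made-up example, not one of our templates):

```python
import inspect

def prompt(fn):
    """Minimal stand-in for the removed outlines.prompt decorator:
    treat the function's docstring as a template and render it with
    the call arguments (plain str.format here instead of Jinja2)."""
    template = inspect.cleandoc(fn.__doc__ or "")
    sig = inspect.signature(fn)

    def render(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        bound.apply_defaults()
        return template.format(**bound.arguments)

    return render

@prompt
def scenario_prompt(scenario):
    """Given the scenario: {scenario}
    Choose the best action."""
```

The new-style equivalent builds the template explicitly with outlines.Template.from_string(template_text) and calls the resulting object where needed, which is why switching over touches every call site.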

@dmjoy dmjoy force-pushed the dev/update-outlines-and-transformers branch from bf92614 to 542d191 on November 6, 2025 21:21
@dmjoy
Contributor Author

dmjoy commented Nov 7, 2025

Alright, so functionally this all seems to work, but when running the integration tests there are several differences (mostly that makes sense, since enough pieces have changed here that I wouldn't expect the exact same output). However, there seems to be some non-determinism even with do_sample=False, and for the comparative regression ADMs I'm getting negative regression values in some cases, which I don't think we've seen before. Digging into the negative regression values, it seems like maybe this was never enforced with JSON schemas?
dottxt-ai/outlines#215
dottxt-ai/outlines-core#147

I need to spend some more time with this to understand what's really going on before feeling confident in merging. Anecdotally it does seem a bit faster now, but if it's not deterministic and doesn't enforce the entire schema, what's the point?

@barry-ravichandran
Contributor

@dmjoy Tagging this package update PR here in case it's ok with you to merge it into your PR to fix some issues with installing via poetry: #250

@dmjoy dmjoy force-pushed the dev/update-outlines-and-transformers branch from 6ffb3cf to e76e5a5 on January 29, 2026 19:59
@alright-code

The problem causing the non-determinism and invalid regression values was improper handling of vocabulary creation on the outlines side: dottxt-ai/outlines#1831. This was merged into outlines main recently, but we are blocked from using it directly due to vllm dependencies. I've implemented a monkey patch fix we can use for the time being.

There was a second bug in outlines here: dottxt-ai/outlines#1817. Same as above, we are blocked from the official fix because of vllm.
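For readers unfamiliar with the approach, the general shape of a monkey patch is just rebinding a module attribute to a corrected function before anything uses it. A self-contained illustration (the module and function names below are placeholders, not outlines internals; the real fix tracks dottxt-ai/outlines#1831):

```python
import types

# Fake module standing in for the library to be patched, so this
# sketch runs on its own; in practice you would import the real module.
buggy_mod = types.ModuleType("buggy_mod")

def _broken_build_vocabulary(tokens):
    # Stand-in for the buggy upstream behavior: loses first-seen order.
    return sorted(set(tokens))

buggy_mod.build_vocabulary = _broken_build_vocabulary

def _fixed_build_vocabulary(tokens):
    # Corrected behavior: de-duplicate while preserving token order.
    return list(dict.fromkeys(tokens))

# The monkey patch: rebind the attribute before any caller grabs it.
buggy_mod.build_vocabulary = _fixed_build_vocabulary
```

The main caveat is ordering: the patch has to run before any code captures a reference to the original function (e.g. via `from module import name`).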

Contributor Author

@dmjoy dmjoy left a comment


@alright-code I think these changes on the whole are fine; I just left a few comments / change requests. I believe the integration test differences are within reason, but before merging I will want to check out this branch and run one or two tests with the latest data and configs to make sure the scores we're getting back are still good.

from outlines_core import Vocabulary


# monkey patch to fix https://github.com/dottxt-ai/outlines/pull/1831
Contributor Author


If we know which version of outlines should have this fix (probably one version higher than the latest, assuming they haven't cut a release with the fix merged yet), we should note it here to make it easier on our future selves.
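One way to make that self-documenting is to guard the patch on the installed version. A sketch (the "1.2.0" cutoff below is a placeholder assumption, to be replaced with whichever outlines release actually ships the fix):

```python
def needs_vocab_patch(installed: str, fixed_in: str = "1.2.0") -> bool:
    """Return True if the installed outlines version still needs the
    vocabulary monkey patch, i.e. it predates the upstream fix.

    Naive X.Y.Z comparison; for real use, packaging.version is safer.
    """
    def parse(v):
        return tuple(int(part) for part in v.split("."))
    return parse(installed) < parse(fixed_in)
```

Called with e.g. importlib.metadata.version("outlines"), this lets the patch disable itself automatically once the fixed release is installed.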

encoded_dialog = tokenizer.apply_chat_template(dialog)

return tokenizer.decode(encoded_dialog)
# class SpectrumTunedInferenceEngine(OutlinesTransformersInferenceEngine):
Contributor Author


This inference engine is actually needed / used for our recent evaluation experiments, so we should leave it in (not commented out), though maybe it needs to be updated for the outlines changes? (FWIW, I'm not a fan of how it's implemented here, since it essentially just changes how prompts are computed from the dialog, which to me doesn't really justify an entirely new inference engine.)

bert-score = "^0.3.13"
rich = "^13.6.0"
rouge-score = "^0.1.2"
swagger-client = {git = "https://github.com/NextCenturyCorporation/itm-evaluation-client.git", rev = "0.5.2"}
Contributor Author


I think it's important to keep pinning the swagger-client version to 0.5.2; I'm not sure whether that's still pinned with the uv specification now.

'kdma_values': [{'kdma': 'Moral judgement', 'value': 0.2}]}
[bold]*CHANGED SCENE TO*: P1[/bold]
PyTorch version 2.3.1+cu118 available.
HTTP Request: HEAD https://huggingface.co/roberta-large/resolve/main/config.json "HTTP/1.1 200 OK"
Contributor Author


There's already some mechanism(s) for filtering out lines during the integration test diff. We probably want to exclude these "HTTP Request" lines too, since they can be non-deterministic.
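As a sketch of the kind of filter that could be added (the pattern list and function name are assumptions, not the project's existing mechanism):

```python
import re

# Lines matching any of these patterns are dropped before diffing;
# both examples come from the integration test output quoted above.
IGNORED_LINE_PATTERNS = [
    re.compile(r"^HTTP Request: "),
    re.compile(r"^PyTorch version "),
]

def filter_nondeterministic_lines(lines):
    """Drop known non-deterministic lines from integration test output."""
    return [
        line for line in lines
        if not any(p.match(line) for p in IGNORED_LINE_PATTERNS)
    ]
```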

"Moral judgement": 0.55
}
}
},
Contributor Author


Now I see your point about these timing lines and whether there should be an option to disable them. If there's no easy way to filter them out of the diffs (with the integration test script), then yes, it's probably a good idea to add an option in the pipeline_adm.py code for reporting these timing stats (on by default, but we can disable it for the integration test runs).

Contributor Author


This file looks like it's been hit by some auto-linting (and it's a similar story with one or two of the other files). This particular file was provided by a sub, and I would rather not change it at all, in case they deliver some updates to us, so that the diff stays much cleaner. We could either copy the version of this file from main and overwrite this version, or revert the particular commit that made these changes.

In general, I would prefer not to auto-lint existing files, since it tends to blow up the diffs with superficial changes.
