Added a proxy for model swapping #1645
Conversation
Hmm I think
Would a separate file play nice with the PyInstaller builds?
If it communicates solely through the API, I don't see why not.
This should be doable cleanly as an entirely separate program. However, SSE streaming will be more challenging.
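To illustrate why streaming is the harder part: the proxy cannot simply buffer and re-serialize a JSON body, it has to pass each SSE chunk through as it arrives. A minimal sketch of that forwarding with the `requests` library follows; the upstream address is a placeholder and the `/v1/chat/completions` path is assumed to be the OpenAI-compatible endpoint.

```python
# Rough sketch (not from this PR): forwarding an SSE stream through a proxy.
import requests

UPSTREAM = "http://localhost:5001"  # assumed KoboldCpp address

def stream_chat_completion(payload: dict):
    """Forward a streaming chat completion and yield SSE chunks as they arrive.

    Unlike a normal JSON response, the proxy cannot buffer the whole body;
    it has to pass each `data: ...` line through to the client immediately.
    """
    with requests.post(
        f"{UPSTREAM}/v1/chat/completions",
        json={**payload, "stream": True},
        stream=True,
        timeout=600,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if line:  # skip keep-alive blank lines
                yield line + "\n\n"
```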
Either way it should be the regular KoboldCpp setting it up. I don't want the mess of users having to start separate things for this feature. However we do it, it should be something the main KoboldCpp launcher / CLI starts when admin mode is in use. Personally, I do think integrating it into koboldcpp.py makes sense.
I've updated this to use the model name. I also implemented support for listing the available models instead of only the currently active one. This was enough to get it working in Open WebUI. I've updated the original comment with a checklist.
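For reference, Open WebUI discovers models via the OpenAI-style `GET /v1/models` endpoint, so the proxy only needs to return something shaped like the sketch below. The field names follow the OpenAI spec; the model ids here are made-up placeholders, not names from this PR.

```python
# Minimal sketch of the OpenAI-compatible model list a client like Open WebUI
# expects from GET /v1/models. Model ids are placeholders.
import json, time

def list_models(available: list[str]) -> str:
    return json.dumps({
        "object": "list",
        "data": [
            {
                "id": name,                # e.g. the config/model name to swap to
                "object": "model",
                "created": int(time.time()),
                "owned_by": "koboldcpp",
            }
            for name in available
        ],
    })

print(list_models(["llama-3-8b", "mistral-7b"]))
```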
The idea is that it would show the configs the admin API already shows. You get multimodal support for free since it accepts .kcpps files.
I could be missing something, but I don't think config files give us multimodal support for free. If the goal is to keep at most one model loaded for each modality, having a config file either comes with a drawback (implementation 1) or requires splitting configs for each modality (implementation 2). There are two ways I can think of to implement multimodality:
I've just tried to implement it this way, but I ran into a problem: the API responds before the server has switched over. If I try to connect to the server right after it responds, the old server is still active, so it errors out when the connection gets closed. I added a sleep to wait until the old server closes, but this method feels unreliable.
For this particular feature you could poll the model endpoint until it matches the model you want (or until you get "no model loaded" in case of an error).
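Something along these lines, as a sketch rather than code from this PR; the `/api/v1/model` path and the shape of its response are assumptions based on KoboldCpp's model info endpoint and should be double-checked.

```python
# Sketch of the suggested polling approach: after asking for a swap, poll the
# model endpoint until the reported model matches, or bail out on a
# "no model" state / timeout. Endpoint path and response shape are assumptions.
import time
import requests

def wait_for_model(base_url: str, expected: str, timeout: float = 120.0) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            resp = requests.get(f"{base_url}/api/v1/model", timeout=5)
            current = resp.json().get("result", "")
            if expected in current:
                return True            # new model is up
            if "no model" in current.lower():
                return False           # swap failed / nothing loaded
        except requests.RequestException:
            pass                       # old server may still be shutting down
        time.sleep(1.0)
    return False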
Based on what @pqnet suggested, I swapped over to the admin API.
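For context, the admin API flow looks roughly like this. This is a sketch only: the `/api/admin/list_options` and `/api/admin/reload_config` endpoint names and payloads are assumptions about KoboldCpp's admin mode and should be verified against the actual implementation.

```python
# Sketch of swapping via the admin API: list the available .kcpps configs,
# then ask the server to reload with the one matching the requested model.
# Endpoint names and payloads are assumptions, not confirmed by this PR.
import requests

BASE = "http://localhost:5001"   # assumed KoboldCpp address

def swap_to(config_name: str) -> None:
    options = requests.get(f"{BASE}/api/admin/list_options", timeout=10).json()
    if config_name not in options:
        raise ValueError(f"unknown config: {config_name}")
    requests.post(
        f"{BASE}/api/admin/reload_config",
        json={"filename": config_name},
        timeout=10,
    ).raise_for_status()
    # After this the caller still has to wait for the new server to come up,
    # e.g. by polling the model endpoint as suggested above.
```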
#3 would be to process the chat completion's history of messages according to modality, and feed them to text/vision/audio/other-modality-capable mapped models accordingly, translating everything to the nearest "dumbed-down" equivalent that the main model can support. I.e., receive an image, but the main LLM only handles text? Call an image-to-text model, replace the image with its description, cache the prompt, and feed it to the LLM.

HOWEVER, #1 through #3 are HACKS. True multimodality (#4) means that the same model can look at the image, and the audio, and video, and text, and whatever else, load it ALL into the same high-dimensional "mind", and "figure out what it all means" before rendering it back down into an answer. Converting it all to text and feeding it into a text model is very different. But #3 in particular would be a reasonable HACK and, to my mind, the only one that's API-correct if we're aiming to be multimodal and emulate models like GPT-4o accessed via OpenAI APIs.

What a similar approach like #1, or llama-swap (which is really #1 as a proxy using #2), does for you, though, is let you configure one server endpoint that LISTS multiple models and can serve them, and then the CLIENT can switch between them, IFF it has separate configuration for each modality. That's just the client and server supporting multiple models and modalities separately, though, not "multimodality".
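Purely as an illustration of what that #3-style translation could look like: walk the OpenAI-style message history and replace parts the main model cannot handle with a text rendering produced by a modality-specific model. The `caption_image` helper is hypothetical; nothing like this is implemented in this PR.

```python
# Illustrative sketch of option #3: replace multimodal content parts with a
# "dumbed-down" text equivalent before handing the history to a text-only LLM.
# `caption_image` is a hypothetical image-to-text helper.
def dumb_down_messages(messages: list[dict], caption_image) -> list[dict]:
    out = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):          # OpenAI-style multimodal parts
            parts = []
            for part in content:
                if part.get("type") == "image_url":
                    url = part["image_url"]["url"]
                    parts.append(f"[image: {caption_image(url)}]")
                elif part.get("type") == "text":
                    parts.append(part["text"])
            msg = {**msg, "content": "\n".join(parts)}
        out.append(msg)
    return out
```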
This is a very rough version of a proxy for kobold so that it can swap models for each request.
Only text models are supported but that'll be fixed as well. First step towards #1623.
This can currently be used with things like Open WebUI to chat with multiple models.
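A very rough illustration of the per-request flow described above, not the PR's actual code: look at the `model` field of an incoming chat completion request, swap the backend if it differs from what is currently loaded, then forward the request. The upstream address is a placeholder, and the swap step is left as a comment since it would use the admin API and polling sketched earlier in the thread.

```python
# Rough illustration of the proxy's per-request flow (not the PR's code).
import requests

UPSTREAM = "http://localhost:5001"     # assumed KoboldCpp address
_current_model = None                  # model the backend currently has loaded

def handle_chat_completion(payload: dict) -> dict:
    """Swap the backend if the requested model differs, then forward."""
    global _current_model
    requested = payload.get("model")
    if requested and requested != _current_model:
        # Swap the backend here (e.g. via the admin API) and wait for the new
        # model to come up (e.g. by polling the model endpoint), as sketched
        # in the comments above.
        _current_model = requested
    resp = requests.post(
        f"{UPSTREAM}/v1/chat/completions", json=payload, timeout=600
    )
    resp.raise_for_status()
    return resp.json()
```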