
Conversation


@satp42 satp42 commented Oct 24, 2025

Added new settings to control embedding performance in packages/types/src/config.types.ts. Specifically:

  1. embedding_batch_size (number, default: 64)
  2. embedding_max_threads (number, default: 4)
  3. embedding_max_connections (number, default: 8)

Modified packages/backend-server/src/main.rs to accept additional command-line arguments for batch size, max threads, and max connections (sketched below).
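
A minimal sketch of the new flags, assuming clap's derive API; the exact flag names and the use of clap (rather than hand-rolled std::env::args parsing) are assumptions:

```rust
use clap::Parser;

/// Backend server options related to embedding performance.
#[derive(Parser, Debug)]
struct Args {
    /// Number of texts embedded per model call (previously hardcoded to 1).
    #[arg(long, default_value_t = 64)]
    embedding_batch_size: usize,

    /// Upper bound on worker threads for embedding work.
    #[arg(long, default_value_t = 4)]
    embedding_max_threads: usize,

    /// Upper bound on concurrent client connections.
    #[arg(long, default_value_t = 8)]
    embedding_max_connections: usize,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}
```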

Updated packages/backend-server/src/server/mod.rs:

  • Added max_connections field to LocalAIServer
  • Implemented a semaphore-style connection counter to limit concurrent client connections (previously, a thread was spawned per connection with no upper bound)
  • Configured the rayon global thread pool via rayon::ThreadPoolBuilder before starting the server (both changes are sketched after this list)
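
A minimal sketch of both server-side limits, assuming a blocking accept loop; handle_client and the bind address are illustrative placeholders:

```rust
use std::net::TcpListener;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

fn run_server(max_threads: usize, max_connections: usize) -> std::io::Result<()> {
    // Cap rayon's global pool before any parallel work starts.
    rayon::ThreadPoolBuilder::new()
        .num_threads(max_threads)
        .build_global()
        .expect("rayon global pool already initialized");

    let listener = TcpListener::bind("127.0.0.1:8080")?; // illustrative address
    let active = Arc::new(AtomicUsize::new(0));

    for stream in listener.incoming() {
        let stream = stream?;
        // Atomically reserve a slot; refuse the connection when the cap is
        // reached instead of spawning an unbounded number of threads.
        if active.fetch_add(1, Ordering::SeqCst) >= max_connections {
            active.fetch_sub(1, Ordering::SeqCst);
            drop(stream); // or respond with a "busy" error
            continue;
        }
        let active = Arc::clone(&active);
        std::thread::spawn(move || {
            // handle_client(stream); // hypothetical per-connection handler
            let _ = stream;
            active.fetch_sub(1, Ordering::SeqCst);
        });
    }
    Ok(())
}
```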

Modified packages/backend-server/src/embeddings/model.rs:

  • Added batch_size field to EmbeddingModel struct
  • Replaced the hardcoded batch size Some(1) at line 71 with the configurable self.batch_size (sketched below)
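
A minimal sketch of the model change, assuming fastembed-rs's TextEmbedding::embed(texts, batch_size) signature; the wrapper method and the field names other than batch_size are illustrative:

```rust
use fastembed::TextEmbedding;

pub struct EmbeddingModel {
    model: TextEmbedding,
    /// New: configurable batch size (was hardcoded as Some(1)).
    batch_size: usize,
}

impl EmbeddingModel {
    pub fn embed(&self, texts: Vec<String>) -> anyhow::Result<Vec<Vec<f32>>> {
        // Before: self.model.embed(texts, Some(1))
        Ok(self.model.embed(texts, Some(self.batch_size))?)
    }
}
```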

Passed Configuration from Electron Main Process

Implemented Lazy Embeddings for Large Document Types

  • Extended lazy embeddings logic to include ResourceTextContentType::PDF, ResourceTextContentType::Document, and ResourceTextContentType::Article
  • These document types will get a generateLazyEmbeddings tag instead of immediate embedding generation
  • Embeddings are then generated on demand when these documents are accessed in chat/search (see the sketch below)
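
A minimal sketch of the extended gating; the PDF, Document, and Article variants come from this PR, while should_defer_embeddings and the remaining variants are illustrative:

```rust
enum ResourceTextContentType {
    Note,
    PDF,
    Document,
    Article,
    Plain,
}

/// Resources that get the generateLazyEmbeddings tag instead of
/// immediate embedding generation.
fn should_defer_embeddings(kind: &ResourceTextContentType) -> bool {
    matches!(
        kind,
        ResourceTextContentType::Note
            | ResourceTextContentType::PDF
            | ResourceTextContentType::Document
            | ResourceTextContentType::Article
    )
}
```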

Optimized Chunking Strategy

  • Increased max_chunk_size from 2000 to 2500 characters (reduces total chunks by ~20% while maintaining quality)
  • Kept overlap_sentences at 1 for continuity
  • This reduces the number of embeddings needed per document (see the sketch below)
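
A minimal sketch of the parameters; the struct and field names are illustrative, only the values come from this PR. For a 10,000-character document, ceil(10000 / 2000) = 5 chunks becomes ceil(10000 / 2500) = 4, i.e. 20% fewer:

```rust
struct ChunkingConfig {
    /// Raised from 2000 to 2500 characters (~20% fewer chunks).
    max_chunk_size: usize,
    /// Unchanged: one sentence of overlap between adjacent chunks.
    overlap_sentences: usize,
}

impl Default for ChunkingConfig {
    fn default() -> Self {
        Self {
            max_chunk_size: 2500,
            overlap_sentences: 1,
        }
    }
}
```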

The expected impact of this PR:

  • Batch size increase (1 → 64): reduces CPU overhead through better model utilization
  • Thread pool limits: Prevents CPU saturation, keeps usage under control
  • Connection limits: Prevents thread explosion during bulk uploads
  • Lazy embeddings for large docs: Defers expensive operations until needed
  • Larger chunks (2000 → 2500): fewer embeddings to generate and store

Related to #28

@satp42 satp42 marked this pull request as draft October 24, 2025 22:10
@satp42 satp42 marked this pull request as ready for review October 24, 2025 22:11
@satp42 satp42 marked this pull request as draft October 24, 2025 22:12
@satp42 satp42 marked this pull request as ready for review October 24, 2025 22:13
@aavshr aavshr self-requested a review October 31, 2025 12:33

@aavshr aavshr left a comment


Thanks for the PR, @satp42.

This pull request is doing several things at once, and without any measurements it's hard to gauge whether we're solving the actual problem.

On config

userConfig is meant to be configured by the user, but exposing batch size, thread, and connection settings to the average user isn't useful, especially when the user doesn't know the internals of how the embeddings work.

Lazy embeddings

Lazy embeddings are mainly for write-heavy resources (right now, only notes). They could also make sense for other resources, but they lead to a case where someone with a notebook in the context has to wait a long time for embedding before their question can be answered.

Embeddings batch and chunk size

We could perhaps be even more aggressive on the batch size depending on the user's machine. This should be dynamic with a sane default (32 seems safe); one possible approach is sketched below.
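
A minimal sketch of deriving the batch size from the machine; the scaling rule and clamp bounds are assumptions, only the default of 32 comes from this comment:

```rust
use std::thread;

fn dynamic_batch_size() -> usize {
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(4); // conservative fallback when detection fails
    // Scale with core count, never dropping below the sane default of 32.
    (cores * 8).clamp(32, 128)
}
```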

On the chunk size, it could even go lower: embedding models scale with token length, so longer chunks need more compute, and smaller chunks actually parallelize better.

Rayon & connections limit

We should not use rayon's global builder: the embedding models use fastembed-rs, which itself uses rayon underneath. The unlimited thread spawning does seem like the culprit for the CPU saturation, but it could be solved with a simple thread pool, or by moving to an async runtime (probably better for the long term; one possible shape is sketched below).
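
A minimal sketch of the async-runtime option, using tokio; this is one possible shape, not an agreed design, and handle_client is a hypothetical handler:

```rust
use std::sync::Arc;
use tokio::net::TcpListener;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:8080").await?; // illustrative address
    // Bound concurrent connections without touching rayon's global pool,
    // leaving fastembed-rs free to manage its own parallelism.
    let permits = Arc::new(Semaphore::new(8));

    loop {
        let (stream, _) = listener.accept().await?;
        let permit = permits
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore closed");
        tokio::spawn(async move {
            let _permit = permit; // slot is released when the task finishes
            // handle_client(stream).await; // hypothetical per-connection handler
            let _ = stream;
        });
    }
}
```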

The default values are also mismatched: 8 max connections contending for 4 max threads is not ideal.

The code also doesn't compile (culprit: https://github.com/deta/surf/pull/43/files#diff-1e2c482dbe66cf699a1c8731d573227090fb956ca6259f6797e27b551d410d24R156).


We need to do some actual measurements of the root cause of the CPU saturation before making all these changes.

Can you please split the PR into just the embeddings batch size related change?

For the other changes, we should discuss on this issue to get to the root cause and then settle on the right approach.

