Added Code Understanding sub agent by tmihalac · Pull Request #231 · RHEcosystemAppEng/vulnerability-analysis

tmihalac · 2026-05-06T10:27:03Z

Improve the general agent's code understanding capabilities.
Identify code understanding questions in order to route/dispatch them to this sub-agent.
Focus on collecting relevant context for code understanding questions.
Use the Documentation tool, with the embedding, as one of its tools.

Fixes https://redhat.atlassian.net/browse/APPENG-4532

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

tmihalac · 2026-05-06T11:21:08Z

/test vulnerability-analysis-on-pr

tmihalac · 2026-05-06T11:55:14Z

/test-heavy

zvigrinberg · 2026-05-06T13:54:06Z

/test vulnerability-analysis-on-pr

zvigrinberg · 2026-05-06T14:27:23Z

/test vulnerability-analysis-on-pr

tmihalac · 2026-05-07T08:46:55Z

/test vulnerability-analysis-on-pr

tmihalac · 2026-05-07T09:44:26Z

/test-heavy

tmihalac · 2026-05-07T15:41:27Z

/test-heavy

tmihalac · 2026-05-10T06:44:00Z

/test-heavy

tmihalac · 2026-05-10T09:06:45Z

/test-heavy

tmihalac · 2026-05-10T11:54:24Z

/test-heavy

tmihalac · 2026-05-10T18:13:31Z

/test-heavy

tmihalac · 2026-05-11T10:17:12Z

/test-heavy

RedTanny · 2026-05-11T08:22:53Z


 logger = LoggingFactory.get_agent_logger(__name__)

+_MAX_RHSA_CANDIDATES = 20


@tmihalac Hi Theo, can you explain the edge case that requires a limit on _MAX_RHSA_CANDIDATES, and why you decided to limit it to only 20?

There was a case with ~1200 candidates, most of which were not even libraries but images, so the prompt was ~32k. Claude suggested 20.
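To make the effect of the cap concrete, here is a minimal sketch of truncating a candidate list before it is formatted into a prompt. The diff does not show how the 20 candidates are chosen, so plain head truncation is an assumption here, and `cap_candidates` is a hypothetical helper, not a function from the PR.

```python
_MAX_RHSA_CANDIDATES = 20  # value taken from the diff above


def cap_candidates(
    candidates: list[str], limit: int = _MAX_RHSA_CANDIDATES
) -> tuple[list[str], int]:
    """Return at most `limit` candidates and the number that were dropped.

    Assumption: the first `limit` entries are kept as-is; the real code may
    rank or filter candidates (e.g. drop image entries) before truncating.
    """
    kept = candidates[:limit]
    return kept, len(candidates) - len(kept)


# Usage: roughly the situation described in the reply (~1200 candidates).
kept, dropped = cap_candidates([f"candidate-{i}" for i in range(1200)])
```

Returning the dropped count makes it easy to log how much was cut, which helps when judging whether 20 is too aggressive for a given case.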

tmihalac · 2026-05-11T14:11:36Z

/test-heavy

tmihalac · 2026-05-11T14:14:16Z

/test vulnerability-analysis-on-pr

zvigrinberg

@tmihalac Very good job!
Please see my comments; most of them are minor.

zvigrinberg · 2026-05-11T08:51:37Z

+
+Question: {question}
+
+Examples:


@tmihalac Maybe we can add another code_understanding example that is not related to XML or version checking, so that the examples will not be biased toward XML or version-only questions?

For example:
Does the code sanitize input against both path traversal characters AND command injection metacharacters (; | & $ \ \n)?

Does the application decode URL-encoded input before validating path components?

Are there any input validation or sanitization mechanisms in place to prevent malicious Markdown input from reaching the paragraph function?

Can a single request force the application server to load unbounded data into memory?

Can malformed input cause an unhandled exception that crashes or restarts the process?

zvigrinberg · 2026-05-11T11:56:31Z

@@ -1,28 +0,0 @@
-import pytest


@tmihalac Why did you delete this test?

The _has_c_cpp_sources method was used by TransitiveCodeSearcher to detect whether a repo had C/C++ source files. It was removed during the refactoring; C/C++ detection now relies on the ecosystem being set from the pipeline input (image.ecosystem), not on scanning repo files.

The ecosystem detection logic moved to detect_ecosystem() in dep_tree.py.

I will add an updated test file.

zvigrinberg · 2026-05-11T12:07:12Z

+        "Finds all imports and usage patterns of a package/module across sources. "
+        "Reports which files import it, usage count, and how it's used. "
+        "Searches all sources including framework dependencies."


Minor: when the code index is empty, the tool returns "No source code indexed", which leaves the LLM to infer on its own that the tool shouldn't be invoked and/or that the result is of no use.

zvigrinberg · 2026-05-11T12:24:31Z

+
+# ---- Prompt Templates ---- #
+
 AGENT_SYS_PROMPT = (


@tmihalac I would rename it to REACHABILITY_AGENT_SYS_PROMPT across all usages, and also add a comment before this set of prompts stating that they are for the reachability sub-agent.

zvigrinberg · 2026-05-11T12:26:56Z

-OBSERVATION_NODE_PROMPT = COMPREHENSION_PROMPT
-
-### --- End of REACT Prompt Templates ----#
 def build_system_prompt(


@tmihalac I would rename + refactor this to build_reachability_system_prompt, as right now it's confusing.

zvigrinberg · 2026-05-11T12:30:26Z

+                try:
+                    with open(full_path, "r", errors="ignore") as f:
+                        content = f.read()
+                    if len(content) < 500_000:


This is a reasonable assumption, but what happens if the config file is bigger than that? It is just ignored. Maybe it's worthwhile to log a warning stating that the config file was ignored because of its size?

zvigrinberg · 2026-05-11T12:31:22Z

+logger = LoggingFactory.get_agent_logger(__name__)
+
+
+def format_context_snippet(lines: list[str], match_line: int, context_lines: int) -> str:


zvigrinberg · 2026-05-11T13:42:22Z

+async def configuration_scanner(config: ConfigurationScannerToolConfig, builder: Builder):
+    from vuln_analysis.runtime_context import ctx_state, cu_source_scope
+
+    _config_files_cache: dict[str, list[tuple[str, str]]] = {}


Potential race condition: write operations on the dict are not atomic, and a race is possible when two different CVEs with the same repo are processed concurrently.

You should use a lock on the repo_path/repo_key, similar to what we do in https://github.com/tmihalac/vulnerability-analysis/blob/cbbbe6bc24b7214476e2b32b66d5176a629b0654/src/exploit_iq_commons/utils/document_embedding.py#L266-L280
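A minimal sketch of per-key locking for a cache like this, assuming thread-based concurrency and `threading.Lock`. The linked `document_embedding.py` code is not reproduced here, so this is not a copy of it; `get_config_files`, `_lock_for`, and the `collect` callback are hypothetical names.

```python
import threading
from typing import Callable

ConfigFiles = list[tuple[str, str]]

_config_files_cache: dict[str, ConfigFiles] = {}
_key_locks: dict[str, threading.Lock] = {}
_key_locks_guard = threading.Lock()  # protects _key_locks itself


def _lock_for(repo_key: str) -> threading.Lock:
    """Return the lock dedicated to one repo key, creating it if needed."""
    with _key_locks_guard:
        return _key_locks.setdefault(repo_key, threading.Lock())


def get_config_files(repo_key: str, collect: Callable[[str], ConfigFiles]) -> ConfigFiles:
    """Collect config files once per repo key, even under concurrent callers."""
    with _lock_for(repo_key):
        # Check and write happen under the same lock, so two callers with the
        # same key cannot both miss the cache and both run `collect`.
        if repo_key not in _config_files_cache:
            _config_files_cache[repo_key] = collect(repo_key)
        return _config_files_cache[repo_key]
```

Locking per key rather than globally lets scans of different repos proceed in parallel while still serializing the expensive collection step for a single repo. If the callers are asyncio tasks on one event loop rather than threads, an `asyncio.Lock` per key would be the analogous choice.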

zvigrinberg · 2026-05-11T13:53:52Z

+    def get_instance(cls, cache_path: str, tokenizer=False) -> "FullTextSearch":
+        """Return a cached instance for the given path, with LRU eviction.
+
+        On cache hit, moves the entry to the end (most recently used).
+        On cache miss, creates a new instance and evicts the least recently used
+        if the cache exceeds _INSTANCE_CACHE_MAX.
+        """
+        if cls._instances is None:
+            cls._instances = OrderedDict()
+        if cache_path in cls._instances:
+            cls._instances.move_to_end(cache_path)
+            return cls._instances[cache_path]
+        instance = cls(cache_path=cache_path, tokenizer=tokenizer)
+        cls._instances[cache_path] = instance
+        if len(cls._instances) > cls._INSTANCE_CACHE_MAX:
+            evicted = cls._instances.popitem(last=False)
+            logger.debug("Evicted LRU FullTextSearch instance: %s", evicted[0])
+        return instance
+


@tmihalac This method is called from two tools, and potentially from multiple tasks concurrently (checklist items).
The OrderedDict is not thread-safe, so the writes (move_to_end and popitem) should be made atomic (perform them within a locked scope).
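One way to read this suggestion is to wrap the whole lookup, insert, and evict sequence in a single lock. The sketch below is a generic stand-in, not the `FullTextSearch` class itself: `LRUInstanceCache`, its `factory` argument, and `keys()` are hypothetical names introduced for illustration.

```python
import threading
from collections import OrderedDict
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class LRUInstanceCache(Generic[T]):
    """Small LRU cache whose reads and writes share one lock."""

    def __init__(self, factory: Callable[[str], T], max_size: int) -> None:
        self._factory = factory
        self._max_size = max_size
        self._instances: "OrderedDict[str, T]" = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key: str) -> T:
        with self._lock:
            if key in self._instances:
                # Cache hit: mark as most recently used.
                self._instances.move_to_end(key)
                return self._instances[key]
            # Cache miss: create, insert, then evict the oldest if over capacity.
            instance = self._factory(key)
            self._instances[key] = instance
            if len(self._instances) > self._max_size:
                self._instances.popitem(last=False)
            return instance

    def keys(self) -> list[str]:
        """Snapshot of keys from least to most recently used."""
        with self._lock:
            return list(self._instances)
```

Holding the lock while the factory runs keeps the code simple and guarantees one instance per key, but it serializes construction across all keys. If construction is slow, a per-key lock for creation combined with a short global lock for the `OrderedDict` updates is a possible refinement.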

tmihalac · 2026-05-11T14:40:11Z

/test vulnerability-analysis-on-pr

zvigrinberg · 2026-05-11T14:42:53Z

@tmihalac Another thing I forgot to mention: please add the license header to the new source files (I've noticed it's absent in a few cases).

tmihalac · 2026-05-11T18:51:39Z

/test-heavy

tmihalac · 2026-05-12T05:43:36Z

/test-heavy

zvigrinberg · 2026-05-12T06:11:44Z

-                _config_files_cache[repo_key] = _collect_config_files(repo_key)
+            repo_key = (si.git_repo, si.ref)
+            if repo_key in _config_files_cache:
+                _config_files_cache.move_to_end(repo_key)


@tmihalac The move_to_end call should also be protected, as it's not thread/concurrency safe.

zvigrinberg · 2026-05-12T06:23:31Z

 Examples:
 - "Is the vulnerable function XStream.fromXML() called from application code?" → reachability
 - "Is the application configured to use the affected XML parser?" → code_understanding
 - "Does the application process untrusted XML input from external sources?" → code_understanding
 - "Can untrusted data reach BeanUtils.populate() through the call chain?" → reachability
 - "Is the vulnerable version of commons-beanutils installed?" → code_understanding
 - "Is the function parseXML() reachable from any HTTP handler?" → reachability
 - "Does the application enable external entity processing in its XML configuration?" → code_understanding
 - "Is SslHandler used in the application?" → reachability (SslHandler is from the vulnerable package)
 - "Is HttpPostStandardRequestDecoder.offer() called by application code?" → reachability
+- "Does the code sanitize input against path traversal characters and command injection metacharacters?" → code_understanding
+- "Are there input validation mechanisms to prevent malicious Markdown input from reaching the paragraph function?" → code_understanding
+- "Can a single request force the application server to load unbounded data into memory?" → code_understanding
+- "Can malformed input cause an unhandled exception that crashes or restarts the process?" → code_understanding


@tmihalac Can you add 2 more reachability examples and maybe remove 1 CU example, to balance the number of reachability and code understanding examples?

tmihalac · 2026-05-12T07:40:30Z

/test-heavy

tmihalac · 2026-05-12T10:15:48Z

/test-heavy

Added Code Understanding sub agent

fc058ed

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

tmihalac requested review from RedTanny and zvigrinberg May 6, 2026 10:27

Removed java tests added by mistake

6ea3704

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

Fix to prevent tools being called multiple time with the same input

cc04b4a

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

tmihalac added 2 commits May 7, 2026 17:27

Lowered the cpu requests for the confusion matrix

8603f64

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

Changed the cpu requests for the confusion matrix

d34f61a

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

Fixed GraphRecursionError error keeps happening

b18ff7b

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

Fixed Package filter prompt overflow

a777ce8

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

Fixed forced_finish_node token over limit size

cbbbe6b

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

RedTanny reviewed May 11, 2026

zvigrinberg requested changes May 11, 2026

tmihalac added 2 commits May 11, 2026 17:52

Added more tests following review

e166133

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

Added a note to _LoggingEmbeddingProxy following review

51e6f86

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

tmihalac added 10 commits May 11, 2026 18:11

Added more examples to dispatcher following review

21d54c4

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

Fixed IMPORT_USAGE_ANALYZER tool availability following review

f6f305d

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

Renamed the reachability prompts following review

bef2863

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

Renamed build_system_prompt func following review

52d35bf

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

Add warning that config files bigger than 500k are skipped, following…

d1e8a36

… review Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

Add concurrency handling to configuration_scanner following review

12f77d0

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

Add concurrency handling to configuration_scanner following review

1d13cb1

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

Add concurrency handling to full_text_search following review

9ee97d2

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

Added license info following review

d1c9af5

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

Added license info and removed unused imports following review

ad68555

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

zvigrinberg reviewed May 12, 2026

tmihalac added 2 commits May 12, 2026 09:58

Added lock to move_to_end call, following review

4d8cdcf

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

Balanced dispatcher examples, following review

929487b

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>

