Skip to content

Added Code Understanding sub agent#231

Open
tmihalac wants to merge 22 commits into
RHEcosystemAppEng:mainfrom
tmihalac:code-understanding
Open

Added Code Understanding sub agent#231
tmihalac wants to merge 22 commits into
RHEcosystemAppEng:mainfrom
tmihalac:code-understanding

Conversation

@tmihalac
Copy link
Copy Markdown

@tmihalac tmihalac commented May 6, 2026

  • Improve the general agent code understanding capabilities.
  • Identify code understanding questions, in order to route/dispatch to this sub-agent
  • focus on Collecting relevant context for code understanding questions.
  • Use Documentation tool with the embedding as one of its tool

Fixes https://redhat.atlassian.net/browse/APPENG-4532

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac tmihalac requested review from RedTanny and zvigrinberg May 6, 2026 10:27
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac
Copy link
Copy Markdown
Author

tmihalac commented May 6, 2026

/test vulnerability-analysis-on-pr

@tmihalac
Copy link
Copy Markdown
Author

tmihalac commented May 6, 2026

/test-heavy

@zvigrinberg
Copy link
Copy Markdown
Collaborator

/test vulnerability-analysis-on-pr

1 similar comment
@zvigrinberg
Copy link
Copy Markdown
Collaborator

/test vulnerability-analysis-on-pr

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac
Copy link
Copy Markdown
Author

tmihalac commented May 7, 2026

/test vulnerability-analysis-on-pr

@tmihalac
Copy link
Copy Markdown
Author

tmihalac commented May 7, 2026

/test-heavy

tmihalac added 2 commits May 7, 2026 17:27
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac
Copy link
Copy Markdown
Author

tmihalac commented May 7, 2026

/test-heavy

2 similar comments
@tmihalac
Copy link
Copy Markdown
Author

/test-heavy

@tmihalac
Copy link
Copy Markdown
Author

/test-heavy

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac
Copy link
Copy Markdown
Author

/test-heavy

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac
Copy link
Copy Markdown
Author

/test-heavy

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac
Copy link
Copy Markdown
Author

/test-heavy


logger = LoggingFactory.get_agent_logger(__name__)

_MAX_RHSA_CANDIDATES = 20
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Hi theo can you explain your edge case why you need to add a limit on _MAX_RHSA_CANDIDATES and why did you decided to limit only 20

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a case with ~1200 and most of them were not even libraries but images so the prompt was ~32k, claude suggested 20

@tmihalac
Copy link
Copy Markdown
Author

/test-heavy

@tmihalac
Copy link
Copy Markdown
Author

/test vulnerability-analysis-on-pr

Copy link
Copy Markdown
Collaborator

@zvigrinberg zvigrinberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Very good job!.
Please see my comments, most of them are minors

Comment thread src/exploit_iq_commons/utils/document_embedding.py

Question: {question}

Examples:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Maybe we can add another one example of code_understanding that is not related to xml or version checking? So it will not be biased toward XML or version only examples.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example:
Does the code sanitize input against both path traversal characters AND command injection metacharacters (; | & $ \ \n)?`,

Does the application decode URL-encoded input before validating pathcomponents?

Are there any input validation or sanitization mechanisms in place to prevent malicious Markdown input from reaching the paragraph function?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a single request force the application server to load unbounded data into memory?

Can malformed input cause an unhandled exception that crashes or restarts the process?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -1,28 +0,0 @@
import pytest
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Why did you delete this test?

Copy link
Copy Markdown
Author

@tmihalac tmihalac May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _has_c_cpp_sources method was used by TransitiveCodeSearcher to detect if a repo had C/C++ source files. It was removed during the refactoring, the C/C++ detection now relies on the ecosystem being set from the pipeline input (image.ecosystem), not from scanning repo files.

The logic of detecting the ecosystem moved to detect_ecosystem() in dep_tree.py

Will add an updated test file

Comment on lines +32 to +34
"Finds all imports and usage patterns of a package/module across sources. "
"Reports which files import it, usage count, and how it's used. "
"Searches all sources including framework dependencies."
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor - When code index empty, tool returns No source code indexed , and it let the LLM to infer and understand itself dynamically that the tool shouldn't be invoked and/or that the result is of no use.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


# ---- Prompt Templates ---- #

AGENT_SYS_PROMPT = (
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac I Would refactor across usages and rename it to REACHABILITY_AGENT_SYS_PROMPT, And also add comment that prefix that set of prompts for the reachability sub-agent, that states that these prompts are for the reachability sub-agent.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

OBSERVATION_NODE_PROMPT = COMPREHENSION_PROMPT

### --- End of REACT Prompt Templates ----#
def build_system_prompt(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac I Would rename + refactor to build_reachability_system_prompt as right now it's confusing.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

try:
with open(full_path, "r", errors="ignore") as f:
content = f.read()
if len(content) < 500_000:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a feasible assumption, but what happens if the config file is bigger than that? it's just ignores it.... Maybe worthwhile generating a warning that states the config file name is ignored because of its size?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

logger = LoggingFactory.get_agent_logger(__name__)


def format_context_snippet(lines: list[str], match_line: int, context_lines: int) -> str:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!

async def configuration_scanner(config: ConfigurationScannerToolConfig, builder: Builder):
from vuln_analysis.runtime_context import ctx_state, cu_source_scope

_config_files_cache: dict[str, list[tuple[str, str]]] = {}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential race condition because write operations on dict are not atomic and is possible when processing concurrently two different CVEs with same repo.

You should use lock on the repo_path/repo_key, similar like we do in https://github.com/tmihalac/vulnerability-analysis/blob/cbbbe6bc24b7214476e2b32b66d5176a629b0654/src/exploit_iq_commons/utils/document_embedding.py#L266-L280

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +113 to +131
def get_instance(cls, cache_path: str, tokenizer=False) -> "FullTextSearch":
"""Return a cached instance for the given path, with LRU eviction.

On cache hit, moves the entry to the end (most recently used).
On cache miss, creates a new instance and evicts the least recently used
if the cache exceeds _INSTANCE_CACHE_MAX.
"""
if cls._instances is None:
cls._instances = OrderedDict()
if cache_path in cls._instances:
cls._instances.move_to_end(cache_path)
return cls._instances[cache_path]
instance = cls(cache_path=cache_path, tokenizer=tokenizer)
cls._instances[cache_path] = instance
if len(cls._instances) > cls._INSTANCE_CACHE_MAX:
evicted = cls._instances.popitem(last=False)
logger.debug("Evicted LRU FullTextSearch instance: %s", evicted[0])
return instance

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac This method being called from two tools , and potentially from multiple tasks concurrently ( checklist items).
the ordered dict is not thread safe, this the write ( move_to_end) and pop item should become atomic ( make these within locked scope).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@tmihalac
Copy link
Copy Markdown
Author

/test vulnerability-analysis-on-pr

@zvigrinberg
Copy link
Copy Markdown
Collaborator

@tmihalac Another thing that i forgot to mention, please attach the license at the header of several new source files ( i've noticed it's absent in few cases.

tmihalac added 2 commits May 11, 2026 17:52
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
tmihalac added 10 commits May 11, 2026 18:11
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
… review

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac
Copy link
Copy Markdown
Author

/test-heavy

1 similar comment
@tmihalac
Copy link
Copy Markdown
Author

/test-heavy

_config_files_cache[repo_key] = _collect_config_files(repo_key)
repo_key = (si.git_repo, si.ref)
if repo_key in _config_files_cache:
_config_files_cache.move_to_end(repo_key)
Copy link
Copy Markdown
Collaborator

@zvigrinberg zvigrinberg May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac move_to_end method also should be protected as it's not thread/concurrency safe.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 58 to +71
Examples:
- "Is the vulnerable function XStream.fromXML() called from application code?" → reachability
- "Is the application configured to use the affected XML parser?" → code_understanding
- "Does the application process untrusted XML input from external sources?" → code_understanding
- "Can untrusted data reach BeanUtils.populate() through the call chain?" → reachability
- "Is the vulnerable version of commons-beanutils installed?" → code_understanding
- "Is the function parseXML() reachable from any HTTP handler?" → reachability
- "Does the application enable external entity processing in its XML configuration?" → code_understanding
- "Is SslHandler used in the application?" → reachability (SslHandler is from the vulnerable package)
- "Is HttpPostStandardRequestDecoder.offer() called by application code?" → reachability
- "Does the code sanitize input against path traversal characters and command injection metacharacters?" → code_understanding
- "Are there input validation mechanisms to prevent malicious Markdown input from reaching the paragraph function?" → code_understanding
- "Can a single request force the application server to load unbounded data into memory?" → code_understanding
- "Can malformed input cause an unhandled exception that crashes or restarts the process?" → code_understanding
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Can you add 2 more reachability examples and maybe remove 1 of CU for balancing the numbers of reachability and code understanding examples?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

tmihalac added 2 commits May 12, 2026 09:58
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac
Copy link
Copy Markdown
Author

/test-heavy

1 similar comment
@tmihalac
Copy link
Copy Markdown
Author

/test-heavy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants