Added Code Understanding sub agent#231
Conversation
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
|
/test vulnerability-analysis-on-pr |
|
/test-heavy |
|
/test vulnerability-analysis-on-pr |
1 similar comment
|
/test vulnerability-analysis-on-pr |
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
|
/test vulnerability-analysis-on-pr |
|
/test-heavy |
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
|
/test-heavy |
2 similar comments
|
/test-heavy |
|
/test-heavy |
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
|
/test-heavy |
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
|
/test-heavy |
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
|
/test-heavy |
|
|
||
| logger = LoggingFactory.get_agent_logger(__name__) | ||
|
|
||
| _MAX_RHSA_CANDIDATES = 20 |
There was a problem hiding this comment.
@tmihalac Hi theo can you explain your edge case why you need to add a limit on _MAX_RHSA_CANDIDATES and why did you decided to limit only 20
There was a problem hiding this comment.
There was a case with ~1200 and most of them were not even libraries but images so the prompt was ~32k, claude suggested 20
|
/test-heavy |
|
/test vulnerability-analysis-on-pr |
zvigrinberg
left a comment
There was a problem hiding this comment.
@tmihalac Very good job!.
Please see my comments, most of them are minors
|
|
||
| Question: {question} | ||
|
|
||
| Examples: |
There was a problem hiding this comment.
@tmihalac Maybe we can add another one example of code_understanding that is not related to xml or version checking? So it will not be biased toward XML or version only examples.
There was a problem hiding this comment.
For example:
Does the code sanitize input against both path traversal characters AND command injection metacharacters (; | & $ \ \n)?`,
Does the application decode URL-encoded input before validating pathcomponents?
Are there any input validation or sanitization mechanisms in place to prevent malicious Markdown input from reaching the paragraph function?
There was a problem hiding this comment.
Can a single request force the application server to load unbounded data into memory?
Can malformed input cause an unhandled exception that crashes or restarts the process?
| @@ -1,28 +0,0 @@ | |||
| import pytest | |||
There was a problem hiding this comment.
The _has_c_cpp_sources method was used by TransitiveCodeSearcher to detect if a repo had C/C++ source files. It was removed during the refactoring, the C/C++ detection now relies on the ecosystem being set from the pipeline input (image.ecosystem), not from scanning repo files.
The logic of detecting the ecosystem moved to detect_ecosystem() in dep_tree.py
Will add an updated test file
| "Finds all imports and usage patterns of a package/module across sources. " | ||
| "Reports which files import it, usage count, and how it's used. " | ||
| "Searches all sources including framework dependencies." |
There was a problem hiding this comment.
Minor - When code index empty, tool returns No source code indexed , and it let the LLM to infer and understand itself dynamically that the tool shouldn't be invoked and/or that the result is of no use.
|
|
||
| # ---- Prompt Templates ---- # | ||
|
|
||
| AGENT_SYS_PROMPT = ( |
There was a problem hiding this comment.
@tmihalac I Would refactor across usages and rename it to REACHABILITY_AGENT_SYS_PROMPT, And also add comment that prefix that set of prompts for the reachability sub-agent, that states that these prompts are for the reachability sub-agent.
| OBSERVATION_NODE_PROMPT = COMPREHENSION_PROMPT | ||
|
|
||
| ### --- End of REACT Prompt Templates ----# | ||
| def build_system_prompt( |
There was a problem hiding this comment.
@tmihalac I Would rename + refactor to build_reachability_system_prompt as right now it's confusing.
| try: | ||
| with open(full_path, "r", errors="ignore") as f: | ||
| content = f.read() | ||
| if len(content) < 500_000: |
There was a problem hiding this comment.
This is a feasible assumption, but what happens if the config file is bigger than that? it's just ignores it.... Maybe worthwhile generating a warning that states the config file name is ignored because of its size?
| logger = LoggingFactory.get_agent_logger(__name__) | ||
|
|
||
|
|
||
| def format_context_snippet(lines: list[str], match_line: int, context_lines: int) -> str: |
| async def configuration_scanner(config: ConfigurationScannerToolConfig, builder: Builder): | ||
| from vuln_analysis.runtime_context import ctx_state, cu_source_scope | ||
|
|
||
| _config_files_cache: dict[str, list[tuple[str, str]]] = {} |
There was a problem hiding this comment.
Potential race condition because write operations on dict are not atomic and is possible when processing concurrently two different CVEs with same repo.
You should use lock on the repo_path/repo_key, similar like we do in https://github.com/tmihalac/vulnerability-analysis/blob/cbbbe6bc24b7214476e2b32b66d5176a629b0654/src/exploit_iq_commons/utils/document_embedding.py#L266-L280
| def get_instance(cls, cache_path: str, tokenizer=False) -> "FullTextSearch": | ||
| """Return a cached instance for the given path, with LRU eviction. | ||
|
|
||
| On cache hit, moves the entry to the end (most recently used). | ||
| On cache miss, creates a new instance and evicts the least recently used | ||
| if the cache exceeds _INSTANCE_CACHE_MAX. | ||
| """ | ||
| if cls._instances is None: | ||
| cls._instances = OrderedDict() | ||
| if cache_path in cls._instances: | ||
| cls._instances.move_to_end(cache_path) | ||
| return cls._instances[cache_path] | ||
| instance = cls(cache_path=cache_path, tokenizer=tokenizer) | ||
| cls._instances[cache_path] = instance | ||
| if len(cls._instances) > cls._INSTANCE_CACHE_MAX: | ||
| evicted = cls._instances.popitem(last=False) | ||
| logger.debug("Evicted LRU FullTextSearch instance: %s", evicted[0]) | ||
| return instance | ||
|
|
There was a problem hiding this comment.
@tmihalac This method being called from two tools , and potentially from multiple tasks concurrently ( checklist items).
the ordered dict is not thread safe, this the write ( move_to_end) and pop item should become atomic ( make these within locked scope).
|
/test vulnerability-analysis-on-pr |
|
@tmihalac Another thing that i forgot to mention, please attach the license at the header of several new source files ( i've noticed it's absent in few cases. |
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
… review Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
|
/test-heavy |
1 similar comment
|
/test-heavy |
| _config_files_cache[repo_key] = _collect_config_files(repo_key) | ||
| repo_key = (si.git_repo, si.ref) | ||
| if repo_key in _config_files_cache: | ||
| _config_files_cache.move_to_end(repo_key) |
There was a problem hiding this comment.
@tmihalac move_to_end method also should be protected as it's not thread/concurrency safe.
| Examples: | ||
| - "Is the vulnerable function XStream.fromXML() called from application code?" → reachability | ||
| - "Is the application configured to use the affected XML parser?" → code_understanding | ||
| - "Does the application process untrusted XML input from external sources?" → code_understanding | ||
| - "Can untrusted data reach BeanUtils.populate() through the call chain?" → reachability | ||
| - "Is the vulnerable version of commons-beanutils installed?" → code_understanding | ||
| - "Is the function parseXML() reachable from any HTTP handler?" → reachability | ||
| - "Does the application enable external entity processing in its XML configuration?" → code_understanding | ||
| - "Is SslHandler used in the application?" → reachability (SslHandler is from the vulnerable package) | ||
| - "Is HttpPostStandardRequestDecoder.offer() called by application code?" → reachability | ||
| - "Does the code sanitize input against path traversal characters and command injection metacharacters?" → code_understanding | ||
| - "Are there input validation mechanisms to prevent malicious Markdown input from reaching the paragraph function?" → code_understanding | ||
| - "Can a single request force the application server to load unbounded data into memory?" → code_understanding | ||
| - "Can malformed input cause an unhandled exception that crashes or restarts the process?" → code_understanding |
There was a problem hiding this comment.
@tmihalac Can you add 2 more reachability examples and maybe remove 1 of CU for balancing the numbers of reachability and code understanding examples?
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
|
/test-heavy |
1 similar comment
|
/test-heavy |
Fixes https://redhat.atlassian.net/browse/APPENG-4532