This issue tracks progress on an evaluation framework for the code generation functionality.

- [ ] Pass the RAG data (#3) to a large-ish LLM
- [ ] Ingest the codebase and its dependencies
- [ ] Delete functions and ask the LLM to recreate them, using the project's own tests to evaluate the result (see the sketch at the bottom of this issue)
- [ ] Any other coding benchmarks we can think of, with a focus on using the contextual data

Resources:
- https://aider.chat
- https://github.com/FSoft-AI4Code/RepoHyper/
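
A minimal sketch of the "delete a function, regenerate it, run the tests" item, assuming Python target repos and pytest as the test runner. `remove_function`, `generate_function`, and `evaluate` are hypothetical names introduced here for illustration; `generate_function` is a stub for whatever LLM call (fed the RAG context from #3) we end up using.

```python
import ast
import pathlib
import shutil
import subprocess
import tempfile


def remove_function(source: str, name: str) -> tuple[str, str]:
    """Return the source with `name` cut out, plus the removed function text."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            lines = source.splitlines(keepends=True)
            removed = "".join(lines[node.lineno - 1:node.end_lineno])
            kept = "".join(lines[:node.lineno - 1] + lines[node.end_lineno:])
            return kept, removed
    raise ValueError(f"function {name!r} not found")


def generate_function(gutted_source: str, name: str) -> str:
    """Placeholder for the LLM call; would take the gutted file plus RAG context."""
    raise NotImplementedError


def evaluate(repo: pathlib.Path, rel_path: str, func_name: str) -> bool:
    """Copy the repo, cut one function, regenerate it, and run the project's tests."""
    with tempfile.TemporaryDirectory() as tmp:
        work = pathlib.Path(tmp) / repo.name
        shutil.copytree(repo, work)
        target = work / rel_path
        gutted, _original = remove_function(target.read_text(), func_name)
        # Appending at the end of the file is a simplification; a real harness
        # would splice the regenerated body back into its original position.
        target.write_text(gutted + "\n" + generate_function(gutted, func_name))
        result = subprocess.run(["pytest", "-q"], cwd=work)
        return result.returncode == 0  # pass iff the project's own tests still pass
```

A per-repo score could then be the fraction of deleted functions whose regenerated versions keep the test suite green.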