
Inference for InfiCoder-Eval

The InfiCoder Team

Features

This is a very lightweight fork of bigcode-evaluation-harness to support inference on InfiCoder-Eval benchmark prompts.

The setup process and prerequisites are the same as for the original bigcode-evaluation-harness framework. Only minor changes were made to the original code (e.g., support for max_new_tokens and always enabling use_cache during generation), along with the addition of the InfiCoder-Eval tasks.
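
For reference, max_new_tokens caps only the newly generated tokens (unlike max_length, which also counts the prompt), and use_cache reuses past key/value states during decoding. Below is a minimal sketch of these two settings using the stock transformers generate API, not the harness's internal code:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)

inputs = tokenizer("How do I reverse a list in Python?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=512,  # budget for generated tokens only, prompt excluded
    use_cache=True,      # reuse past key/value states for faster decoding
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))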

New tasks for InfiCoder-Eval:

  • code-ffqa-v2

    The default task; the prompt is system_prompt + '\n' + content_prompt (see the sketch after this list).

  • code-ffqa-v2-endn

    Same as the default, but with a trailing newline: system_prompt + '\n' + content_prompt + '\n'.

  • code-ffqa-v2-deepseek-chat

    The deepseek-coder-instruct prompt format.

  • code-ffqa-v2-baichuan2

    The Baichuan2 model prompt format.

  • code-ffqa-v2-zypher

    The zephyr-7b-beta prompt format.

  • code-ffqa-v2-octo

    The OctoPack model prompt format.

  • code-ffqa-v2-wizard

    The WizardCoder-Python model prompt format.

  • code-ffqa-v2-phi

    The phi-1.5 model prompt format.

  • code-ffqa-v2-inficoder

    Our InfiCoder model prompt format.
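
A minimal sketch of how the two generic variants above assemble the prompt (system_prompt and content_prompt stand for the fields supplied with each benchmark case; the model-specific tasks substitute their own chat templates):

def build_prompt(system_prompt, content_prompt, end_newline=False):
    # code-ffqa-v2: system prompt and question joined by a single newline
    prompt = system_prompt + '\n' + content_prompt
    if end_newline:
        # code-ffqa-v2-endn: identical, plus a trailing newline
        prompt += '\n'
    return prompt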

For detailed information, please visit InfiCoder-Eval.

Usage

For InfiCoder-Eval, this framework is used only for response generation. The actual evaluation is delegated to our Evaluation Repo, which can be deployed on the same instance or a separate one.

An example usage can be found in run.sh:

# This shell exemplifies how to run the inference for inficoder-eval with this repo
# see detailed instructions in https://infi-coder.github.io/inficoder-eval/

export DATASET_CSV_PATH=..../inficoder-eval-framework/batched_prompts/suite_v2.0.0_dev.csv

# for example, to evaluate Phi-1.5
# first, generate responses
accelerate launch ..../ffqa-evaluation-harness/main.py \
  --model microsoft/phi-1_5 \
  --tasks code-ffqa-v2-phi \
  --batch_size 16 \
  --precision bf16 \
  --n_samples 30 \
  --do_sample True \
  --temperature 0.2 \
  --top_p 0.9 \
  --save_generations \
  --save_references \
  --trust_remote_code \
  --generation_only \
  --max_length_generation 2048 \
  --save_generations_path generations_phi-1_5.json \
  --eos='<|endoftext|>'

# then, join the generations with case names and output a CSV file that the evaluation framework can process
python3 ffqa_processor.py generations_phi-1_5.json references.json ../phi-1_5_output.csv --eos '<|endoftext|>'
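
For intuition, here is a rough sketch of the join step. It assumes the stock harness layout where generations_*.json holds one list of samples per case, and (an assumption here) that references.json carries the case names; the authoritative logic lives in ffqa_processor.py:

import csv
import json
import sys

def join_generations(gen_path, ref_path, out_csv, eos):
    with open(gen_path) as f:
        generations = json.load(f)  # list of per-case sample lists
    with open(ref_path) as f:
        case_names = json.load(f)   # assumed: one case name per entry
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["case", "sample_id", "response"])
        for case, samples in zip(case_names, generations):
            for i, text in enumerate(samples):
                # trim everything after the end-of-sequence marker
                writer.writerow([case, i, text.split(eos)[0]])

if __name__ == "__main__":
    join_generations(sys.argv[1], sys.argv[2], sys.argv[3], eos="<|endoftext|>")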

A detailed illustration is on our project page: https://infi-coder.github.io/inficoder-eval/.

Implementing new tasks

To implement a new task or prompting method for InfiCoder-Eval, please read and modify bigcode_eval/tasks/code_ffqa_v200.py. For generic task extensions, see the guide in docs/guide. There are also contribution guidelines in CONTRIBUTING.md.
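
For orientation, a new prompt-format task in this harness is a small Task subclass. The sketch below is illustrative only; the class name, CSV field names, and chat markers are assumptions, and the real definitions live in bigcode_eval/tasks/code_ffqa_v200.py:

import os

from datasets import load_dataset

from bigcode_eval.base import Task

class CodeFFQAMyFormat(Task):
    """Illustrative prompt-format variant, not the repo's actual class."""

    DATASET_PATH = None  # prompts come from the benchmark CSV instead

    def __init__(self):
        super().__init__(stop_words=["<|endoftext|>"], requires_execution=False)

    def get_dataset(self):
        # load the prompt suite exported by the evaluation framework
        return load_dataset("csv", data_files=os.environ["DATASET_CSV_PATH"])["train"]

    def get_prompt(self, doc):
        # swap in the target model's chat template here (field name assumed)
        return f"<|user|>\n{doc['prompt']}\n<|assistant|>\n"

    def get_reference(self, doc):
        # scoring is delegated to the evaluation repo, so the reference
        # only needs to identify the case (field name assumed)
        return doc["case_name"]

    def postprocess_generation(self, generation, idx):
        # strip the prompt prefix so only the model's answer is kept
        prompt = self.get_prompt(self.get_dataset()[idx])
        return generation[len(prompt):]

    def process_results(self, generations, references):
        # generation-only workflow: evaluation happens outside this repo
        return {}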

In the long term, we plan to integrate the InfiCoder-Eval evaluation framework into this repo and merge the benchmark into the official bigcode-evaluation-harness. If you are interested in this effort, you are more than welcome to contact us!

Acknowledgements

We thank the BigCode team for developing such a great framework, and EleutherAI for their work on the lm-evaluation-harness, upon which this repository is built.

Cite as

@misc{li2023inficodereval,
  author = {InfiCoderTeam},
  title = {InfiCoder-Eval: Systematically Evaluating Question-Answering for Code Large Language Models},
  year = {2023},
  publisher = {GitHub Pages},
  howpublished = {\url{https://infi-coder.github.io/inficoder-eval/}}
}
