The following is the complete log from the end of training, including the loss entries, one validation result, and the final test results.
{'eval_loss': 1.3073723316192627, 'eval_ccot_exact_match': 0.28, 'eval_cot_exact_match': 0.638, 'eval_runtime': 2.1683, 'eval_samples_per_second': 230.591, 'eval_steps_per_second': 3.689, 'epoch': 9.96}
100%|█████████▉| 30000/30130 [3:47:36<01:51, 1.17it/s]
100%|██████████| 8/8 [00:01<00:00, 4.58it/s]
[INFO|trainer.py:3812] 2026-01-08 13:42:55,416 >> Saving model checkpoint to outputs/pcot-llama1binst-lora-3-24/checkpoint-30000
[INFO|hub.py:363] 2026-01-08 13:42:55,594 >> Offline mode: forcing local_files_only=True
[INFO|configuration_utils.py:677] 2026-01-08 13:42:55,597 >> loading configuration file /media/cfs/products-understanding-nlp/gaotianhao/huggingface/Llama-3.2-1B-Instruct/config.json
[WARNING|configuration_utils.py:547] 2026-01-08 13:42:55,597 >> You are using a model of type llama to instantiate a model of type pccot-llama. This is not supported for all configurations of models and can yield errors.
[INFO|configuration_utils.py:746] 2026-01-08 13:42:55,598 >> Model config PCCoTLlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
],
"head_dim": 64,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 8192,
"loss_alpha": 1.0,
"loss_beta": 1.0,
"loss_gamma": 1.0,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "pccot-llama",
"num_attention_heads": 32,
"num_hidden_layers": 16,
"num_iterations": 6,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 32.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.3",
"use_cache": true,
"use_layerwise_std": true,
"use_projection": false,
"vocab_size": 128256
}
/usr/local/miniconda3/lib/python3.10/site-packages/peft/utils/save_and_load.py:300: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning.
warnings.warn(
[INFO|tokenization_utils_base.py:2646] 2026-01-08 13:43:00,169 >> tokenizer config file saved in outputs/pcot-llama1binst-lora-3-24/checkpoint-30000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2655] 2026-01-08 13:43:00,183 >> Special tokens file saved in outputs/pcot-llama1binst-lora-3-24/checkpoint-30000/special_tokens_map.json
[INFO|trainer.py:3904] 2026-01-08 13:43:02,875 >> Deleting older checkpoint [outputs/pcot-llama1binst-lora-3-24/checkpoint-29000] due to args.save_total_limit
100%|█████████▉| 30001/30130 [3:47:44<08:13, 3.83s/it]
100%|█████████▉| 30002/30130 [3:47:45<06:13, 2.92s/it]
100%|█████████▉| 30003/30130 [3:47:46<04:50, 2.29s/it]
100%|█████████▉| 30004/30130 [3:47:47<03:55, 1.87s/it]
100%|█████████▉| 30005/30130 [3:47:48<03:13, 1.54s/it]
100%|█████████▉| 30006/30130 [3:47:49<02:49, 1.37s/it]
100%|█████████▉| 30007/30130 [3:47:50<02:28, 1.21s/it]
100%|█████████▉| 30008/30130 [3:47:50<02:10, 1.07s/it]
100%|█████████▉| 30009/30130 [3:47:51<01:57, 1.03it/s]
100%|█████████▉| 30010/30130 [3:47:52<01:50, 1.08it/s]
100%|█████████▉| 30011/30130 [3:47:53<01:44, 1.14it/s]
100%|█████████▉| 30012/30130 [3:47:53<01:39, 1.19it/s]
100%|█████████▉| 30013/30130 [3:47:54<01:36, 1.22it/s]
100%|█████████▉| 30014/30130 [3:47:55<01:33, 1.24it/s]
100%|█████████▉| 30015/30130 [3:47:56<01:32, 1.25it/s]
100%|█████████▉| 30016/30130 [3:47:56<01:27, 1.30it/s]
100%|█████████▉| 30017/30130 [3:47:57<01:27, 1.29it/s]
100%|█████████▉| 30018/30130 [3:47:58<01:28, 1.26it/s]
100%|█████████▉| 30019/30130 [3:47:59<01:28, 1.26it/s]
100%|█████████▉| 30020/30130 [3:48:00<01:27, 1.26it/s]
{'loss': 0.2209, 'grad_norm': 0.8483083248138428, 'learning_rate': 2.7962197024145042e-09, 'epoch': 9.96}
100%|█████████▉| 30020/30130 [3:48:00<01:27, 1.26it/s]
100%|█████████▉| 30021/30130 [3:48:00<01:28, 1.24it/s]
100%|█████████▉| 30022/30130 [3:48:01<01:28, 1.22it/s]
100%|█████████▉| 30023/30130 [3:48:02<01:30, 1.18it/s]
100%|█████████▉| 30024/30130 [3:48:03<01:29, 1.19it/s]
100%|█████████▉| 30025/30130 [3:48:04<01:27, 1.20it/s]
100%|█████████▉| 30026/30130 [3:48:05<01:26, 1.20it/s]
100%|█████████▉| 30027/30130 [3:48:05<01:24, 1.22it/s]
100%|█████████▉| 30028/30130 [3:48:06<01:23, 1.22it/s]
100%|█████████▉| 30029/30130 [3:48:07<01:24, 1.19it/s]
100%|█████████▉| 30030/30130 [3:48:08<01:24, 1.18it/s]
100%|█████████▉| 30031/30130 [3:48:09<01:23, 1.19it/s]
100%|█████████▉| 30032/30130 [3:48:10<01:23, 1.17it/s]
100%|█████████▉| 30033/30130 [3:48:11<01:21, 1.19it/s]
100%|█████████▉| 30034/30130 [3:48:11<01:22, 1.17it/s]
100%|█████████▉| 30035/30130 [3:48:12<01:18, 1.22it/s]
100%|█████████▉| 30036/30130 [3:48:13<01:15, 1.24it/s]
100%|█████████▉| 30037/30130 [3:48:14<01:15, 1.23it/s]
100%|█████████▉| 30038/30130 [3:48:15<01:18, 1.18it/s]
100%|█████████▉| 30039/30130 [3:48:16<01:17, 1.17it/s]
100%|█████████▉| 30040/30130 [3:48:16<01:14, 1.21it/s]
{'loss': 0.2132, 'grad_norm': 0.835830807685852, 'learning_rate': 1.871856762476476e-09, 'epoch': 9.97}
100%|█████████▉| 30040/30130 [3:48:16<01:14, 1.21it/s]
100%|█████████▉| 30041/30130 [3:48:17<01:14, 1.19it/s]
100%|█████████▉| 30042/30130 [3:48:18<01:12, 1.22it/s]
100%|█████████▉| 30043/30130 [3:48:19<01:12, 1.20it/s]
100%|█████████▉| 30044/30130 [3:48:20<01:10, 1.23it/s]
100%|█████████▉| 30045/30130 [3:48:20<01:08, 1.23it/s]
100%|█████████▉| 30046/30130 [3:48:21<01:09, 1.21it/s]
100%|█████████▉| 30047/30130 [3:48:22<01:08, 1.21it/s]
100%|█████████▉| 30048/30130 [3:48:23<01:06, 1.23it/s]
100%|█████████▉| 30049/30130 [3:48:24<01:06, 1.22it/s]
100%|█████████▉| 30050/30130 [3:48:25<01:05, 1.22it/s]
100%|█████████▉| 30051/30130 [3:48:25<01:06, 1.19it/s]
100%|█████████▉| 30052/30130 [3:48:26<01:05, 1.19it/s]
100%|█████████▉| 30053/30130 [3:48:27<01:07, 1.15it/s]
100%|█████████▉| 30054/30130 [3:48:28<01:08, 1.11it/s]
100%|█████████▉| 30055/30130 [3:48:29<01:05, 1.14it/s]
100%|█████████▉| 30056/30130 [3:48:30<01:11, 1.03it/s]
100%|█████████▉| 30057/30130 [3:48:31<01:09, 1.04it/s]
100%|█████████▉| 30058/30130 [3:48:32<01:10, 1.02it/s]
100%|█████████▉| 30059/30130 [3:48:33<01:05, 1.09it/s]
100%|█████████▉| 30060/30130 [3:48:34<01:01, 1.14it/s]
{'loss': 0.217, 'grad_norm': 0.6908845901489258, 'learning_rate': 1.1323612836955379e-09, 'epoch': 9.98}
100%|█████████▉| 30060/30130 [3:48:34<01:01, 1.14it/s]
100%|█████████▉| 30061/30130 [3:48:35<00:58, 1.17it/s]
100%|█████████▉| 30062/30130 [3:48:35<00:58, 1.17it/s]
100%|█████████▉| 30063/30130 [3:48:36<00:57, 1.16it/s]
100%|█████████▉| 30064/30130 [3:48:37<00:57, 1.16it/s]
100%|█████████▉| 30065/30130 [3:48:38<00:54, 1.19it/s]
100%|█████████▉| 30066/30130 [3:48:39<00:53, 1.20it/s]
100%|█████████▉| 30067/30130 [3:48:40<00:55, 1.13it/s]
100%|█████████▉| 30068/30130 [3:48:41<00:54, 1.13it/s]
100%|█████████▉| 30069/30130 [3:48:41<00:53, 1.15it/s]
100%|█████████▉| 30070/30130 [3:48:42<00:50, 1.18it/s]
100%|█████████▉| 30071/30130 [3:48:43<00:50, 1.18it/s]
100%|█████████▉| 30072/30130 [3:48:44<00:48, 1.21it/s]
100%|█████████▉| 30073/30130 [3:48:45<00:51, 1.11it/s]
100%|█████████▉| 30074/30130 [3:48:46<00:50, 1.11it/s]
100%|█████████▉| 30075/30130 [3:48:47<00:49, 1.11it/s]
100%|█████████▉| 30076/30130 [3:48:47<00:46, 1.17it/s]
100%|█████████▉| 30077/30130 [3:48:48<00:44, 1.18it/s]
100%|█████████▉| 30078/30130 [3:48:49<00:43, 1.21it/s]
100%|█████████▉| 30079/30130 [3:48:50<00:41, 1.24it/s]
100%|█████████▉| 30080/30130 [3:48:51<00:41, 1.21it/s]
{'loss': 0.2182, 'grad_norm': 0.8773943185806274, 'learning_rate': 5.777366839465614e-10, 'epoch': 9.98}
100%|█████████▉| 30080/30130 [3:48:51<00:41, 1.21it/s]
100%|█████████▉| 30081/30130 [3:48:52<00:40, 1.22it/s]
100%|█████████▉| 30082/30130 [3:48:52<00:38, 1.23it/s]
100%|█████████▉| 30083/30130 [3:48:53<00:39, 1.18it/s]
100%|█████████▉| 30084/30130 [3:48:54<00:37, 1.21it/s]
100%|█████████▉| 30085/30130 [3:48:55<00:36, 1.23it/s]
100%|█████████▉| 30086/30130 [3:48:56<00:35, 1.25it/s]
100%|█████████▉| 30087/30130 [3:48:56<00:35, 1.22it/s]
100%|█████████▉| 30088/30130 [3:48:57<00:33, 1.24it/s]
100%|█████████▉| 30089/30130 [3:48:58<00:34, 1.18it/s]
100%|█████████▉| 30090/30130 [3:48:59<00:32, 1.22it/s]
100%|█████████▉| 30091/30130 [3:49:00<00:33, 1.17it/s]
100%|█████████▉| 30092/30130 [3:49:01<00:32, 1.18it/s]
100%|█████████▉| 30093/30130 [3:49:01<00:30, 1.21it/s]
100%|█████████▉| 30094/30130 [3:49:02<00:30, 1.20it/s]
100%|█████████▉| 30095/30130 [3:49:03<00:28, 1.21it/s]
100%|█████████▉| 30096/30130 [3:49:04<00:28, 1.21it/s]
100%|█████████▉| 30097/30130 [3:49:05<00:26, 1.24it/s]
100%|█████████▉| 30098/30130 [3:49:06<00:25, 1.25it/s]
100%|█████████▉| 30099/30130 [3:49:06<00:24, 1.24it/s]
100%|█████████▉| 30100/30130 [3:49:07<00:24, 1.22it/s]
{'loss': 0.2089, 'grad_norm': 0.8249523639678955, 'learning_rate': 2.0798552665013404e-10, 'epoch': 9.99}
100%|█████████▉| 30100/30130 [3:49:07<00:24, 1.22it/s]
100%|█████████▉| 30101/30130 [3:49:08<00:23, 1.24it/s]
100%|█████████▉| 30102/30130 [3:49:09<00:22, 1.24it/s]
100%|█████████▉| 30103/30130 [3:49:10<00:21, 1.25it/s]
100%|█████████▉| 30104/30130 [3:49:10<00:21, 1.22it/s]
100%|█████████▉| 30105/30130 [3:49:11<00:20, 1.25it/s]
100%|█████████▉| 30106/30130 [3:49:12<00:19, 1.24it/s]
100%|█████████▉| 30107/30130 [3:49:13<00:18, 1.25it/s]
100%|█████████▉| 30108/30130 [3:49:14<00:17, 1.24it/s]
100%|█████████▉| 30109/30130 [3:49:14<00:16, 1.26it/s]
100%|█████████▉| 30110/30130 [3:49:15<00:16, 1.24it/s]
100%|█████████▉| 30111/30130 [3:49:16<00:15, 1.26it/s]
100%|█████████▉| 30112/30130 [3:49:17<00:14, 1.26it/s]
100%|█████████▉| 30113/30130 [3:49:18<00:13, 1.26it/s]
100%|█████████▉| 30114/30130 [3:49:18<00:12, 1.29it/s]
100%|█████████▉| 30115/30130 [3:49:19<00:11, 1.27it/s]
100%|█████████▉| 30116/30130 [3:49:20<00:11, 1.24it/s]
100%|█████████▉| 30117/30130 [3:49:21<00:10, 1.20it/s]
100%|█████████▉| 30118/30130 [3:49:22<00:09, 1.22it/s]
100%|█████████▉| 30119/30130 [3:49:22<00:08, 1.27it/s]
100%|█████████▉| 30120/30130 [3:49:23<00:07, 1.28it/s]
{'loss': 0.2235, 'grad_norm': 0.7961286902427673, 'learning_rate': 2.3109520763675565e-11, 'epoch': 10.0}
100%|█████████▉| 30120/30130 [3:49:23<00:07, 1.28it/s]
100%|█████████▉| 30121/30130 [3:49:24<00:06, 1.30it/s]
100%|█████████▉| 30122/30130 [3:49:25<00:06, 1.26it/s]
100%|█████████▉| 30123/30130 [3:49:26<00:05, 1.23it/s]
100%|█████████▉| 30124/30130 [3:49:26<00:04, 1.23it/s]
100%|█████████▉| 30125/30130 [3:49:27<00:04, 1.24it/s]
100%|█████████▉| 30126/30130 [3:49:28<00:03, 1.25it/s]
100%|█████████▉| 30127/30130 [3:49:29<00:02, 1.20it/s]
100%|█████████▉| 30128/30130 [3:49:30<00:01, 1.23it/s]
100%|█████████▉| 30129/30130 [3:49:31<00:00, 1.19it/s]
100%|██████████| 30130/30130 [3:49:31<00:00, 1.15it/s]
[INFO|trainer.py:3812] 2026-01-08 13:44:50,982 >> Saving model checkpoint to outputs/pcot-llama1binst-lora-3-24/checkpoint-30130
/usr/local/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
/usr/local/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
/usr/local/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[INFO|hub.py:363] 2026-01-08 13:44:51,060 >> Offline mode: forcing local_files_only=True
[INFO|configuration_utils.py:677] 2026-01-08 13:44:51,063 >> loading configuration file /media/cfs/products-understanding-nlp/gaotianhao/huggingface/Llama-3.2-1B-Instruct/config.json
[WARNING|configuration_utils.py:547] 2026-01-08 13:44:51,063 >> You are using a model of type llama to instantiate a model of type pccot-llama. This is not supported for all configurations of models and can yield errors.
[INFO|configuration_utils.py:746] 2026-01-08 13:44:51,064 >> Model config PCCoTLlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
],
"head_dim": 64,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 8192,
"loss_alpha": 1.0,
"loss_beta": 1.0,
"loss_gamma": 1.0,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "pccot-llama",
"num_attention_heads": 32,
"num_hidden_layers": 16,
"num_iterations": 6,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 32.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.3",
"use_cache": true,
"use_layerwise_std": true,
"use_projection": false,
"vocab_size": 128256
}
/usr/local/miniconda3/lib/python3.10/site-packages/peft/utils/save_and_load.py:300: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning.
warnings.warn(
[INFO|tokenization_utils_base.py:2646] 2026-01-08 13:44:55,466 >> tokenizer config file saved in outputs/pcot-llama1binst-lora-3-24/checkpoint-30130/tokenizer_config.json
[INFO|tokenization_utils_base.py:2655] 2026-01-08 13:44:55,480 >> Special tokens file saved in outputs/pcot-llama1binst-lora-3-24/checkpoint-30130/special_tokens_map.json
[INFO|trainer.py:3904] 2026-01-08 13:44:57,627 >> Deleting older checkpoint [outputs/pcot-llama1binst-lora-3-24/checkpoint-30000] due to args.save_total_limit
[INFO|trainer.py:2591] 2026-01-08 13:44:57,822 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
/usr/local/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[INFO|trainer.py:2829] 2026-01-08 13:44:57,822 >> Loading best model from outputs/pcot-llama1binst-lora-3-24/checkpoint-22000 (score: 0.3).
{'train_runtime': 13779.1716, 'train_samples_per_second': 279.857, 'train_steps_per_second': 2.187, 'train_loss': 0.14304465817091594, 'epoch': 10.0}
100%|██████████| 30130/30130 [3:49:39<00:00, 1.15it/s]
[INFO|trainer.py:2632] 2026-01-08 13:44:58,232 >> Deleting older checkpoint [outputs/pcot-llama1binst-lora-3-24/checkpoint-30130] due to args.save_total_limit
100%|██████████| 30130/30130 [3:49:39<00:00, 2.19it/s]
[INFO|trainer.py:3812] 2026-01-08 13:44:58,427 >> Saving model checkpoint to outputs/pcot-llama1binst-lora-3-24
[INFO|hub.py:363] 2026-01-08 13:44:58,469 >> Offline mode: forcing local_files_only=True
[INFO|configuration_utils.py:677] 2026-01-08 13:44:58,472 >> loading configuration file /media/cfs/products-understanding-nlp/gaotianhao/huggingface/Llama-3.2-1B-Instruct/config.json
[WARNING|configuration_utils.py:547] 2026-01-08 13:44:58,472 >> You are using a model of type llama to instantiate a model of type pccot-llama. This is not supported for all configurations of models and can yield errors.
[INFO|configuration_utils.py:746] 2026-01-08 13:44:58,472 >> Model config PCCoTLlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
],
"head_dim": 64,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 8192,
"loss_alpha": 1.0,
"loss_beta": 1.0,
"loss_gamma": 1.0,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "pccot-llama",
"num_attention_heads": 32,
"num_hidden_layers": 16,
"num_iterations": 6,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 32.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.3",
"use_cache": true,
"use_layerwise_std": true,
"use_projection": false,
"vocab_size": 128256
}
/usr/local/miniconda3/lib/python3.10/site-packages/peft/utils/save_and_load.py:300: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning.
warnings.warn(
[INFO|tokenization_utils_base.py:2646] 2026-01-08 13:45:02,238 >> tokenizer config file saved in outputs/pcot-llama1binst-lora-3-24/tokenizer_config.json
[INFO|tokenization_utils_base.py:2655] 2026-01-08 13:45:02,252 >> Special tokens file saved in outputs/pcot-llama1binst-lora-3-24/special_tokens_map.json
[INFO|configuration_utils.py:414] 2026-01-08 13:45:02,487 >> Configuration saved in outputs/pcot-llama1binst-lora-3-24/config.json
***** train metrics *****
epoch = 10.0
total_flos = 2608683236GF
train_loss = 0.143
train_runtime = 3:49:39.17
train_samples = 385620
train_samples_per_second = 279.857
train_steps_per_second = 2.187
01/08/2026 13:45:02 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:4128] 2026-01-08 13:45:02,571 >>
***** Running Evaluation *****
[INFO|trainer.py:4130] 2026-01-08 13:45:02,571 >> Num examples = 500
[INFO|trainer.py:4133] 2026-01-08 13:45:02,571 >> Batch size = 16
0%| | 0/8 [00:00<?, ?it/s]
25%|██▌ | 2/8 [00:00<00:00, 9.17it/s]
38%|███▊ | 3/8 [00:00<00:00, 6.09it/s]
50%|█████ | 4/8 [00:00<00:00, 5.25it/s]
62%|██████▎ | 5/8 [00:00<00:00, 4.64it/s]
75%|███████▌ | 6/8 [00:01<00:00, 4.65it/s]
88%|████████▊ | 7/8 [00:01<00:00, 4.51it/s]
100%|██████████| 8/8 [00:01<00:00, 4.54it/s]
01/08/2026 13:45:04 - INFO - __main__ - CCoT Results
01/08/2026 13:45:04 - INFO - __main__ - pred 0: The answer is:150
01/08/2026 13:45:04 - INFO - __main__ - label 0: The answer is:300
01/08/2026 13:45:04 - INFO - __main__ - pred 1: The answer is:10
01/08/2026 13:45:04 - INFO - __main__ - label 1: The answer is:10
01/08/2026 13:45:04 - INFO - __main__ - pred 2: The answer is:1400
01/08/2026 13:45:04 - INFO - __main__ - label 2: The answer is:1400
01/08/2026 13:45:04 - INFO - __main__ - pred 3: The answer is:15
01/08/2026 13:45:04 - INFO - __main__ - label 3: The answer is:15
01/08/2026 13:45:04 - INFO - __main__ - pred 4: The answer is:240
01/08/2026 13:45:04 - INFO - __main__ - label 4: The answer is:240
01/08/2026 13:45:04 - INFO - __main__ - pred 5: The answer is:10
01/08/2026 13:45:04 - INFO - __main__ - label 5: The answer is:20
01/08/2026 13:45:04 - INFO - __main__ - pred 6: The answer is:10
01/08/2026 13:45:04 - INFO - __main__ - label 6: The answer is:10
01/08/2026 13:45:04 - INFO - __main__ - pred 7: The answer is:2.The
01/08/2026 13:45:04 - INFO - __main__ - label 7: The answer is:2
01/08/2026 13:45:04 - INFO - __main__ - pred 8: The answer is:22.The
01/08/2026 13:45:04 - INFO - __main__ - label 8: The answer is:25
01/08/2026 13:45:04 - INFO - __main__ - pred 9: The answer is:26.The
01/08/2026 13:45:04 - INFO - __main__ - label 9: The answer is:25
01/08/2026 13:45:04 - INFO - __main__ - CoT Results
01/08/2026 13:45:04 - INFO - __main__ - pred 0: 100
01/08/2026 13:45:04 - INFO - __main__ - label 0: 300
01/08/2026 13:45:04 - INFO - __main__ - pred 1: 10
01/08/2026 13:45:04 - INFO - __main__ - label 1: 10
01/08/2026 13:45:04 - INFO - __main__ - pred 2: 1400
01/08/2026 13:45:04 - INFO - __main__ - label 2: 1400
01/08/2026 13:45:04 - INFO - __main__ - pred 3: 15
01/08/2026 13:45:04 - INFO - __main__ - label 3: 15
01/08/2026 13:45:04 - INFO - __main__ - pred 4: 240
01/08/2026 13:45:04 - INFO - __main__ - label 4: 240
01/08/2026 13:45:04 - INFO - __main__ - pred 5: 20
01/08/2026 13:45:04 - INFO - __main__ - label 5: 20
01/08/2026 13:45:04 - INFO - __main__ - pred 6: 20
01/08/2026 13:45:04 - INFO - __main__ - label 6: 10
01/08/2026 13:45:04 - INFO - __main__ - pred 7: 2
01/08/2026 13:45:04 - INFO - __main__ - label 7: 2
01/08/2026 13:45:04 - INFO - __main__ - pred 8: 22.
01/08/2026 13:45:04 - INFO - __main__ - label 8: 25
01/08/2026 13:45:04 - INFO - __main__ - pred 9: 28.
01/08/2026 13:45:04 - INFO - __main__ - label 9: 25
100%|██████████| 8/8 [00:01<00:00, 4.12it/s]
***** eval metrics *****
epoch = 10.0
eval_ccot_exact_match = 0.3
eval_cot_exact_match = 0.632
eval_loss = 1.1757
eval_perplexity = 3.2404
eval_runtime = 0:00:02.19
eval_samples = 500
eval_samples_per_second = 227.743
eval_steps_per_second = 3.644
01/08/2026 13:45:04 - INFO - __main__ - *** Predict ***
[INFO|trainer.py:4128] 2026-01-08 13:45:04,806 >>
***** Running Prediction *****
[INFO|trainer.py:4130] 2026-01-08 13:45:04,806 >> Num examples = 1319
[INFO|trainer.py:4133] 2026-01-08 13:45:04,806 >> Batch size = 16
0%| | 0/21 [00:00<?, ?it/s]
10%|▉ | 2/21 [00:00<00:02, 9.13it/s]
14%|█▍ | 3/21 [00:00<00:02, 6.14it/s]
19%|█▉ | 4/21 [00:00<00:03, 5.43it/s]
24%|██▍ | 5/21 [00:00<00:03, 5.14it/s]
29%|██▊ | 6/21 [00:01<00:03, 4.95it/s]
33%|███▎ | 7/21 [00:01<00:02, 4.80it/s]
38%|███▊ | 8/21 [00:01<00:02, 4.58it/s]
43%|████▎ | 9/21 [00:01<00:02, 4.58it/s]
48%|████▊ | 10/21 [00:02<00:02, 4.60it/s]
52%|█████▏ | 11/21 [00:02<00:02, 4.57it/s]
57%|█████▋ | 12/21 [00:02<00:01, 4.57it/s]
62%|██████▏ | 13/21 [00:02<00:01, 4.48it/s]
67%|██████▋ | 14/21 [00:02<00:01, 4.42it/s]
71%|███████▏ | 15/21 [00:03<00:01, 4.37it/s]
76%|███████▌ | 16/21 [00:03<00:01, 4.15it/s]
81%|████████ | 17/21 [00:03<00:01, 3.93it/s]
86%|████████▌ | 18/21 [00:03<00:00, 3.99it/s]
90%|█████████ | 19/21 [00:04<00:00, 3.93it/s]
95%|█████████▌| 20/21 [00:04<00:00, 4.03it/s]
100%|██████████| 21/21 [00:04<00:00, 4.12it/s]
01/08/2026 13:45:09 - INFO - __main__ - CCoT Results
01/08/2026 13:45:09 - INFO - __main__ - pred 0: The answer is:18
01/08/2026 13:45:09 - INFO - __main__ - label 0: The answer is:18
01/08/2026 13:45:09 - INFO - __main__ - pred 1: The answer is:3
01/08/2026 13:45:09 - INFO - __main__ - label 1: The answer is:3
01/08/2026 13:45:09 - INFO - __main__ - pred 2: The answer is:25000
01/08/2026 13:45:09 - INFO - __main__ - label 2: The answer is:70000
01/08/2026 13:45:09 - INFO - __main__ - pred 3: The answer is:540
01/08/2026 13:45:09 - INFO - __main__ - label 3: The answer is:540
01/08/2026 13:45:09 - INFO - __main__ - pred 4: The answer is:25
01/08/2026 13:45:09 - INFO - __main__ - label 4: The answer is:20
01/08/2026 13:45:09 - INFO - __main__ - pred 5: The answer is:56
01/08/2026 13:45:09 - INFO - __main__ - label 5: The answer is:64
01/08/2026 13:45:09 - INFO - __main__ - pred 6: The answer is:260
01/08/2026 13:45:09 - INFO - __main__ - label 6: The answer is:260
01/08/2026 13:45:09 - INFO - __main__ - pred 7: The answer is:440
01/08/2026 13:45:09 - INFO - __main__ - label 7: The answer is:160
01/08/2026 13:45:09 - INFO - __main__ - pred 8: The answer is:270
01/08/2026 13:45:09 - INFO - __main__ - label 8: The answer is:45
01/08/2026 13:45:09 - INFO - __main__ - pred 9: The answer is:510
01/08/2026 13:45:09 - INFO - __main__ - label 9: The answer is:460
01/08/2026 13:45:10 - INFO - __main__ - CoT Results
01/08/2026 13:45:10 - INFO - __main__ - pred 0: <<16-3-4=9>><< answer is:18
01/08/2026 13:45:10 - INFO - __main__ - label 0: 18
01/08/2026 13:45:10 - INFO - __main__ - pred 1: 3
01/08/2026 13:45:10 - INFO - __main__ - label 1: 3
01/08/2026 13:45:10 - INFO - __main__ - pred 2: 70000
01/08/2026 13:45:10 - INFO - __main__ - label 2: 70000
01/08/2026 13:45:10 - INFO - __main__ - pred 3: 540
01/08/2026 13:45:10 - INFO - __main__ - label 3: 540
01/08/2026 13:45:10 - INFO - __main__ - pred 4: <<20*20=60>><< answer is:5
01/08/2026 13:45:10 - INFO - __main__ - label 4: 20
01/08/2026 13:45:10 - INFO - __main__ - pred 5: 64
01/08/2026 13:45:10 - INFO - __main__ - label 5: 64
01/08/2026 13:45:10 - INFO - __main__ - pred 6: 260
01/08/2026 13:45:10 - INFO - __main__ - label 6: 260
01/08/2026 13:45:10 - INFO - __main__ - pred 7: <<200*0*.01=80>><<200/2=40>><<200-2=100>><< answer is:240
01/08/2026 13:45:10 - INFO - __main__ - label 7: 160
01/08/2026 13:45:10 - INFO - __main__ - pred 8: 45
01/08/2026 13:45:10 - INFO - __main__ - label 8: 45
01/08/2026 13:45:10 - INFO - __main__ - pred 9: 460
01/08/2026 13:45:10 - INFO - __main__ - label 9: 460
100%|██████████| 21/21 [00:05<00:00, 3.93it/s]
***** test metrics *****
test_ccot_exact_match = 0.2707
test_cot_exact_match = 0.6361
test_loss = 1.1958
test_perplexity = 3.3062
test_runtime = 0:00:05.58
test_samples = 1319
test_samples_per_second = 236.114
test_steps_per_second = 3.759
[INFO|modelcard.py:449] 2026-01-08 13:45:10,451 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}, 'dataset': {'name': './data/whynlp-gsm8k-aug', 'type': './data/whynlp-gsm8k-aug'}}
gaotianhao1-ecda057c:3408:3494 [0] NCCL INFO [Service thread] Connection closed by localRank 3
gaotianhao1-ecda057c:3409:3493 [0] NCCL INFO [Service thread] Connection closed by localRank 2
I am attempting to reproduce the results using `Llama-3.2-1B-Instruct` on the GSM8K dataset. However, I am observing a significant discrepancy in the final `test_ccot_exact_match` score compared to the paper: I get ~27.07%, versus the reported ~53.35%.
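For clarity, this is how I am reading the exact-match numbers, based on the pred/label pairs printed in the log above. It is only my own sanity-check sketch (the answer extraction and normalization in the repository's metric code may well differ), but it is how I arrived at interpreting the 27.07%:

```python
# My own sanity-check sketch of "exact match", NOT the repository's metric code.
# Assumption: the score is (number of matching final answers) / (number of examples),
# after pulling the number out of strings like "The answer is:22.The".
import re

def extract_answer(text: str) -> str:
    """Extract the final numeric answer from a prediction or label string."""
    match = re.search(r"answer is:\s*([-\d.,]+)", text)
    candidate = match.group(1) if match else text
    return candidate.strip().rstrip(".").replace(",", "")

def exact_match(preds, labels):
    hits = sum(extract_answer(p) == extract_answer(l) for p, l in zip(preds, labels))
    return hits / len(labels)

# First three CCoT validation pairs from the log above:
preds = ["The answer is:150", "The answer is:10", "The answer is:22.The"]
labels = ["The answer is:300", "The answer is:10", "The answer is:25"]
print(exact_match(preds, labels))  # 0.333... -> only the second pair matches
```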
My training script is as follows:
I manually lowered `--learning_rate` to 8e-5. When I used the default 8e-4, I observed severe loss oscillation and abnormally high `grad_norm` values during the initial phase of training. Is this the key difference? If not, could you help me troubleshoot the problem?
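For reference, the learning rate is the only hyperparameter I touched; everything else is left at the repository defaults. In terms of standard `transformers.TrainingArguments` fields, my understanding of the effective configuration is roughly the following. This is a sketch reconstructed from the log above, not the actual launcher, and the values marked as assumptions are inferred rather than read from the script:

```python
# Rough reconstruction of the run configuration from the log, using standard
# HF Trainer fields. Only learning_rate was changed by me; the rest are either
# visible in the log or marked as assumptions.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs/pcot-llama1binst-lora-3-24",
    learning_rate=8e-5,              # lowered from the default 8e-4 to stop the early loss oscillation
    num_train_epochs=10,             # the log ends at epoch 10.0 after 30,130 steps
    per_device_train_batch_size=16,  # assumption: 385,620 samples * 10 epochs / 30,130 steps ~ 128 per step, i.e. 16 x 8 GPUs
    bf16=True,                       # the saved config reports torch_dtype bfloat16
    max_grad_norm=1.0,               # HF default clipping; the logged grad_norm stays below ~0.9 near the end
    save_total_limit=1,              # assumption: the log deletes the previous checkpoint after each save
    load_best_model_at_end=True,     # checkpoint-22000 is reloaded as the best model (score 0.3)
    metric_for_best_model="ccot_exact_match",  # assumption, based on that 0.3 matching eval_ccot_exact_match
)
```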