The following is the complete log from the end of training, including the loss entries, one validation result, and the final test results.
{'eval_loss': 1.3073723316192627, 'eval_ccot_exact_match': 0.28, 'eval_cot_exact_match': 0.638, 'eval_runtime': 2.1683, 'eval_samples_per_second': 230.591, 'eval_steps_per_second': 3.689, 'epoch': 9.96}
100%|█████████▉| 30000/30130 [3:47:36<01:51, 1.17it/s]
100%|██████████| 8/8 [00:01<00:00, 4.58it/s]
[INFO|trainer.py:3812] 2026-01-08 13:42:55,416 >> Saving model checkpoint to outputs/pcot-llama1binst-lora-3-24/checkpoint-30000
[INFO|hub.py:363] 2026-01-08 13:42:55,594 >> Offline mode: forcing local_files_only=True
[INFO|configuration_utils.py:677] 2026-01-08 13:42:55,597 >> loading configuration file /media/cfs/products-understanding-nlp/gaotianhao/huggingface/Llama-3.2-1B-Instruct/config.json
[WARNING|configuration_utils.py:547] 2026-01-08 13:42:55,597 >> You are using a model of type llama to instantiate a model of type pccot-llama. This is not supported for all configurations of models and can yield errors.
[INFO|configuration_utils.py:746] 2026-01-08 13:42:55,598 >> Model config PCCoTLlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
],
"head_dim": 64,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 8192,
"loss_alpha": 1.0,
"loss_beta": 1.0,
"loss_gamma": 1.0,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "pccot-llama",
"num_attention_heads": 32,
"num_hidden_layers": 16,
"num_iterations": 6,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 32.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.3",
"use_cache": true,
"use_layerwise_std": true,
"use_projection": false,
"vocab_size": 128256
}
/usr/local/miniconda3/lib/python3.10/site-packages/peft/utils/save_and_load.py:300: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning.
warnings.warn(
[INFO|tokenization_utils_base.py:2646] 2026-01-08 13:43:00,169 >> tokenizer config file saved in outputs/pcot-llama1binst-lora-3-24/checkpoint-30000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2655] 2026-01-08 13:43:00,183 >> Special tokens file saved in outputs/pcot-llama1binst-lora-3-24/checkpoint-30000/special_tokens_map.json
[INFO|trainer.py:3904] 2026-01-08 13:43:02,875 >> Deleting older checkpoint [outputs/pcot-llama1binst-lora-3-24/checkpoint-29000] due to args.save_total_limit
100%|█████████▉| 30001/30130 [3:47:44<08:13, 3.83s/it]
100%|█████████▉| 30002/30130 [3:47:45<06:13, 2.92s/it]
100%|█████████▉| 30003/30130 [3:47:46<04:50, 2.29s/it]
100%|█████████▉| 30004/30130 [3:47:47<03:55, 1.87s/it]
100%|█████████▉| 30005/30130 [3:47:48<03:13, 1.54s/it]
100%|█████████▉| 30006/30130 [3:47:49<02:49, 1.37s/it]
100%|█████████▉| 30007/30130 [3:47:50<02:28, 1.21s/it]
100%|█████████▉| 30008/30130 [3:47:50<02:10, 1.07s/it]
100%|█████████▉| 30009/30130 [3:47:51<01:57, 1.03it/s]
100%|█████████▉| 30010/30130 [3:47:52<01:50, 1.08it/s]
100%|█████████▉| 30011/30130 [3:47:53<01:44, 1.14it/s]
100%|█████████▉| 30012/30130 [3:47:53<01:39, 1.19it/s]
100%|█████████▉| 30013/30130 [3:47:54<01:36, 1.22it/s]
100%|█████████▉| 30014/30130 [3:47:55<01:33, 1.24it/s]
100%|█████████▉| 30015/30130 [3:47:56<01:32, 1.25it/s]
100%|█████████▉| 30016/30130 [3:47:56<01:27, 1.30it/s]
100%|█████████▉| 30017/30130 [3:47:57<01:27, 1.29it/s]
100%|█████████▉| 30018/30130 [3:47:58<01:28, 1.26it/s]
100%|█████████▉| 30019/30130 [3:47:59<01:28, 1.26it/s]
100%|█████████▉| 30020/30130 [3:48:00<01:27, 1.26it/s]
{'loss': 0.2209, 'grad_norm': 0.8483083248138428, 'learning_rate': 2.7962197024145042e-09, 'epoch': 9.96}
100%|█████████▉| 30020/30130 [3:48:00<01:27, 1.26it/s]
100%|█████████▉| 30021/30130 [3:48:00<01:28, 1.24it/s]
100%|█████████▉| 30022/30130 [3:48:01<01:28, 1.22it/s]
100%|█████████▉| 30023/30130 [3:48:02<01:30, 1.18it/s]
100%|█████████▉| 30024/30130 [3:48:03<01:29, 1.19it/s]
100%|█████████▉| 30025/30130 [3:48:04<01:27, 1.20it/s]
100%|█████████▉| 30026/30130 [3:48:05<01:26, 1.20it/s]
100%|█████████▉| 30027/30130 [3:48:05<01:24, 1.22it/s]
100%|█████████▉| 30028/30130 [3:48:06<01:23, 1.22it/s]
100%|█████████▉| 30029/30130 [3:48:07<01:24, 1.19it/s]
100%|█████████▉| 30030/30130 [3:48:08<01:24, 1.18it/s]
100%|█████████▉| 30031/30130 [3:48:09<01:23, 1.19it/s]
100%|█████████▉| 30032/30130 [3:48:10<01:23, 1.17it/s]
100%|█████████▉| 30033/30130 [3:48:11<01:21, 1.19it/s]
100%|█████████▉| 30034/30130 [3:48:11<01:22, 1.17it/s]
100%|█████████▉| 30035/30130 [3:48:12<01:18, 1.22it/s]
100%|█████████▉| 30036/30130 [3:48:13<01:15, 1.24it/s]
100%|█████████▉| 30037/30130 [3:48:14<01:15, 1.23it/s]
100%|█████████▉| 30038/30130 [3:48:15<01:18, 1.18it/s]
100%|█████████▉| 30039/30130 [3:48:16<01:17, 1.17it/s]
100%|█████████▉| 30040/30130 [3:48:16<01:14, 1.21it/s]
{'loss': 0.2132, 'grad_norm': 0.835830807685852, 'learning_rate': 1.871856762476476e-09, 'epoch': 9.97}
100%|█████████▉| 30040/30130 [3:48:16<01:14, 1.21it/s]
100%|█████████▉| 30041/30130 [3:48:17<01:14, 1.19it/s]
100%|█████████▉| 30042/30130 [3:48:18<01:12, 1.22it/s]
100%|█████████▉| 30043/30130 [3:48:19<01:12, 1.20it/s]
100%|█████████▉| 30044/30130 [3:48:20<01:10, 1.23it/s]
100%|█████████▉| 30045/30130 [3:48:20<01:08, 1.23it/s]
100%|█████████▉| 30046/30130 [3:48:21<01:09, 1.21it/s]
100%|█████████▉| 30047/30130 [3:48:22<01:08, 1.21it/s]
100%|█████████▉| 30048/30130 [3:48:23<01:06, 1.23it/s]
100%|█████████▉| 30049/30130 [3:48:24<01:06, 1.22it/s]
100%|█████████▉| 30050/30130 [3:48:25<01:05, 1.22it/s]
100%|█████████▉| 30051/30130 [3:48:25<01:06, 1.19it/s]
100%|█████████▉| 30052/30130 [3:48:26<01:05, 1.19it/s]
100%|█████████▉| 30053/30130 [3:48:27<01:07, 1.15it/s]
100%|█████████▉| 30054/30130 [3:48:28<01:08, 1.11it/s]
100%|█████████▉| 30055/30130 [3:48:29<01:05, 1.14it/s]
100%|█████████▉| 30056/30130 [3:48:30<01:11, 1.03it/s]
100%|█████████▉| 30057/30130 [3:48:31<01:09, 1.04it/s]
100%|█████████▉| 30058/30130 [3:48:32<01:10, 1.02it/s]
100%|█████████▉| 30059/30130 [3:48:33<01:05, 1.09it/s]
100%|█████████▉| 30060/30130 [3:48:34<01:01, 1.14it/s]
{'loss': 0.217, 'grad_norm': 0.6908845901489258, 'learning_rate': 1.1323612836955379e-09, 'epoch': 9.98}
100%|█████████▉| 30060/30130 [3:48:34<01:01, 1.14it/s]
100%|█████████▉| 30061/30130 [3:48:35<00:58, 1.17it/s]
100%|█████████▉| 30062/30130 [3:48:35<00:58, 1.17it/s]
100%|█████████▉| 30063/30130 [3:48:36<00:57, 1.16it/s]
100%|█████████▉| 30064/30130 [3:48:37<00:57, 1.16it/s]
100%|█████████▉| 30065/30130 [3:48:38<00:54, 1.19it/s]
100%|█████████▉| 30066/30130 [3:48:39<00:53, 1.20it/s]
100%|█████████▉| 30067/30130 [3:48:40<00:55, 1.13it/s]
100%|█████████▉| 30068/30130 [3:48:41<00:54, 1.13it/s]
100%|█████████▉| 30069/30130 [3:48:41<00:53, 1.15it/s]
100%|█████████▉| 30070/30130 [3:48:42<00:50, 1.18it/s]
100%|█████████▉| 30071/30130 [3:48:43<00:50, 1.18it/s]
100%|█████████▉| 30072/30130 [3:48:44<00:48, 1.21it/s]
100%|█████████▉| 30073/30130 [3:48:45<00:51, 1.11it/s]
100%|█████████▉| 30074/30130 [3:48:46<00:50, 1.11it/s]
100%|█████████▉| 30075/30130 [3:48:47<00:49, 1.11it/s]
100%|█████████▉| 30076/30130 [3:48:47<00:46, 1.17it/s]
100%|█████████▉| 30077/30130 [3:48:48<00:44, 1.18it/s]
100%|█████████▉| 30078/30130 [3:48:49<00:43, 1.21it/s]
100%|█████████▉| 30079/30130 [3:48:50<00:41, 1.24it/s]
100%|█████████▉| 30080/30130 [3:48:51<00:41, 1.21it/s]
{'loss': 0.2182, 'grad_norm': 0.8773943185806274, 'learning_rate': 5.777366839465614e-10, 'epoch': 9.98}
100%|█████████▉| 30080/30130 [3:48:51<00:41, 1.21it/s]
100%|█████████▉| 30081/30130 [3:48:52<00:40, 1.22it/s]
100%|█████████▉| 30082/30130 [3:48:52<00:38, 1.23it/s]
100%|█████████▉| 30083/30130 [3:48:53<00:39, 1.18it/s]
100%|█████████▉| 30084/30130 [3:48:54<00:37, 1.21it/s]
100%|█████████▉| 30085/30130 [3:48:55<00:36, 1.23it/s]
100%|█████████▉| 30086/30130 [3:48:56<00:35, 1.25it/s]
100%|█████████▉| 30087/30130 [3:48:56<00:35, 1.22it/s]
100%|█████████▉| 30088/30130 [3:48:57<00:33, 1.24it/s]
100%|█████████▉| 30089/30130 [3:48:58<00:34, 1.18it/s]
100%|█████████▉| 30090/30130 [3:48:59<00:32, 1.22it/s]
100%|█████████▉| 30091/30130 [3:49:00<00:33, 1.17it/s]
100%|█████████▉| 30092/30130 [3:49:01<00:32, 1.18it/s]
100%|█████████▉| 30093/30130 [3:49:01<00:30, 1.21it/s]
100%|█████████▉| 30094/30130 [3:49:02<00:30, 1.20it/s]
100%|█████████▉| 30095/30130 [3:49:03<00:28, 1.21it/s]
100%|█████████▉| 30096/30130 [3:49:04<00:28, 1.21it/s]
100%|█████████▉| 30097/30130 [3:49:05<00:26, 1.24it/s]
100%|█████████▉| 30098/30130 [3:49:06<00:25, 1.25it/s]
100%|█████████▉| 30099/30130 [3:49:06<00:24, 1.24it/s]
100%|█████████▉| 30100/30130 [3:49:07<00:24, 1.22it/s]
{'loss': 0.2089, 'grad_norm': 0.8249523639678955, 'learning_rate': 2.0798552665013404e-10, 'epoch': 9.99}
100%|█████████▉| 30100/30130 [3:49:07<00:24, 1.22it/s]
100%|█████████▉| 30101/30130 [3:49:08<00:23, 1.24it/s]
100%|█████████▉| 30102/30130 [3:49:09<00:22, 1.24it/s]
100%|█████████▉| 30103/30130 [3:49:10<00:21, 1.25it/s]
100%|█████████▉| 30104/30130 [3:49:10<00:21, 1.22it/s]
100%|█████████▉| 30105/30130 [3:49:11<00:20, 1.25it/s]
100%|█████████▉| 30106/30130 [3:49:12<00:19, 1.24it/s]
100%|█████████▉| 30107/30130 [3:49:13<00:18, 1.25it/s]
100%|█████████▉| 30108/30130 [3:49:14<00:17, 1.24it/s]
100%|█████████▉| 30109/30130 [3:49:14<00:16, 1.26it/s]
100%|█████████▉| 30110/30130 [3:49:15<00:16, 1.24it/s]
100%|█████████▉| 30111/30130 [3:49:16<00:15, 1.26it/s]
100%|█████████▉| 30112/30130 [3:49:17<00:14, 1.26it/s]
100%|█████████▉| 30113/30130 [3:49:18<00:13, 1.26it/s]
100%|█████████▉| 30114/30130 [3:49:18<00:12, 1.29it/s]
100%|█████████▉| 30115/30130 [3:49:19<00:11, 1.27it/s]
100%|█████████▉| 30116/30130 [3:49:20<00:11, 1.24it/s]
100%|█████████▉| 30117/30130 [3:49:21<00:10, 1.20it/s]
100%|█████████▉| 30118/30130 [3:49:22<00:09, 1.22it/s]
100%|█████████▉| 30119/30130 [3:49:22<00:08, 1.27it/s]
100%|█████████▉| 30120/30130 [3:49:23<00:07, 1.28it/s]
{'loss': 0.2235, 'grad_norm': 0.7961286902427673, 'learning_rate': 2.3109520763675565e-11, 'epoch': 10.0}
100%|█████████▉| 30120/30130 [3:49:23<00:07, 1.28it/s]
100%|█████████▉| 30121/30130 [3:49:24<00:06, 1.30it/s]
100%|█████████▉| 30122/30130 [3:49:25<00:06, 1.26it/s]
100%|█████████▉| 30123/30130 [3:49:26<00:05, 1.23it/s]
100%|█████████▉| 30124/30130 [3:49:26<00:04, 1.23it/s]
100%|█████████▉| 30125/30130 [3:49:27<00:04, 1.24it/s]
100%|█████████▉| 30126/30130 [3:49:28<00:03, 1.25it/s]
100%|█████████▉| 30127/30130 [3:49:29<00:02, 1.20it/s]
100%|█████████▉| 30128/30130 [3:49:30<00:01, 1.23it/s]
100%|█████████▉| 30129/30130 [3:49:31<00:00, 1.19it/s]
100%|██████████| 30130/30130 [3:49:31<00:00, 1.15it/s]
[INFO|trainer.py:3812] 2026-01-08 13:44:50,982 >> Saving model checkpoint to outputs/pcot-llama1binst-lora-3-24/checkpoint-30130
/usr/local/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
/usr/local/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
/usr/local/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[INFO|hub.py:363] 2026-01-08 13:44:51,060 >> Offline mode: forcing local_files_only=True
[INFO|configuration_utils.py:677] 2026-01-08 13:44:51,063 >> loading configuration file /media/cfs/products-understanding-nlp/gaotianhao/huggingface/Llama-3.2-1B-Instruct/config.json
[WARNING|configuration_utils.py:547] 2026-01-08 13:44:51,063 >> You are using a model of type llama to instantiate a model of type pccot-llama. This is not supported for all configurations of models and can yield errors.
[INFO|configuration_utils.py:746] 2026-01-08 13:44:51,064 >> Model config PCCoTLlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
],
"head_dim": 64,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 8192,
"loss_alpha": 1.0,
"loss_beta": 1.0,
"loss_gamma": 1.0,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "pccot-llama",
"num_attention_heads": 32,
"num_hidden_layers": 16,
"num_iterations": 6,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 32.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.3",
"use_cache": true,
"use_layerwise_std": true,
"use_projection": false,
"vocab_size": 128256
}
/usr/local/miniconda3/lib/python3.10/site-packages/peft/utils/save_and_load.py:300: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning.
warnings.warn(
[INFO|tokenization_utils_base.py:2646] 2026-01-08 13:44:55,466 >> tokenizer config file saved in outputs/pcot-llama1binst-lora-3-24/checkpoint-30130/tokenizer_config.json
[INFO|tokenization_utils_base.py:2655] 2026-01-08 13:44:55,480 >> Special tokens file saved in outputs/pcot-llama1binst-lora-3-24/checkpoint-30130/special_tokens_map.json
[INFO|trainer.py:3904] 2026-01-08 13:44:57,627 >> Deleting older checkpoint [outputs/pcot-llama1binst-lora-3-24/checkpoint-30000] due to args.save_total_limit
[INFO|trainer.py:2591] 2026-01-08 13:44:57,822 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
/usr/local/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[INFO|trainer.py:2829] 2026-01-08 13:44:57,822 >> Loading best model from outputs/pcot-llama1binst-lora-3-24/checkpoint-22000 (score: 0.3).
{'train_runtime': 13779.1716, 'train_samples_per_second': 279.857, 'train_steps_per_second': 2.187, 'train_loss': 0.14304465817091594, 'epoch': 10.0}
100%|██████████| 30130/30130 [3:49:39<00:00, 1.15it/s]
[INFO|trainer.py:2632] 2026-01-08 13:44:58,232 >> Deleting older checkpoint [outputs/pcot-llama1binst-lora-3-24/checkpoint-30130] due to args.save_total_limit
100%|██████████| 30130/30130 [3:49:39<00:00, 2.19it/s]
[INFO|trainer.py:3812] 2026-01-08 13:44:58,427 >> Saving model checkpoint to outputs/pcot-llama1binst-lora-3-24
[INFO|hub.py:363] 2026-01-08 13:44:58,469 >> Offline mode: forcing local_files_only=True
[INFO|configuration_utils.py:677] 2026-01-08 13:44:58,472 >> loading configuration file /media/cfs/products-understanding-nlp/gaotianhao/huggingface/Llama-3.2-1B-Instruct/config.json
[WARNING|configuration_utils.py:547] 2026-01-08 13:44:58,472 >> You are using a model of type llama to instantiate a model of type pccot-llama. This is not supported for all configurations of models and can yield errors.
[INFO|configuration_utils.py:746] 2026-01-08 13:44:58,472 >> Model config PCCoTLlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
],
"head_dim": 64,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 8192,
"loss_alpha": 1.0,
"loss_beta": 1.0,
"loss_gamma": 1.0,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "pccot-llama",
"num_attention_heads": 32,
"num_hidden_layers": 16,
"num_iterations": 6,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 32.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.3",
"use_cache": true,
"use_layerwise_std": true,
"use_projection": false,
"vocab_size": 128256
}
/usr/local/miniconda3/lib/python3.10/site-packages/peft/utils/save_and_load.py:300: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning.
warnings.warn(
[INFO|tokenization_utils_base.py:2646] 2026-01-08 13:45:02,238 >> tokenizer config file saved in outputs/pcot-llama1binst-lora-3-24/tokenizer_config.json
[INFO|tokenization_utils_base.py:2655] 2026-01-08 13:45:02,252 >> Special tokens file saved in outputs/pcot-llama1binst-lora-3-24/special_tokens_map.json
[INFO|configuration_utils.py:414] 2026-01-08 13:45:02,487 >> Configuration saved in outputs/pcot-llama1binst-lora-3-24/config.json
***** train metrics *****
epoch = 10.0
total_flos = 2608683236GF
train_loss = 0.143
train_runtime = 3:49:39.17
train_samples = 385620
train_samples_per_second = 279.857
train_steps_per_second = 2.187
01/08/2026 13:45:02 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:4128] 2026-01-08 13:45:02,571 >>
***** Running Evaluation *****
[INFO|trainer.py:4130] 2026-01-08 13:45:02,571 >> Num examples = 500
[INFO|trainer.py:4133] 2026-01-08 13:45:02,571 >> Batch size = 16
0%| | 0/8 [00:00<?, ?it/s]
25%|██▌ | 2/8 [00:00<00:00, 9.17it/s]
38%|███▊ | 3/8 [00:00<00:00, 6.09it/s]
50%|█████ | 4/8 [00:00<00:00, 5.25it/s]
62%|██████▎ | 5/8 [00:00<00:00, 4.64it/s]
75%|███████▌ | 6/8 [00:01<00:00, 4.65it/s]
88%|████████▊ | 7/8 [00:01<00:00, 4.51it/s]
100%|██████████| 8/8 [00:01<00:00, 4.54it/s]
01/08/2026 13:45:04 - INFO - __main__ - CCoT Results
01/08/2026 13:45:04 - INFO - __main__ - pred 0: The answer is:150
01/08/2026 13:45:04 - INFO - __main__ - label 0: The answer is:300
01/08/2026 13:45:04 - INFO - __main__ - pred 1: The answer is:10
01/08/2026 13:45:04 - INFO - __main__ - label 1: The answer is:10
01/08/2026 13:45:04 - INFO - __main__ - pred 2: The answer is:1400
01/08/2026 13:45:04 - INFO - __main__ - label 2: The answer is:1400
01/08/2026 13:45:04 - INFO - __main__ - pred 3: The answer is:15
01/08/2026 13:45:04 - INFO - __main__ - label 3: The answer is:15
01/08/2026 13:45:04 - INFO - __main__ - pred 4: The answer is:240
01/08/2026 13:45:04 - INFO - __main__ - label 4: The answer is:240
01/08/2026 13:45:04 - INFO - __main__ - pred 5: The answer is:10
01/08/2026 13:45:04 - INFO - __main__ - label 5: The answer is:20
01/08/2026 13:45:04 - INFO - __main__ - pred 6: The answer is:10
01/08/2026 13:45:04 - INFO - __main__ - label 6: The answer is:10
01/08/2026 13:45:04 - INFO - __main__ - pred 7: The answer is:2.The
01/08/2026 13:45:04 - INFO - __main__ - label 7: The answer is:2
01/08/2026 13:45:04 - INFO - __main__ - pred 8: The answer is:22.The
01/08/2026 13:45:04 - INFO - __main__ - label 8: The answer is:25
01/08/2026 13:45:04 - INFO - __main__ - pred 9: The answer is:26.The
01/08/2026 13:45:04 - INFO - __main__ - label 9: The answer is:25
01/08/2026 13:45:04 - INFO - __main__ - CoT Results
01/08/2026 13:45:04 - INFO - __main__ - pred 0: 100
01/08/2026 13:45:04 - INFO - __main__ - label 0: 300
01/08/2026 13:45:04 - INFO - __main__ - pred 1: 10
01/08/2026 13:45:04 - INFO - __main__ - label 1: 10
01/08/2026 13:45:04 - INFO - __main__ - pred 2: 1400
01/08/2026 13:45:04 - INFO - __main__ - label 2: 1400
01/08/2026 13:45:04 - INFO - __main__ - pred 3: 15
01/08/2026 13:45:04 - INFO - __main__ - label 3: 15
01/08/2026 13:45:04 - INFO - __main__ - pred 4: 240
01/08/2026 13:45:04 - INFO - __main__ - label 4: 240
01/08/2026 13:45:04 - INFO - __main__ - pred 5: 20
01/08/2026 13:45:04 - INFO - __main__ - label 5: 20
01/08/2026 13:45:04 - INFO - __main__ - pred 6: 20
01/08/2026 13:45:04 - INFO - __main__ - label 6: 10
01/08/2026 13:45:04 - INFO - __main__ - pred 7: 2
01/08/2026 13:45:04 - INFO - __main__ - label 7: 2
01/08/2026 13:45:04 - INFO - __main__ - pred 8: 22.
01/08/2026 13:45:04 - INFO - __main__ - label 8: 25
01/08/2026 13:45:04 - INFO - __main__ - pred 9: 28.
01/08/2026 13:45:04 - INFO - __main__ - label 9: 25
100%|██████████| 8/8 [00:01<00:00, 4.12it/s]
***** eval metrics *****
epoch = 10.0
eval_ccot_exact_match = 0.3
eval_cot_exact_match = 0.632
eval_loss = 1.1757
eval_perplexity = 3.2404
eval_runtime = 0:00:02.19
eval_samples = 500
eval_samples_per_second = 227.743
eval_steps_per_second = 3.644
01/08/2026 13:45:04 - INFO - __main__ - *** Predict ***
[INFO|trainer.py:4128] 2026-01-08 13:45:04,806 >>
***** Running Prediction *****
[INFO|trainer.py:4130] 2026-01-08 13:45:04,806 >> Num examples = 1319
[INFO|trainer.py:4133] 2026-01-08 13:45:04,806 >> Batch size = 16
0%| | 0/21 [00:00<?, ?it/s]
10%|▉ | 2/21 [00:00<00:02, 9.13it/s]
14%|█▍ | 3/21 [00:00<00:02, 6.14it/s]
19%|█▉ | 4/21 [00:00<00:03, 5.43it/s]
24%|██▍ | 5/21 [00:00<00:03, 5.14it/s]
29%|██▊ | 6/21 [00:01<00:03, 4.95it/s]
33%|███▎ | 7/21 [00:01<00:02, 4.80it/s]
38%|███▊ | 8/21 [00:01<00:02, 4.58it/s]
43%|████▎ | 9/21 [00:01<00:02, 4.58it/s]
48%|████▊ | 10/21 [00:02<00:02, 4.60it/s]
52%|█████▏ | 11/21 [00:02<00:02, 4.57it/s]
57%|█████▋ | 12/21 [00:02<00:01, 4.57it/s]
62%|██████▏ | 13/21 [00:02<00:01, 4.48it/s]
67%|██████▋ | 14/21 [00:02<00:01, 4.42it/s]
71%|███████▏ | 15/21 [00:03<00:01, 4.37it/s]
76%|███████▌ | 16/21 [00:03<00:01, 4.15it/s]
81%|████████ | 17/21 [00:03<00:01, 3.93it/s]
86%|████████▌ | 18/21 [00:03<00:00, 3.99it/s]
90%|█████████ | 19/21 [00:04<00:00, 3.93it/s]
95%|█████████▌| 20/21 [00:04<00:00, 4.03it/s]
100%|██████████| 21/21 [00:04<00:00, 4.12it/s]
01/08/2026 13:45:09 - INFO - __main__ - CCoT Results
01/08/2026 13:45:09 - INFO - __main__ - pred 0: The answer is:18
01/08/2026 13:45:09 - INFO - __main__ - label 0: The answer is:18
01/08/2026 13:45:09 - INFO - __main__ - pred 1: The answer is:3
01/08/2026 13:45:09 - INFO - __main__ - label 1: The answer is:3
01/08/2026 13:45:09 - INFO - __main__ - pred 2: The answer is:25000
01/08/2026 13:45:09 - INFO - __main__ - label 2: The answer is:70000
01/08/2026 13:45:09 - INFO - __main__ - pred 3: The answer is:540
01/08/2026 13:45:09 - INFO - __main__ - label 3: The answer is:540
01/08/2026 13:45:09 - INFO - __main__ - pred 4: The answer is:25
01/08/2026 13:45:09 - INFO - __main__ - label 4: The answer is:20
01/08/2026 13:45:09 - INFO - __main__ - pred 5: The answer is:56
01/08/2026 13:45:09 - INFO - __main__ - label 5: The answer is:64
01/08/2026 13:45:09 - INFO - __main__ - pred 6: The answer is:260
01/08/2026 13:45:09 - INFO - __main__ - label 6: The answer is:260
01/08/2026 13:45:09 - INFO - __main__ - pred 7: The answer is:440
01/08/2026 13:45:09 - INFO - __main__ - label 7: The answer is:160
01/08/2026 13:45:09 - INFO - __main__ - pred 8: The answer is:270
01/08/2026 13:45:09 - INFO - __main__ - label 8: The answer is:45
01/08/2026 13:45:09 - INFO - __main__ - pred 9: The answer is:510
01/08/2026 13:45:09 - INFO - __main__ - label 9: The answer is:460
01/08/2026 13:45:10 - INFO - __main__ - CoT Results
01/08/2026 13:45:10 - INFO - __main__ - pred 0: <<16-3-4=9>><< answer is:18
01/08/2026 13:45:10 - INFO - __main__ - label 0: 18
01/08/2026 13:45:10 - INFO - __main__ - pred 1: 3
01/08/2026 13:45:10 - INFO - __main__ - label 1: 3
01/08/2026 13:45:10 - INFO - __main__ - pred 2: 70000
01/08/2026 13:45:10 - INFO - __main__ - label 2: 70000
01/08/2026 13:45:10 - INFO - __main__ - pred 3: 540
01/08/2026 13:45:10 - INFO - __main__ - label 3: 540
01/08/2026 13:45:10 - INFO - __main__ - pred 4: <<20*20=60>><< answer is:5
01/08/2026 13:45:10 - INFO - __main__ - label 4: 20
01/08/2026 13:45:10 - INFO - __main__ - pred 5: 64
01/08/2026 13:45:10 - INFO - __main__ - label 5: 64
01/08/2026 13:45:10 - INFO - __main__ - pred 6: 260
01/08/2026 13:45:10 - INFO - __main__ - label 6: 260
01/08/2026 13:45:10 - INFO - __main__ - pred 7: <<200*0*.01=80>><<200/2=40>><<200-2=100>><< answer is:240
01/08/2026 13:45:10 - INFO - __main__ - label 7: 160
01/08/2026 13:45:10 - INFO - __main__ - pred 8: 45
01/08/2026 13:45:10 - INFO - __main__ - label 8: 45
01/08/2026 13:45:10 - INFO - __main__ - pred 9: 460
01/08/2026 13:45:10 - INFO - __main__ - label 9: 460
100%|██████████| 21/21 [00:05<00:00, 3.93it/s]
***** test metrics *****
test_ccot_exact_match = 0.2707
test_cot_exact_match = 0.6361
test_loss = 1.1958
test_perplexity = 3.3062
test_runtime = 0:00:05.58
test_samples = 1319
test_samples_per_second = 236.114
test_steps_per_second = 3.759
[INFO|modelcard.py:449] 2026-01-08 13:45:10,451 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}, 'dataset': {'name': './data/whynlp-gsm8k-aug', 'type': './data/whynlp-gsm8k-aug'}}
gaotianhao1-ecda057c:3408:3494 [0] NCCL INFO [Service thread] Connection closed by localRank 3
gaotianhao1-ecda057c:3409:3493 [0] NCCL INFO [Service thread] Connection closed by localRank 2
I am attempting to reproduce the results using `Llama-3.2-1B-Instruct` on the GSM8K dataset. However, I am observing a significant discrepancy in the final `test_ccot_exact_match` score compared to the paper: I get ~27.07%, versus the reported ~53.35%.
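For clarity, this is how I am reading the exact-match numbers, based on the pred/label pairs printed in the log above. It is only my own sanity-check sketch (the answer extraction and normalization in the repository's metric code may well differ), but it is how I arrived at interpreting the 27.07%:

```python
# My own sanity-check sketch of "exact match", NOT the repository's metric code.
# Assumption: the score is (number of matching final answers) / (number of examples),
# after pulling the number out of strings like "The answer is:22.The".
import re

def extract_answer(text: str) -> str:
    """Extract the final numeric answer from a prediction or label string."""
    match = re.search(r"answer is:\s*([-\d.,]+)", text)
    candidate = match.group(1) if match else text
    return candidate.strip().rstrip(".").replace(",", "")

def exact_match(preds, labels):
    hits = sum(extract_answer(p) == extract_answer(l) for p, l in zip(preds, labels))
    return hits / len(labels)

# First three CCoT validation pairs from the log above:
preds = ["The answer is:150", "The answer is:10", "The answer is:22.The"]
labels = ["The answer is:300", "The answer is:10", "The answer is:25"]
print(exact_match(preds, labels))  # 0.333... -> only the second pair matches
```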
My training script is as follows:
I manually lowered `--learning_rate` to 8e-5. When I used the default 8e-4, I observed severe loss oscillation and abnormally high `grad_norm` values during the initial phase of training. Is this the key difference? If not, could you help me troubleshoot the problem?
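For reference, the learning rate is the only hyperparameter I touched; everything else is left at the repository defaults. In terms of standard `transformers.TrainingArguments` fields, my understanding of the effective configuration is roughly the following. This is a sketch reconstructed from the log above, not the actual launcher, and the values marked as assumptions are inferred rather than read from the script:

```python
# Rough reconstruction of the run configuration from the log, using standard
# HF Trainer fields. Only learning_rate was changed by me; the rest are either
# visible in the log or marked as assumptions.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs/pcot-llama1binst-lora-3-24",
    learning_rate=8e-5,              # lowered from the default 8e-4 to stop the early loss oscillation
    num_train_epochs=10,             # the log ends at epoch 10.0 after 30,130 steps
    per_device_train_batch_size=16,  # assumption: 385,620 samples * 10 epochs / 30,130 steps ~ 128 per step, i.e. 16 x 8 GPUs
    bf16=True,                       # the saved config reports torch_dtype bfloat16
    max_grad_norm=1.0,               # HF default clipping; the logged grad_norm stays below ~0.9 near the end
    save_total_limit=1,              # assumption: the log deletes the previous checkpoint after each save
    load_best_model_at_end=True,     # checkpoint-22000 is reloaded as the best model (score 0.3)
    metric_for_best_model="ccot_exact_match",  # assumption, based on that 0.3 matching eval_ccot_exact_match
)
```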