This is the implementation of the H^2Rec framework.
To ease environment configuration, I list the versions of my hardware and software:
You can configure the environment by running pip install -r requirements.txt.
To preprocess the datasets and obtain the LLM item embeddings, follow these steps:
- Put the raw dataset downloaded from the website into /data/<yelp/fashion/beauty>/raw/. The Yelp dataset can be obtained from https://www.yelp.com/dataset. The fashion and beauty datasets can be obtained from https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews.
- Run the preprocessing script data/data_process.py to filter cold-start users and items. Afterwards, you will get the ID file /data/<yelp/fashion/beauty>/handled/id_map.json and the interaction file /data/<yelp/fashion/beauty>/handled/inter_seq.txt.
- Convert the interaction file to the format used in this repo by running data/convert_inter.ipynb.
- To get the LLM embeddings for each dataset, run the Jupyter notebook /data/<yelp/fashion/beauty>/get_item_embedding.ipynb. Afterwards, you will get the LLM item embedding file /data/<yelp/fashion/beauty>/handled/itm_emb_np.pkl.
- For hot-start initialization, run the Jupyter notebook data/pca.ipynb to get the dimension-reduced LLM item embeddings, i.e., /data/<yelp/fashion/beauty>/handled/pca64_itm_emb_np.pkl.
- For SID generation, refer to 'generate_semantic_codes_RQVAE.py' under '/data/yelp/handled' to generate the corresponding semantic code JSON file and the embedding .pkl and .pth files.
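
The cold-start filtering step can be illustrated with a minimal sketch. It assumes interactions are (user, item) pairs and that users and items below a minimum interaction count are dropped repeatedly until nothing changes. The function names, the thresholds, and the ID map layout are hypothetical and are not taken from data/data_process.py.

```python
from collections import Counter


def filter_cold_start(interactions, min_user=5, min_item=5):
    """Repeatedly drop users and items with too few interactions.

    interactions: list of (user_id, item_id) pairs.
    min_user / min_item: hypothetical thresholds, not the repo's values.
    """
    kept = list(interactions)
    while True:
        user_cnt = Counter(u for u, _ in kept)
        item_cnt = Counter(i for _, i in kept)
        filtered = [
            (u, i) for u, i in kept
            if user_cnt[u] >= min_user and item_cnt[i] >= min_item
        ]
        # Stop once a full pass removes nothing.
        if len(filtered) == len(kept):
            return filtered
        kept = filtered


def build_id_map(interactions):
    """Map raw IDs to contiguous integers starting at 1 (assumed layout, 0 left free for padding)."""
    user2id, item2id = {}, {}
    for u, i in interactions:
        user2id.setdefault(u, len(user2id) + 1)
        item2id.setdefault(i, len(item2id) + 1)
    return {"user2id": user2id, "item2id": item2id}


# Toy data: user "c" and item "z" appear only once and are filtered out.
pairs = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("c", "z")]
kept_pairs = filter_cold_start(pairs, min_user=2, min_item=2)
id_map = build_id_map(kept_pairs)
```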
After that, run the main framework with main.py, setting the parameters as needed.
The overall structure of the framework is defined in 'DualTrisRec.py' under the 'model' folder.
The basic semantic code embeddings are constructed in 'RQVAEEmbedding.py', also under the 'model' folder.
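
To make the idea of semantic codes concrete, here is a minimal sketch of residual quantization, the assignment step used by RQ-VAE: each level picks the codebook vector nearest to the residual left by the previous level, and an embedding can be rebuilt by summing the selected vectors. The codebooks below are toy values, the function names are hypothetical, and 'RQVAEEmbedding.py' may combine the per-level embeddings differently.

```python
def nearest(codebook, vec):
    """Index of the codebook row closest to vec (squared Euclidean distance)."""
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(row, vec))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))


def residual_quantize(vec, codebooks):
    """Return one code per level; each level quantizes the previous level's residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        codes.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return codes


def codes_to_embedding(codes, codebooks):
    """Sum the selected codebook vectors, one per level (an assumed combination rule)."""
    emb = [0.0] * len(codebooks[0][0])
    for level, k in enumerate(codes):
        emb = [e + c for e, c in zip(emb, codebooks[level][k])]
    return emb


# Toy two-level codebooks with two 2-d entries each (illustrative values only).
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, -0.5]],
]
codes = residual_quantize([1.4, 0.6], toy_codebooks)
embedding = codes_to_embedding(codes, toy_codebooks)
```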
Since we change the traditional '1 to 1' InfoNCE to '1 to many' with our positive sample selection, we precompute the positive samples using 'precompute_positive_pairs_v2.py' to accelerate the loss calculation.
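
As a rough illustration of a '1 to many' InfoNCE loss with precomputed positives, the sketch below averages the log-probability over all positives of an anchor. This is one common multi-positive formulation and is an assumption here, so the loss in this repo may be defined differently. The lookup table `positives`, the function name, and the temperature value are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def multi_positive_info_nce(anchor, candidates, positive_idx, temperature=0.1):
    """InfoNCE with several positives: mean negative log-softmax over the positive indices."""
    logits = [dot(anchor, c) / temperature for c in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(logits[p] - log_denom for p in positive_idx) / len(positive_idx)


# Precomputed lookup: anchor index -> indices of its positives (hypothetical format).
positives = {0: [0, 2]}
candidates = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
anchor = [1.0, 0.0]

loss_many = multi_positive_info_nce(anchor, candidates, positives[0], temperature=1.0)
# With a single positive the function reduces to the usual '1 to 1' InfoNCE.
loss_one = multi_positive_info_nce(anchor, candidates[:2], [0], temperature=1.0)
```

Building the `positives` table once, before training, avoids repeating the positive selection in every batch, which is the motivation the text gives for the precomputation.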