first commit

master
abel 3 years ago
parent 59184662de
commit 9a9cfd5444

@@ -0,0 +1,170 @@
SUMMARY
================================================================================
These files contain 1,000,209 anonymous ratings of approximately 3,900 movies
made by 6,040 MovieLens users who joined MovieLens in 2000.
USAGE LICENSE
================================================================================
Neither the University of Minnesota nor any of the researchers
involved can guarantee the correctness of the data, its suitability
for any particular purpose, or the validity of results based on the
use of the data set. The data set may be used for any research
purposes under the following conditions:
* The user may not state or imply any endorsement from the
University of Minnesota or the GroupLens Research Group.
* The user must acknowledge the use of the data set in
publications resulting from the use of the data set
(see below for citation information).
* The user may not redistribute the data without separate
permission.
* The user may not use this information for any commercial or
revenue-bearing purposes without first obtaining permission
from a faculty member of the GroupLens Research Project at the
University of Minnesota.
If you have any further questions or comments, please contact GroupLens
<grouplens-info@cs.umn.edu>.
CITATION
================================================================================
To acknowledge use of the dataset in publications, please cite the following
paper:
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History
and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4,
Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872
ACKNOWLEDGEMENTS
================================================================================
Thanks to Shyong Lam and Jon Herlocker for cleaning up and generating the data
set.
FURTHER INFORMATION ABOUT THE GROUPLENS RESEARCH PROJECT
================================================================================
The GroupLens Research Project is a research group in the Department of
Computer Science and Engineering at the University of Minnesota. Members of
the GroupLens Research Project are involved in many research projects related
to the fields of information filtering, collaborative filtering, and
recommender systems. The project is led by professors John Riedl and Joseph
Konstan. The project began to explore automated collaborative filtering in
1992, but is most well known for its worldwide trial of an automated
collaborative filtering system for Usenet news in 1996. Since then the project
has expanded its scope to research overall information filtering solutions,
integrating content-based methods as well as improving current collaborative
filtering technology.
Further information on the GroupLens Research project, including research
publications, can be found at the following web site:
http://www.grouplens.org/
GroupLens Research currently operates a movie recommender based on
collaborative filtering:
http://www.movielens.org/
RATINGS FILE DESCRIPTION
================================================================================
All ratings are contained in the file "ratings.dat" and are in the
following format:
UserID::MovieID::Rating::Timestamp
- UserIDs range between 1 and 6040
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
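For reference, a minimal pandas sketch for loading this file (a hedged
example, not part of the original distribution; the "::" separator requires
pandas' python engine):

    import pandas as pd

    ratings = pd.read_csv("ratings.dat", sep="::", engine="python", header=None,
                          names=["UserID", "MovieID", "Rating", "Timestamp"])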
USERS FILE DESCRIPTION
================================================================================
User information is in the file "users.dat" and is in the following
format:
UserID::Gender::Age::Occupation::Zip-code
All demographic information is provided voluntarily by the users and is
not checked for accuracy. Only users who have provided some demographic
information are included in this data set.
- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:
* 1: "Under 18"
* 18: "18-24"
* 25: "25-34"
* 35: "35-44"
* 45: "45-49"
* 50: "50-55"
* 56: "56+"
- Occupation is chosen from the following choices:
* 0: "other" or not specified
* 1: "academic/educator"
* 2: "artist"
* 3: "clerical/admin"
* 4: "college/grad student"
* 5: "customer service"
* 6: "doctor/health care"
* 7: "executive/managerial"
* 8: "farmer"
* 9: "homemaker"
* 10: "K-12 student"
* 11: "lawyer"
* 12: "programmer"
* 13: "retired"
* 14: "sales/marketing"
* 15: "scientist"
* 16: "self-employed"
* 17: "technician/engineer"
* 18: "tradesman/craftsman"
* 19: "unemployed"
* 20: "writer"
MOVIES FILE DESCRIPTION
================================================================================
Movie information is in the file "movies.dat" and is in the following
format:
MovieID::Title::Genres
- Titles are identical to titles provided by the IMDB (including
year of release)
- Genres are pipe-separated and are selected from the following genres:
* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
- Some MovieIDs do not correspond to a movie due to accidental duplicate
entries and/or test entries
- Movies are mostly entered by hand, so errors and inconsistencies may exist
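A similar hedged sketch for the other two files; note the ml-1m files are not
UTF-8 encoded (latin-1 usually works), and the pipe-separated genres can be
split into per-movie lists:

    import pandas as pd

    users = pd.read_csv("users.dat", sep="::", engine="python", header=None,
                        names=["UserID", "Gender", "Age", "Occupation", "Zip-code"],
                        encoding="latin-1")
    movies = pd.read_csv("movies.dat", sep="::", engine="python", header=None,
                         names=["MovieID", "Title", "Genres"], encoding="latin-1")
    movies["Genres"] = movies["Genres"].str.split("|")  # one genre list per movie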

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

@@ -0,0 +1,28 @@
import torch
from torch.utils.data import DataLoader, Dataset
class UserItemRatingDataset(Dataset):
"""
    Wrapper that converts <user, item, rating> Tensors into a torch Dataset
"""
def __init__(self, user_tensor, item_tensor, target_tensor):
"""
args:
            target_tensor: torch.Tensor, the corresponding rating for each <user, item> pair
"""
self._user_tensor = user_tensor
self._item_tensor = item_tensor
self._target_tensor = target_tensor
def __getitem__(self, index):
return self._user_tensor[index], self._item_tensor[index], self._target_tensor[index]
def __len__(self):
return self._user_tensor.size(0)
def Construct_DataLoader(users, items, ratings, batchsize):
    assert batchsize > 0
    # users/items/ratings are parallel lists; the 0/1 implicit-feedback labels
    # are stored as a LongTensor here and cast to float by the trainer
    dataset = UserItemRatingDataset(user_tensor=torch.LongTensor(users),
                                    item_tensor=torch.LongTensor(items),
                                    target_tensor=torch.LongTensor(ratings))
    return DataLoader(dataset, batch_size=batchsize, shuffle=True)
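A quick usage sketch (illustrative values; in practice users, items and
ratings are the equal-length lists produced by the sample generator in
dataprocess.py):

    loader = Construct_DataLoader(users=[0, 0, 1, 1], items=[10, 42, 7, 3],
                                  ratings=[1, 0, 1, 0], batchsize=2)
    for user, item, rating in loader:
        print(user.shape, item.shape, rating.shape)  # torch.Size([2]) each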

@@ -0,0 +1,142 @@
import random
import pandas as pd
import numpy as np
from copy import deepcopy
random.seed(0)
class DataProcess(object):
def __init__(self, filename):
self._filename = filename
self._loadData()
self._preProcess()
self._binarize(self._originalRatings)
        # deduplicate the 'userId' column and collect the ids into a user pool
self._userPool = set(self._originalRatings['userId'].unique())
self._itemPool = set(self._originalRatings['itemId'].unique())
# print("user_pool size: ", len(self._userPool))
# print("item_pool size: ", len(self._itemPool))
self._select_Negatives(self._originalRatings)
self._split_pool(self._preprocessRatings)
def _loadData(self):
self._originalRatings = pd.read_csv(self._filename, sep='::', header=None, names=['uid', 'mid', 'rating', 'timestamp'],
engine='python')
return self._originalRatings
def _preProcess(self):
"""
对user和item都重新编号这里这么做的原因是因为模型的输入是one-hot向量需要把user和item都限制在Embedding的长度之内
模型的两个输入的长度分别是user和item的数量所以要重新从0编号
"""
# 1. 新建名为"userId"的列这列对用户从0开始编号
user_id = self._originalRatings[['uid']].drop_duplicates().reindex()
user_id['userId'] = np.arange(len(user_id)) #根据user的长度创建一个数组
# 将原先的DataFrame与user_id按照"uid"这一列进行合并
self._originalRatings = pd.merge(self._originalRatings, user_id, on=['uid'], how='left')
# 2. 对物品进行重新排列
item_id = self._originalRatings[['mid']].drop_duplicates()
item_id['itemId'] = np.arange(len(item_id))
self._originalRatings = pd.merge(self._originalRatings, item_id, on=['mid'], how='left')
# 按照['userId', 'itemId', 'rating', 'timestamp']的顺序重新排列
self._originalRatings = self._originalRatings[['userId', 'itemId', 'rating', 'timestamp']]
# print(self._originalRatings)
# print('Range of userId is [{}, {}]'.format(self._originalRatings.userId.min(), self._originalRatings.userId.max()))
# print('Range of itemId is [{}, {}]'.format(self._originalRatings.itemId.min(), self._originalRatings.itemId.max()))
def _binarize(self, ratings):
"""
binarize data into 0 or 1 for implicit feedback
"""
        ratings = deepcopy(ratings)
        # use .loc to avoid pandas' chained-assignment warning
        ratings.loc[ratings['rating'] > 0, 'rating'] = 1.0
        self._preprocessRatings = ratings
# print("binary: \n", self._preprocessRatings)
def _select_Negatives(self, ratings):
"""
        Select all negative samples and 99 randomly sampled negative items for each user.
"""
        # build a user -> set(interacted items) table
interact_status = ratings.groupby('userId')['itemId'].apply(set).reset_index().rename(
columns={'itemId': 'interacted_items'})
# print("interact_status: \n", interact_status)
        # every item the user never interacted with is treated as a negative
        interact_status['negative_items'] = interact_status['interacted_items'].apply(lambda x: self._itemPool - x)
        # randomly draw 99 negatives per user for evaluation (list() because
        # random.sample no longer accepts a set on Python 3.11+)
        interact_status['negative_samples'] = interact_status['negative_items'].apply(lambda x: random.sample(list(x), 99))
# print("after sampling interact_status: \n", interact_status)
#
# print("select and rearrange columns")
self._negatives = interact_status[['userId', 'negative_items', 'negative_samples']]
def _split_pool(self, ratings):
"""leave one out train/test split """
# print("sort by timestamp descend")
        # group by 'userId' and rank each user's ratings by timestamp, newest first
        ratings['rank_latest'] = ratings.groupby(['userId'])['timestamp'].rank(method='first', ascending=False)
        # print(ratings)
        # the newest rating of each user becomes the test set (leave-one-out)
        test = ratings[ratings['rank_latest'] == 1]
        # all older ratings form the training set
        train = ratings[ratings['rank_latest'] > 1]
# print("test: \n", test)
# print("train: \n", train)
# print("size of test {0}, size of train {1}".format(len(test), len(train)))
        # every user must appear in both the training and the test set
assert train['userId'].nunique() == test['userId'].nunique()
self.train_ratings = train[['userId', 'itemId', 'rating']]
self.test_ratings = test[['userId', 'itemId', 'rating']]
def sample_generator(self, num_negatives):
        # after the merge, train_ratings has columns ['userId', 'itemId', 'rating', 'negative_items']
        train_ratings = pd.merge(self.train_ratings, self._negatives[['userId', 'negative_items']], on='userId')
        # draw num_negatives items per user from its negative pool into a new
        # 'negatives' column (list() because random.sample no longer accepts a set)
        train_ratings['negatives'] = train_ratings['negative_items'].apply(lambda x: random.sample(list(x), num_negatives))
        # print(train_ratings)
        # build the model inputs: users, items and the target ratings
        users, items, ratings = [], [], []
        for row in train_ratings.itertuples():
            # positive sample: (userId, itemId, rating == 1)
            users.append(int(row.userId))
            items.append(int(row.itemId))
            ratings.append(float(row.rating))
            # num_negatives negative samples per positive: (userId, negative item, 0)
            for i in range(num_negatives):
                users.append(int(row.userId))
                items.append(int(row.negatives[i]))
                ratings.append(float(0))  # negatives are labelled 0
return users, items, ratings
def test_generator(self, num_negatives):
        # after the merge, test_ratings has columns ['userId', 'itemId', 'rating', 'negative_items']
        test_ratings = pd.merge(self.test_ratings, self._negatives[['userId', 'negative_items']], on='userId')
        # draw num_negatives items per user from its negative pool into a new 'negatives' column
        test_ratings['negatives'] = test_ratings['negative_items'].apply(lambda x: random.sample(list(x), num_negatives))
        # print(test_ratings)
        # build the model inputs: users, items and the target ratings
        users, items, ratings = [], [], []
        for row in test_ratings.itertuples():
            # positive sample: (userId, itemId, rating == 1)
            users.append(int(row.userId))
            items.append(int(row.itemId))
            ratings.append(float(row.rating))
            # num_negatives negative samples per positive: (userId, negative item, 0)
            for i in range(num_negatives):
                users.append(int(row.userId))
                items.append(int(row.negatives[i]))
                ratings.append(float(0))  # negatives are labelled 0
return users, items, ratings
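For orientation, a short usage sketch (the path and num_negatives are
illustrative; sample_generator returns three aligned Python lists):

    dp = DataProcess("../Data/ml-1m/ratings.dat")
    users, items, ratings = dp.sample_generator(num_negatives=4)
    # each positive is followed by 4 negatives, so len(users) == 5 * len(dp.train_ratings)
    print(users[:5], items[:5], ratings[:5])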

@@ -0,0 +1,192 @@
import torch
import torch.nn as nn
from abc import ABC, abstractmethod
class NCF(ABC):
def __init__(self, config, latent_dim_gmf=8, latent_dim_mlp=8):
self._config = config
self._num_users = config['num_users']
self._num_items = config['num_items']
self._latent_dim_gmf = latent_dim_gmf
self._latent_dim_mlp = latent_dim_mlp
        # user and item Embedding layers for the MLP branch: num_embeddings is
        # the number of users / items, embedding_dim is the latent dimension
        self._embedding_user_mlp = torch.nn.Embedding(num_embeddings=self._num_users, embedding_dim=self._latent_dim_mlp)
        self._embedding_item_mlp = torch.nn.Embedding(num_embeddings=self._num_items, embedding_dim=self._latent_dim_mlp)
        # user and item Embedding layers for the GMF branch
        self._embedding_user_gmf = torch.nn.Embedding(num_embeddings=self._num_users, embedding_dim=self._latent_dim_gmf)
        self._embedding_item_gmf = torch.nn.Embedding(num_embeddings=self._num_items, embedding_dim=self._latent_dim_gmf)
        # fully connected layers of the MLP tower
        self._fc_layers = torch.nn.ModuleList()
        for idx, (in_size, out_size) in enumerate(zip(config['layers'][:-1], config['layers'][1:])):
            self._fc_layers.append(torch.nn.Linear(in_size, out_size))
        # output activation
        self._logistic = nn.Sigmoid()
@property
def fc_layers(self):
return self._fc_layers
@property
def embedding_user_gmf(self):
return self._embedding_user_gmf
@property
def embedding_item_gmf(self):
return self._embedding_item_gmf
@property
def embedding_user_mlp(self):
return self._embedding_user_mlp
@property
def embedding_item_mlp(self):
return self._embedding_item_mlp
def saveModel(self):
torch.save(self.state_dict(), self._config['model_name'])
@abstractmethod
def load_preTrained_weights(self):
pass
class GMF(NCF, nn.Module):
def __init__(self, config, latent_dim_gmf):
nn.Module.__init__(self)
NCF.__init__(self, config=config, latent_dim_gmf=latent_dim_gmf)
        # linear layer that maps the latent vector to a single score
        self._affine_output = nn.Linear(in_features=self._latent_dim_gmf, out_features=1)
@property
def affine_output(self):
return self._affine_output
def forward(self, user_indices, item_indices):
"""
前向传播
:param user_indices: user Tensor
:param item_indices: item Tensor
:return: predicted rating
"""
# 先将user和item转换为对应的Embedding表示注意这个支持Tensor操作即传入的是一个user列表对其中每一个user都会执行Embedding操作即都会使用Embedding表示
user_embedding = self._embedding_user_gmf(user_indices)
item_embedding = self._embedding_item_gmf(item_indices)
# 对user_embedding和user_embedding进行逐元素相乘, 这一步其实就是MF算法的实现
element_product = torch.mul(user_embedding, item_embedding)
# 将逐元素的乘积的结果通过一个S型神经元
logits = self._affine_output(element_product)
rating = self._logistic(logits)
return rating
def load_preTrained_weights(self):
pass
class MLP(NCF, nn.Module):
def __init__(self, config, latent_dim_mlp):
nn.Module.__init__(self)
NCF.__init__(self, config=config, latent_dim_mlp=latent_dim_mlp)
        # linear layer mapping the last MLP layer's output to a single score
        self._affine_output = torch.nn.Linear(in_features=config['layers'][-1], out_features=1)
@property
def affine_output(self):
return self._affine_output
def forward(self, user_indices, item_indices):
"""
:param user_indices: user Tensor
:param item_indices: item Tensor
"""
# 先将user和item转换为对应的Embedding表示注意这个支持Tensor操作即传入的是一个user列表
# 对其中每一个user都会执行Embedding操作即都会使用Embedding表示
user_embedding = self._embedding_user_mlp(user_indices)
item_embedding = self._embedding_item_mlp(item_indices)
vector = torch.cat([user_embedding, item_embedding], dim=-1) # concat latent vector
for idx, _ in enumerate(range(len(self._fc_layers))):
vector = self._fc_layers[idx](vector)
vector = torch.nn.ReLU()(vector)
## Batch normalization
# vector = torch.nn.BatchNorm1d()(vector)
## DroupOut layer
# vector = torch.nn.Dropout(p=0.5)(vector)
logits = self._affine_output(vector)
rating = self._logistic(logits)
return rating
def load_preTrained_weights(self):
config = self._config
gmf_model = GMF(config, config['latent_dim_gmf'])
if config['use_cuda'] is True:
gmf_model.cuda()
        # load the pretrained GMF weights (map_location, commented out below, can remap devices)
state_dict = torch.load(self._config['pretrain_gmf'])
#map_location=lambda storage, loc: storage.cuda(device=self._config['device_id']))
#map_location = {'cuda:0': 'cpu'})
gmf_model.load_state_dict(state_dict, strict=False)
self._embedding_item_mlp.weight.data = gmf_model.embedding_item_gmf.weight.data
self._embedding_user_mlp.weight.data = gmf_model.embedding_user_gmf.weight.data
class NeuMF(NCF, nn.Module):
def __init__(self, config, latent_dim_gmf, latent_dim_mlp):
nn.Module.__init__(self)
NCF.__init__(self, config, latent_dim_gmf, latent_dim_mlp)
        # linear layer whose input is the concatenation of the GMF latent vector
        # and the last MLP layer's output; the output is a single score
        self._affine_output = torch.nn.Linear(in_features=config['layers'][-1] + config['latent_dim_gmf'], out_features=1)
@property
def affine_output(self):
return self._affine_output
def forward(self, user_indices, item_indices):
user_embedding_mlp = self._embedding_user_mlp(user_indices)
item_embedding_mlp = self._embedding_item_mlp(item_indices)
user_embedding_gmf = self._embedding_user_gmf(user_indices)
item_embedding_gmf = self._embedding_item_gmf(item_indices)
        # concatenate the two MLP latent vectors
        mlp_vector = torch.cat([user_embedding_mlp, item_embedding_mlp], dim=-1)
        # element-wise product of the two GMF latent vectors
        gmf_vector = torch.mul(user_embedding_gmf, item_embedding_gmf)
        for layer in self._fc_layers:
            mlp_vector = layer(mlp_vector)
            mlp_vector = torch.nn.ReLU()(mlp_vector)
vector = torch.cat([mlp_vector, gmf_vector], dim=-1)
logits = self._affine_output(vector)
rating = self._logistic(logits)
return rating
    def load_preTrained_weights(self):
        # load the pretrained MLP weights
        mlp_model = MLP(self._config['mlp_config'], self._config['mlp_config']['latent_dim_mlp'])
        if self._config['use_cuda'] is True:
            mlp_model.cuda()
        state_dict = torch.load(self._config['pretrain_mlp'])
        # map_location=lambda storage, loc: storage.cuda(device=self._config['device_id']))
        # map_location = {'cuda:0': 'cpu'})
        mlp_model.load_state_dict(state_dict, strict=False)
        self._embedding_item_mlp.weight.data = mlp_model.embedding_item_mlp.weight.data
        self._embedding_user_mlp.weight.data = mlp_model.embedding_user_mlp.weight.data
        # copy both weights and biases of the fully connected layers
        for idx in range(len(self._fc_layers)):
            self._fc_layers[idx].weight.data = mlp_model.fc_layers[idx].weight.data
            self._fc_layers[idx].bias.data = mlp_model.fc_layers[idx].bias.data
        # load the pretrained GMF weights
        gmf_model = GMF(self._config['gmf_config'], self._config['gmf_config']['latent_dim_gmf'])
        if self._config['use_cuda'] is True:
            gmf_model.cuda()
        state_dict = torch.load(self._config['pretrain_gmf'])
        # map_location=lambda storage, loc: storage.cuda(device=self._config['device_id']))
        # map_location = {'cuda:0': 'cpu'})
        gmf_model.load_state_dict(state_dict, strict=False)
        self._embedding_item_gmf.weight.data = gmf_model.embedding_item_gmf.weight.data
        self._embedding_user_gmf.weight.data = gmf_model.embedding_user_gmf.weight.data
        # fuse the two pretrained output layers, weighted by alpha
        self._affine_output.weight.data = self._config['alpha'] * torch.cat([mlp_model.affine_output.weight.data, gmf_model.affine_output.weight.data], dim=-1)
        self._affine_output.bias.data = self._config['alpha'] * (mlp_model.affine_output.bias.data + gmf_model.affine_output.bias.data)
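A minimal shape check for the networks above, under an illustrative config
(all numbers are made up; layers[0] must equal 2 * latent_dim_mlp so the
concatenated user/item embedding fits the first Linear layer):

    import torch

    config = {'num_users': 100, 'num_items': 200,
              'latent_dim_gmf': 8, 'latent_dim_mlp': 8,
              'layers': [16, 32, 16, 8]}
    model = NeuMF(config, config['latent_dim_gmf'], config['latent_dim_mlp'])
    users = torch.LongTensor([0, 1, 2])
    items = torch.LongTensor([5, 6, 7])
    print(model(users, items).shape)  # torch.Size([3, 1]), values in (0, 1)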

@@ -0,0 +1,105 @@
import torch
from NCF.dataloader import Construct_DataLoader
def pick_optimizer(network, params):
optimizer = None
if params['optimizer'] == 'sgd':
optimizer = torch.optim.SGD(network.parameters(),
lr=params['sgd_lr'],
momentum=params['sgd_momentum'],
weight_decay=params['l2_regularization'])
elif params['optimizer'] == 'adam':
optimizer = torch.optim.Adam(network.parameters(),
lr=params['adam_lr'],
weight_decay=params['l2_regularization'])
elif params['optimizer'] == 'rmsprop':
optimizer = torch.optim.RMSprop(network.parameters(),
lr=params['rmsprop_lr'],
alpha=params['rmsprop_alpha'],
momentum=params['rmsprop_momentum'])
return optimizer
class Trainer(object):
def __init__(self, model, config):
self._config = config
self._model = model
        # pick the optimizer
        self._optimizer = pick_optimizer(self._model, self._config)
        # binary cross-entropy loss, suited to 0/1 implicit-feedback labels
        self._crit = torch.nn.BCELoss()
def _train_single_batch(self, users, items, ratings):
"""
对单个小批量数据进行训练
:param users: user Tensor
:param items: item Tensor
:param ratings: rating Tensor
:return:
"""
if self._config['use_cuda'] is True:
# 将这些数据由CPU迁移到GPU
users, items, ratings = users.cuda(), items.cuda(), ratings.cuda()
# 先将梯度清零,如果不清零那么这个梯度就和上一个mini-batch有关
self._optimizer.zero_grad()
# 模型的输入users items调用forward进行前向传播
ratings_pred = self._model(users, items)
# 通过交叉熵损失函数来计算损失, ratings_pred.view(-1)代表将预测结果摊平,变成一维的结构。
loss = self._crit(ratings_pred.view(-1), ratings)
# 反向传播计算梯度
loss.backward()
# 梯度下降等优化器 更新参数
self._optimizer.step()
# 将loss的值提取成python的float类型
loss = loss.item()
return loss
def _train_an_epoch(self, train_loader, epoch_id):
"""
训练一个Epoch即将训练集中的所有样本全部都过一遍
:param train_loader: Torch的DataLoader
:param epoch_id: 训练轮次Id
:return:
"""
# 告诉模型目前处于训练模式启用dropout以及batch normalization
self._model.train()
total_loss = 0
# 从DataLoader中获取小批量的id以及数据
for batch_id, batch in enumerate(train_loader):
assert isinstance(batch[0], torch.LongTensor)
# 这里的user, item, rating大小变成了1024维了因为batch_size是1024即每次选取1024个样本数据进行训练
user, item, rating = batch[0], batch[1], batch[2]
rating = rating.float()
loss = self._train_single_batch(user, item, rating)
print('[Training Epoch {}] Batch {}, Loss {}'.format(epoch_id, batch_id, loss))
total_loss += loss
print('Training Epoch: {}, TotalLoss: {}'.format(epoch_id, total_loss))
    def train(self, sampleGenerator):
        # move the model to the GPU if requested
        self.use_cuda()
        # optionally start from pretrained weights
        self.load_preTrained_weights()
        for epoch in range(self._config['num_epoch']):
            print('-' * 20 + ' Epoch {} starts '.format(epoch) + '-' * 20)
            # re-sample the negatives for every epoch
            users, items, ratings = sampleGenerator(num_negatives=self._config['num_negative'])
            # build a DataLoader over the freshly sampled data
            data_loader = Construct_DataLoader(users=users, items=items, ratings=ratings,
                                               batchsize=self._config['batch_size'])
            # train for one epoch
            self._train_an_epoch(data_loader, epoch_id=epoch)
def use_cuda(self):
if self._config['use_cuda'] is True:
assert torch.cuda.is_available(), 'CUDA is not available'
torch.cuda.set_device(self._config['device_id'])
self._model.cuda()
def load_preTrained_weights(self):
if self._config['pretrain'] is True:
self._model.load_preTrained_weights()
def save(self):
self._model.saveModel()

Binary file not shown.

Binary file not shown.

Binary file not shown.

@@ -0,0 +1,120 @@
import sys
import os.path as osp
this_dir = osp.dirname(__file__)
lib_path = osp.join(this_dir, '..')
sys.path.insert(0, lib_path)
from NCF.dataprocess import DataProcess
from NCF.network import GMF,MLP,NeuMF
from NCF.trainer import Trainer
import numpy as np
import torch
gmf_config = {'num_epoch': 100,
'batch_size': 1024,
'optimizer': 'adam',
'adam_lr': 1e-3,
'num_users': 6040,
'num_items': 3706,
'latent_dim_gmf': 8,
'num_negative': 4,
'layers': [],
'l2_regularization': 0, # 0.01
'pretrain': False, # do not modify this
'use_cuda': True,
'device_id': 0,
'model_name': '../TrainedModels/NCF_GMF.model'
}
mlp_config = {'num_epoch': 100,
'batch_size': 1024, # 1024,
'optimizer': 'adam',
'adam_lr': 1e-3,
'num_users': 6040,
'num_items': 3706,
'latent_dim_mlp': 8,
'latent_dim_gmf': 8,
'num_negative': 4,
'layers': [16,64,32,16,8], # layers[0] is the concat of latent user vector & latent item vector
'l2_regularization': 0.0000001, # MLP model is sensitive to hyper params
'use_cuda': True,
'device_id': 0,
'pretrain': True,
'gmf_config': gmf_config,
'pretrain_gmf': '../TrainedModels/NCF_GMF.model',
'model_name': '../TrainedModels/NCF_MLP.model'
}
neumf_config = {'num_epoch': 100,
'batch_size': 1024, #1024
'optimizer': 'adam',
'adam_lr': 1e-3,
'num_users': 6040,
'num_items': 3706,
'latent_dim_gmf': 8,
'latent_dim_mlp': 8,
'num_negative': 4,
                'layers': [16,32,16,8],  # layers[0] is the dimension of the concatenated user & item latent vectors
                'l2_regularization': 0.01,
                'alpha': 0.5,  # weight balancing the pretrained GMF and MLP parameters
'use_cuda': True,
'device_id': 0,
'pretrain': False,
'gmf_config': gmf_config,
'pretrain_gmf': '../TrainedModels/NCF_GMF.model',
'mlp_config': mlp_config,
'pretrain_mlp': '../TrainedModels/NCF_MLP.model',
'model_name': '../TrainedModels/NCF_NeuMF.model'
}
if __name__ == "__main__":
    ####################################################################################
    # NCF: neural collaborative filtering
    ####################################################################################
    # load and preprocess the data
    dp = DataProcess("../Data/ml-1m/ratings.dat")
    # pick ONE of the three models below; each assignment overwrites the
    # previous one, so as written only the NeuMF model is actually trained
    # initialize the GMF model
    config = gmf_config
    model = GMF(config, config['latent_dim_gmf'])
    # initialize the MLP model
    config = mlp_config
    model = MLP(config, config['latent_dim_mlp'])
    # initialize the NeuMF model
    config = neumf_config
    model = NeuMF(config, config['latent_dim_gmf'], config['latent_dim_mlp'])
    # ###############################################################
    # training stage
    # ###############################################################
    trainer = Trainer(model=model, config=config)
    trainer.train(dp.sample_generator)
    trainer.save()
    # ###############################################################
    # testing stage
    # ###############################################################
    # reload the data set if needed
    # dp = DataProcess("../Data/ml-1m/ratings.dat")
config = neumf_config
neumf = NeuMF(config, config['latent_dim_gmf'], config['latent_dim_mlp'])
state_dict = torch.load("../TrainedModels/NCF_NeuMF.model", map_location=torch.device('cpu'))
neumf.load_state_dict(state_dict, strict=False)
    # predict User_id's preference score for every item
    User_id = 1
    result = np.zeros(3706)
    for j in range(3706):
        score = neumf.forward(torch.LongTensor([User_id]), torch.LongTensor([j]))
        result[j] = score.detach().numpy()[0][0]
    # recommend the N item ids with the highest predicted scores
    N = 5
    indices = np.argsort(-result)[:N]
    print(indices)
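A hedged sketch of the leave-one-out evaluation (Hit Ratio@10) that commonly
accompanies NeuMF; it assumes the dp and neumf objects from this script are in
scope and relies on test_generator emitting, per user, one positive followed
by the num_negatives sampled negatives:

    # HR@10: is each user's held-out positive ranked inside its top 10?
    users, items, ratings = dp.test_generator(num_negatives=99)
    K, group = 10, 100  # 1 positive + 99 negatives per user
    hits, n_users = 0, len(users) // group
    for u in range(n_users):
        batch = slice(u * group, (u + 1) * group)
        scores = neumf(torch.LongTensor(users[batch]), torch.LongTensor(items[batch]))
        if 0 in torch.topk(scores.view(-1), K).indices:  # position 0 is the positive
            hits += 1
    print('HR@{}: {:.4f}'.format(K, hits / n_users))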