init

4 years ago · 1dca992a8d
parent 1cd49fb2b9
commit 1dca992a8d
8 changed files with 1010730 additions and 14 deletions
--- a/LICENSE-2.0.txt
+++ b/LICENSE-2.0.txt
@ -0,0 +1,202 @@
+
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
+   END OF TERMS AND CONDITIONS
+
+   APPENDIX: How to apply the Apache License to your work.
+
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright [2019] [Lorenz K Muller]
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
--- a/README.md
+++ b/README.md
@ -1,19 +1,29 @@
-#### Create a new repository from the command line
+# kernelNet MovieLens-1M

-```bash
-touch README.md
-git init
-git add README.md
-git commit -m "first commit"
-git remote add origin https://bdgit.educoder.net/ZhengHui/kernelNet.git
-git push -u origin master
+A state-of-the-art model for MovieLens-1M.

-```
+This is a minimal implementation of a kernelNet sparsified autoencoder for MovieLens-1M.
+See the paper: http://proceedings.mlr.press/v80/muller18a.html

-#### Push an existing repository from the command line
+## Setup
+Download or clone this repository.

-```bash
-git remote add origin https://bdgit.educoder.net/ZhengHui/kernelNet.git
-git push -u origin master
+### Requirements
+* numpy
+* scipy
+* tensorflow (tested with version 1.13)

-```
+### Dataset
+The script expects the MovieLens-1M dataset in a subdirectory named `ml-1m`.
+Download it from https://grouplens.org/datasets/movielens/1m/
+
+or, on Linux, run the following in the project directory:
+
+```wget --output-document=ml-1m.zip http://www.grouplens.org/system/files/ml-1m.zip; unzip ml-1m.zip```
+
+## Run
+```python kernelNet_ml1m.py```
+The optional positional arguments are the L2 and sparsity regularization strengths, in that order. The defaults are 60 and 0.013.
+
+### Results
+With the default parameters, this slightly outperforms the model in the paper, reaching 0.823 validation RMSE (10-times repeated random sub-sampling validation).
--- a/dataLoader.py
+++ b/dataLoader.py
@ -0,0 +1,66 @@
+'''
+written by Lorenz Muller
+'''
+
+import numpy as np
+from time import time
+
+
+def loadData(path='./', valfrac=0.1, delimiter='::', seed=1234,
+             transpose=False):
+    '''
+    loads ml-1m data
+
+    :param path: path to the ratings file
+    :param valfrac: fraction of data to use for validation
+    :param delimiter: delimiter used in data file
+    :param seed: random seed for validation splitting
+    :param transpose: flag to transpose output matrices (swapping users with movies)
+    :return: train ratings (n_u, n_m), valid ratings (n_u, n_m)
+    '''
+    np.random.seed(seed)
+
+    tic = time()
+    print('reading data...')
+    data = np.loadtxt(path, skiprows=0, delimiter=delimiter).astype('int32')
+    print('data read in', time() - tic, 'seconds')
+
+    n_u = np.unique(data[:, 0]).shape[0]  # number of users
+    n_m = np.unique(data[:, 1]).shape[0]  # number of movies
+    n_r = data.shape[0]  # number of ratings
+
+    # these dictionaries map each user/movie id to a user/movie number (contiguous from zero)
+    udict = {}
+    for i, u in enumerate(np.unique(data[:, 0]).tolist()):
+        udict[u] = i
+    mdict = {}
+    for i, m in enumerate(np.unique(data[:, 1]).tolist()):
+        mdict[m] = i
+
+    # shuffle indices
+    idx = np.arange(n_r)
+    np.random.shuffle(idx)
+
+    trainRatings = np.zeros((n_u, n_m), dtype='float32')
+    validRatings = np.zeros((n_u, n_m), dtype='float32')
+
+    for i in range(n_r):
+        u_id = data[idx[i], 0]
+        m_id = data[idx[i], 1]
+        r = data[idx[i], 2]
+
+        # the first few ratings of the shuffled data array are validation data
+        if i <= valfrac * n_r:
+            validRatings[udict[u_id], mdict[m_id]] = int(r)
+        # the rest are training data
+        else:
+            trainRatings[udict[u_id], mdict[m_id]] = int(r)
+
+    if transpose:
+        trainRatings = trainRatings.T
+        validRatings = validRatings.T
+
+    print('loaded dense data matrix')
+
+    return trainRatings, validRatings
+
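The split performed by `loadData` can also be written without the per-rating Python loop. The following is a minimal NumPy sketch, not the author's code: the function name `split_ratings` and the toy `data` array are hypothetical, and it puts exactly `int(valfrac * n_r)` ratings into the validation matrix, which is one fewer than the `i <= valfrac * n_r` test in the loop above.

```python
import numpy as np


def split_ratings(data, valfrac=0.1, seed=1234):
    """Hypothetical vectorized variant of the train/validation split.

    data: integer array with columns (user id, movie id, rating, ...).
    Returns dense float32 matrices of shape (n_users, n_movies).
    """
    rng = np.random.RandomState(seed)

    # Map raw ids to contiguous indices starting at zero.
    users, u_idx = np.unique(data[:, 0], return_inverse=True)
    movies, m_idx = np.unique(data[:, 1], return_inverse=True)
    u_idx, m_idx = u_idx.reshape(-1), m_idx.reshape(-1)

    # Shuffle rating positions; the first n_val go to validation.
    n_r = data.shape[0]
    perm = rng.permutation(n_r)
    n_val = int(valfrac * n_r)
    val, trn = perm[:n_val], perm[n_val:]

    train = np.zeros((users.size, movies.size), dtype='float32')
    valid = np.zeros_like(train)
    train[u_idx[trn], m_idx[trn]] = data[trn, 2]
    valid[u_idx[val], m_idx[val]] = data[val, 2]
    return train, valid


# Toy data: 10 ratings from 4 users on 3 movies (made-up values).
data = np.array([
    [1, 10, 5, 0], [1, 20, 3, 0], [2, 10, 4, 0], [3, 30, 1, 0],
    [3, 20, 2, 0], [2, 30, 5, 0], [1, 30, 4, 0], [3, 10, 3, 0],
    [2, 20, 2, 0], [4, 10, 1, 0],
])
train, valid = split_ratings(data, valfrac=0.2)
print(train.shape, np.count_nonzero(train), np.count_nonzero(valid))
```

As in `loadData`, a zero entry means "no rating", so every rating lands in exactly one of the two matrices.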
--- a/kernelNet_ml1m.py
+++ b/kernelNet_ml1m.py
@ -0,0 +1,136 @@
+'''
+written by Lorenz Muller
+'''
+
+import numpy as np
+import tensorflow as tf
+from time import time
+import sys
+from dataLoader import loadData
+import os
+
+seed = int(time())
+np.random.seed(seed)
+
+
+# load data
+tr, vr = loadData('./ml-1m/ratings.dat', delimiter='::',
+                  seed=seed, transpose=True, valfrac=0.1)
+
+tm = np.greater(tr, 1e-12).astype('float32')  # masks indicating non-zero entries
+vm = np.greater(vr, 1e-12).astype('float32')
+
+n_m = tr.shape[0]  # number of movies
+n_u = tr.shape[1]  # number of users (may be switched depending on 'transpose' in loadData)
+
+# Set hyper-parameters
+n_hid = 500
+lambda_2 = float(sys.argv[1]) if len(sys.argv) > 1 else 60.
+lambda_s = float(sys.argv[2]) if len(sys.argv) > 2 else 0.013
+n_layers = 2
+output_every = 50  # evaluate on the validation split every output_every iterations; interrupts the L-BFGS loop
+n_epoch = n_layers * 10 * output_every
+verbose_bfgs = True
+use_gpu = True
+if not use_gpu:
+    os.environ['CUDA_VISIBLE_DEVICES'] = ''
+    
+# Input placeholders
+R = tf.placeholder("float", [None, n_u])
+
+
+# define network functions
+def kernel(u, v):
+    """
+    Sparsifying kernel function
+
+    :param u: input vectors [n_in, 1, n_dim]
+    :param v: output vectors [1, n_hid, n_dim]
+    :return: input to output connection matrix
+    """
+    dist = tf.norm(u - v, ord=2, axis=2)
+    hat = tf.maximum(0., 1. - dist**2)
+    return hat
+
+
+def kernel_layer(x, n_hid=500, n_dim=5, activation=tf.nn.sigmoid, lambda_s=lambda_s,
+                 lambda_2=lambda_2, name=''):
+    """
+    a kernel sparsified layer
+
+    :param x: input [batch, channels]
+    :param n_hid: number of hidden units
+    :param n_dim: number of dimensions to embed for kernelization
+    :param activation: output activation
+    :param name: layer name for scoping
+    :return: layer output, regularization term
+    """
+
+    # define variables
+    with tf.variable_scope(name):
+        W = tf.get_variable('W', [x.shape[1], n_hid])
+        n_in = x.get_shape().as_list()[1]
+        u = tf.get_variable('u', initializer=tf.random_normal([n_in, 1, n_dim], 0., 1e-3))
+        v = tf.get_variable('v', initializer=tf.random_normal([1, n_hid, n_dim], 0., 1e-3))
+        b = tf.get_variable('b', [n_hid])
+
+    # compute sparsifying kernel
+    # as u and v move further from each other for some given pair of neurons, their connection
+    # decreases in strength and eventually goes to zero.
+    w_hat = kernel(u, v)
+
+    # compute regularization terms
+    sparse_reg = tf.contrib.layers.l2_regularizer(lambda_s)
+    sparse_reg_term = tf.contrib.layers.apply_regularization(sparse_reg, [w_hat])
+
+    l2_reg = tf.contrib.layers.l2_regularizer(lambda_2)
+    l2_reg_term = tf.contrib.layers.apply_regularization(l2_reg, [W])
+
+    # compute output
+    W_eff = W * w_hat
+    y = tf.matmul(x, W_eff) + b
+    y = activation(y)
+    return y, sparse_reg_term + l2_reg_term
+
+
+# Instantiate network
+y = R
+reg_losses = None
+for i in range(n_layers):
+    y, reg_loss = kernel_layer(y, n_hid, name=str(i))
+    reg_losses = reg_loss if reg_losses is None else reg_losses + reg_loss
+prediction, reg_loss = kernel_layer(y, n_u, activation=tf.identity, name='out')
+reg_losses = reg_losses + reg_loss
+
+# Compute loss (symbolic)
+diff = tm*(R - prediction)
+sqE = tf.nn.l2_loss(diff)
+loss = sqE + reg_losses
+
+# Instantiate L-BFGS Optimizer
+optimizer = tf.contrib.opt.ScipyOptimizerInterface(loss, options={'maxiter': output_every,
+                                                                  'disp': verbose_bfgs,
+                                                                  'maxcor': 10},
+                                                   method='L-BFGS-B')
+
+# Training and validation loop
+init = tf.global_variables_initializer()
+with tf.Session() as sess:
+    sess.run(init)
+    for i in range(int(n_epoch / output_every)):
+        optimizer.minimize(sess, feed_dict={R: tr})  # run up to maxiter L-BFGS steps
+        pre = sess.run(prediction, feed_dict={R: tr})  # predict ratings
+
+        error = (vm * (np.clip(pre, 1., 5.) - vr) ** 2).sum() / vm.sum()  # validation MSE
+        error_train = (tm * (np.clip(pre, 1., 5.) - tr) ** 2).sum() / tm.sum()  # training MSE
+
+        print('.-^-._' * 12)
+        print('epoch:', i, 'validation rmse:', np.sqrt(error), 'train rmse:', np.sqrt(error_train))
+        print('.-^-._' * 12)
+
+    with open('summary_ml1m.txt', 'a') as file:
+        for a in sys.argv[1:]:
+            file.write(a + ' ')
+        file.write(str(np.sqrt(error)) + ' ' + str(np.sqrt(error_train))
+                   + ' ' + str(seed) + '\n')
+        file.close()
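To make the sparsifying kernel and the masked error in `kernelNet_ml1m.py` easier to check by hand, here is a small NumPy sketch of both. It mirrors the formulas in `kernel` and in the validation loop, but the helper names `hat_kernel` and `masked_rmse` and all input values are made up for illustration; it is not part of the TensorFlow graph.

```python
import numpy as np


def hat_kernel(u, v):
    """max(0, 1 - ||u - v||^2) for every (input, hidden) pair.

    u: [n_in, 1, n_dim], v: [1, n_hid, n_dim]; broadcasting yields [n_in, n_hid].
    """
    dist_sq = np.sum((u - v) ** 2, axis=2)
    return np.maximum(0., 1. - dist_sq)


def masked_rmse(pred, target, mask):
    """RMSE over observed entries only, with predictions clipped to [1, 5]."""
    pred = np.clip(pred, 1., 5.)
    return np.sqrt((mask * (pred - target) ** 2).sum() / mask.sum())


# Two input units and three hidden units embedded in 2 dimensions.
u = np.array([[[0., 0.]], [[2., 0.]]])            # shape (2, 1, 2)
v = np.array([[[0., 0.], [0.5, 0.], [3., 0.]]])   # shape (1, 3, 2)
w_hat = hat_kernel(u, v)
print(w_hat)  # pairs at squared distance >= 1 get weight 0

pred = np.array([[6., 3.], [2., 0.]])
target = np.array([[5., 0.], [4., 0.]])
mask = (target > 1e-12).astype('float32')         # 1 where a rating exists
rmse = masked_rmse(pred, target, mask)
print(rmse)
```

Only the first input unit lies within unit distance of any hidden unit, so the second row of `w_hat` is all zeros; multiplying `W` element-wise by such a matrix is what removes those connections in `kernel_layer`.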
--- a/ml-1m/README
+++ b/ml-1m/README
@ -0,0 +1,170 @@
+SUMMARY
+================================================================================
+
+These files contain 1,000,209 anonymous ratings of approximately 3,900 movies 
+made by 6,040 MovieLens users who joined MovieLens in 2000.
+
+USAGE LICENSE
+================================================================================
+
+Neither the University of Minnesota nor any of the researchers
+involved can guarantee the correctness of the data, its suitability
+for any particular purpose, or the validity of results based on the
+use of the data set.  The data set may be used for any research
+purposes under the following conditions:
+
+     * The user may not state or imply any endorsement from the
+       University of Minnesota or the GroupLens Research Group.
+
+     * The user must acknowledge the use of the data set in
+       publications resulting from the use of the data set
+       (see below for citation information).
+
+     * The user may not redistribute the data without separate
+       permission.
+
+     * The user may not use this information for any commercial or
+       revenue-bearing purposes without first obtaining permission
+       from a faculty member of the GroupLens Research Project at the
+       University of Minnesota.
+
+If you have any further questions or comments, please contact GroupLens
+<grouplens-info@cs.umn.edu>. 
+
+CITATION
+================================================================================
+
+To acknowledge use of the dataset in publications, please cite the following
+paper:
+
+F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History
+and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4,
+Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872
+
+
+ACKNOWLEDGEMENTS
+================================================================================
+
+Thanks to Shyong Lam and Jon Herlocker for cleaning up and generating the data
+set.
+
+FURTHER INFORMATION ABOUT THE GROUPLENS RESEARCH PROJECT
+================================================================================
+
+The GroupLens Research Project is a research group in the Department of 
+Computer Science and Engineering at the University of Minnesota. Members of 
+the GroupLens Research Project are involved in many research projects related 
+to the fields of information filtering, collaborative filtering, and 
+recommender systems. The project is lead by professors John Riedl and Joseph 
+Konstan. The project began to explore automated collaborative filtering in 
+1992, but is most well known for its world wide trial of an automated 
+collaborative filtering system for Usenet news in 1996. Since then the project 
+has expanded its scope to research overall information filtering solutions, 
+integrating in content-based methods as well as improving current collaborative 
+filtering technology.
+
+Further information on the GroupLens Research project, including research 
+publications, can be found at the following web site:
+        
+        http://www.grouplens.org/
+
+GroupLens Research currently operates a movie recommender based on 
+collaborative filtering:
+
+        http://www.movielens.org/
+
+RATINGS FILE DESCRIPTION
+================================================================================
+
+All ratings are contained in the file "ratings.dat" and are in the
+following format:
+
+UserID::MovieID::Rating::Timestamp
+
+- UserIDs range between 1 and 6040 
+- MovieIDs range between 1 and 3952
+- Ratings are made on a 5-star scale (whole-star ratings only)
+- Timestamp is represented in seconds since the epoch as returned by time(2)
+- Each user has at least 20 ratings
+
+USERS FILE DESCRIPTION
+================================================================================
+
+User information is in the file "users.dat" and is in the following
+format:
+
+UserID::Gender::Age::Occupation::Zip-code
+
+All demographic information is provided voluntarily by the users and is
+not checked for accuracy.  Only users who have provided some demographic
+information are included in this data set.
+
+- Gender is denoted by a "M" for male and "F" for female
+- Age is chosen from the following ranges:
+
+	*  1:  "Under 18"
+	* 18:  "18-24"
+	* 25:  "25-34"
+	* 35:  "35-44"
+	* 45:  "45-49"
+	* 50:  "50-55"
+	* 56:  "56+"
+
+- Occupation is chosen from the following choices:
+
+	*  0:  "other" or not specified
+	*  1:  "academic/educator"
+	*  2:  "artist"
+	*  3:  "clerical/admin"
+	*  4:  "college/grad student"
+	*  5:  "customer service"
+	*  6:  "doctor/health care"
+	*  7:  "executive/managerial"
+	*  8:  "farmer"
+	*  9:  "homemaker"
+	* 10:  "K-12 student"
+	* 11:  "lawyer"
+	* 12:  "programmer"
+	* 13:  "retired"
+	* 14:  "sales/marketing"
+	* 15:  "scientist"
+	* 16:  "self-employed"
+	* 17:  "technician/engineer"
+	* 18:  "tradesman/craftsman"
+	* 19:  "unemployed"
+	* 20:  "writer"
+
+MOVIES FILE DESCRIPTION
+================================================================================
+
+Movie information is in the file "movies.dat" and is in the following
+format:
+
+MovieID::Title::Genres
+
+- Titles are identical to titles provided by the IMDB (including
+year of release)
+- Genres are pipe-separated and are selected from the following genres:
+
+	* Action
+	* Adventure
+	* Animation
+	* Children's
+	* Comedy
+	* Crime
+	* Documentary
+	* Drama
+	* Fantasy
+	* Film-Noir
+	* Horror
+	* Musical
+	* Mystery
+	* Romance
+	* Sci-Fi
+	* Thriller
+	* War
+	* Western
+
+- Some MovieIDs do not correspond to a movie due to accidental duplicate
+entries and/or test entries
+- Movies are mostly entered by hand, so errors and inconsistencies may exist
--- a/ml-1m/movies.dat
+++ b/ml-1m/movies.dat
--- a/ml-1m/ratings.dat
+++ b/ml-1m/ratings.dat
--- a/ml-1m/users.dat
+++ b/ml-1m/users.dat