Command Line Scripts

datatest.py

Tool for displaying data using loader.read_data_sets.

usage: datatest [-h] [-hashlist HASHLIST [HASHLIST ...]]
                [-cold | -subfolders SUBFOLDERS [SUBFOLDERS ...]]
                datadirectory
Positional arguments:
datadirectory Path to folder where data to be loaded and displayed is stored.
Options:
-hashlist List of hashes to read. Files of the form “features_<hash>.ext” or “labels_<hash>.ext” are read, where <hash> is a string in hashlist. If no hashlist is specified, all files of the form “features_<hash>.ext” or “labels_<hash>.ext” are loaded, regardless of what string <hash> is.
-cold=False Extra loading and testing for cold datasets.
-subfolders=('test', 'dev', 'train')
 List of subfolders to load and display.
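
The file selection rule described for -hashlist can be illustrated with a minimal sketch. The helper select_files and the regular expression below are hypothetical and written only for illustration; they are not taken from datatest.py or loader.read_data_sets, and the sample file names are made up.

```python
import re

# Hypothetical pattern for names of the form features_<hash>.ext or labels_<hash>.ext.
FILE_PATTERN = re.compile(r"^(features|labels)_(?P<hash>.+)\.[^.]+$")


def select_files(filenames, hashlist=None):
    """Return the file names that match the naming scheme.

    If hashlist is None, every matching file is kept regardless of <hash>;
    otherwise only files whose <hash> appears in hashlist are kept.
    """
    selected = []
    for name in filenames:
        match = FILE_PATTERN.match(name)
        if match is None:
            continue  # not a features_/labels_ file
        if hashlist is None or match.group("hash") in hashlist:
            selected.append(name)
    return selected


# Made-up file names for illustration only.
files = ["features_item.mat", "labels_ratings.mat", "features_user.index", "README.txt"]
print(select_files(files, hashlist=["item", "ratings"]))
print(select_files(files))
```

The first call keeps only the files whose hash is listed; the second call, with no hashlist, keeps every file that follows the naming scheme.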

normalize.py

Given the path to a file, removes capitalization and punctuation, except for infix apostrophes, e.g. “hasn’t”, “David’s”. The normalized text is saved in the same directory as the original text, with “_norm” appended to the file name before the extension. Beginning and end of sentence tokens are not provided by this normalization script.

usage: normalize [-h] filepath
Positional arguments:
filepath The path to the file including filename
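
The normalization described above might look roughly like the following sketch. The functions normalize_text and normalized_path are hypothetical names, not the actual contents of normalize.py. Collapsing whitespace into single spaces and mapping curly apostrophes to straight ones are assumptions made here for simplicity.

```python
import os
import re


def normalize_text(text):
    """Lowercase text and drop punctuation, keeping infix apostrophes."""
    # Assumption: curly apostrophes are treated the same as straight ones.
    text = text.lower().replace("\u2019", "'")
    # Drop apostrophes that are not surrounded by word characters on both sides.
    text = re.sub(r"(?<!\w)'|'(?!\w)", " ", text)
    # Drop every remaining character that is not a word character,
    # whitespace, or a (now infix-only) apostrophe.
    text = re.sub(r"[^\w\s']", " ", text)
    # Assumption: runs of whitespace are collapsed to single spaces.
    return " ".join(text.split())


def normalized_path(filepath):
    """Append "_norm" to the file name, before the extension."""
    root, ext = os.path.splitext(filepath)
    return root + "_norm" + ext


print(normalize_text("David's dog hasn't barked, 'yet'!"))
print(normalized_path("data/plot.txt"))
```

Because the output path is derived from the input path, the normalized file lands in the same directory as the original.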

MovieLens Processing

generateTermDoc.py

usage: generateTermDoc [-h] datapath dictionary descriptions doc_term_file
Positional arguments:
datapath Path to folder where dictionary and descriptions are located, and created document term matrix will be saved.
dictionary Name of the file containing line separated words in vocabulary.
descriptions Name of the file containing line separated text descriptions.
doc_term_file Name of the file to save the created sparse document term matrix.
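
As a rough illustration of what building a sparse document term matrix involves, the sketch below counts vocabulary words per description and stores only the nonzero entries. The function build_doc_term_triplets is a hypothetical name, whitespace tokenization is an assumption, and the sample data is made up; the actual script may differ in all of these.

```python
from collections import Counter


def build_doc_term_triplets(vocabulary, descriptions):
    """Return (rows, cols, counts, shape) for a sparse document term matrix.

    Row i corresponds to description i, column j to vocabulary word j.
    Only nonzero counts are stored.
    """
    word_to_col = {word: col for col, word in enumerate(vocabulary)}
    rows, cols, counts = [], [], []
    for row, description in enumerate(descriptions):
        # Assumption: tokens are separated by whitespace; out-of-vocabulary
        # tokens are ignored.
        term_counts = Counter(
            token for token in description.split() if token in word_to_col
        )
        for col, count in sorted(
            (word_to_col[word], count) for word, count in term_counts.items()
        ):
            rows.append(row)
            cols.append(col)
            counts.append(count)
    return rows, cols, counts, (len(descriptions), len(vocabulary))


# Made-up data for illustration only.
vocabulary = ["space", "robot", "love"]
descriptions = ["robot love robot", "space opera"]
rows, cols, counts, shape = build_doc_term_triplets(vocabulary, descriptions)
print(rows, cols, counts, shape)
```

If SciPy is available, triplets of this form can be passed to scipy.sparse.coo_matrix((counts, (rows, cols)), shape=shape) to obtain a sparse matrix object.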

ml100k_item_process.py

Reads MovieLens 100k item metadata and converts it to feature files. The produced files are:

features_item_month.index: A file storing a HotIndex object of movie month releases.

features_item_year.mat: A file storing a numpy array of movie year releases.

features_item_genre.mat: A file storing a scipy sparse csr_matrix of one hot encodings for movie genre.

usage: ml100k_item_process [-h] datapath outpath
Positional arguments:
datapath The path to the ml-100k dataset, usually “some_relative_path/ml-100k”.
outpath The path to the folder to store the processed MovieLens 100k item data feature files.
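
The month, year, and genre features listed above could be derived from one line of item metadata along the lines of the sketch below. It assumes the pipe-separated layout of the ml-100k item file (id, title, release date such as 01-Jan-1995, video release date, URL, then one 0/1 flag per genre). The function parse_item_line is a hypothetical name, the sample line and its URL are constructed for illustration, and writing the HotIndex, numpy, and scipy outputs is not shown.

```python
# Month abbreviations mapped to month numbers, to avoid locale-dependent parsing.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def parse_item_line(line):
    """Return (month, year, genre_flags) for one pipe-separated item line."""
    fields = line.rstrip("\n").split("|")
    # Assumption: the third field is the release date, e.g. "01-Jan-1995".
    _day, month_name, year = fields[2].split("-")
    # Assumption: every field after the first five is a 0/1 genre flag.
    genre_flags = [int(flag) for flag in fields[5:]]
    return MONTHS[month_name], int(year), genre_flags


# Constructed sample line; the URL is a placeholder.
genre_part = "|".join(["0", "0", "0", "1", "1", "1"] + ["0"] * 13)
sample = "1|Toy Story (1995)|01-Jan-1995||http://example.com/toy-story|" + genre_part

month, year, genre_flags = parse_item_line(sample)
print(month, year, genre_flags)
```

Each list of genre flags is already a one hot style row, so stacking the rows for all items gives the kind of matrix that a sparse csr_matrix would store compactly.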

ml100k_user_process.py

Tool to process MovieLens 100k user metadata.

usage: ml100k_user_process [-h] datapath outpath
Positional arguments:
datapath Path to ml-100k
outpath Path to save created files to.