loader
Implements a general-purpose data loader for non-sequential machine learning tasks in Python. The module also provides several common data transformations, e.g., tfidf, whitening, etc.
API
Proposed Extensions
DataSet.split(scheme={devtraintest, crossvalidate, traintest}) returns DataSets
DataSets.join() returns DataSet (combines train or cross validation)
DataSet + DataSet returns DataSet
DataSets + DataSets returns DataSets
DataSets constructor from list of DataSet objects
DataSet for Online data
DataSet for Sequence data
Binary data formats for Streaming data
Exceptions
-
exception
loader.
BadDirectoryStructureError
Raised when the specified data directory does not contain a subfolder named in the folders argument to
read_data_sets
.
-
class
loader.
DataSet
(features, labels=None, mix=False) Data structure for mini-batch gradient descent training involving non-sequential data.
Parameters: - features – (dict) A dictionary mapping string feature names to data matrices.
Matrices may be of types
IndexVector
, scipy sparse csr_matrix, or numpy array. - labels – (dict) A dictionary of string label names to data matrices.
Matrices may be of types
IndexVector
, scipy sparse csr_matrix, or numpy array. - mix – (boolean) Whether or not to shuffle per epoch.
Examples:
>>> import numpy as np
>>> from antk.core.loader import DataSet
>>> d = DataSet({'id': np.eye(5)}, labels={'ones':np.ones((5, 2))})
>>> d
antk.core.DataSet object with fields:
    '_labels': {'ones': array([[ 1., 1.],
           [ 1., 1.],
           [ 1., 1.],
           [ 1., 1.],
           [ 1., 1.]])}
    'mix_after_epoch': False
    '_num_examples': 5
    '_index_in_epoch': 0
    '_last_batch_size': 5
    '_features': {'id': array([[ 1., 0., 0., 0., 0.],
           [ 0., 1., 0., 0., 0.],
           [ 0., 0., 1., 0., 0.],
           [ 0., 0., 0., 1., 0.],
           [ 0., 0., 0., 0., 1.]])}
>>> d.show()
features:
    id: (5, 5) <type 'numpy.ndarray'>
labels:
    ones: (5, 2) <type 'numpy.ndarray'>
>>> d.next_batch(3)
antk.core.DataSet object with fields:
    '_labels': {'ones': array([[ 1., 1.],
           [ 1., 1.],
           [ 1., 1.]])}
    'mix_after_epoch': False
    '_num_examples': 3
    '_index_in_epoch': 0
    '_last_batch_size': 3
    '_features': {'id': array([[ 1., 0., 0., 0., 0.],
           [ 0., 1., 0., 0., 0.],
           [ 0., 0., 1., 0., 0.]])}
-
features
Attribute: (dict) A dictionary with string keys and feature matrix values.
-
index_in_epoch
Attribute: (int) The number of data points that have been trained on in a particular epoch.
-
labels
Attribute: (dict) A dictionary with string keys and label matrix values.
-
next_batch
(batch_size) Method: Return a sub DataSet of the next batch_size examples.
- If shuffling is disabled (mix=False):
  - If batch_size is greater than the number of examples left in the epoch, a DataSet of batch_size rows that wraps past the beginning (rows [index_in_epoch:num_examples, 0:num_examples-index_in_epoch]) will be returned.
- If shuffling is enabled (mix=True):
  - If batch_size is greater than the number of examples left in the epoch, the points are shuffled and a DataSet of batch_size rows is returned starting from index 0.
Parameters: batch_size – (int) The number of rows in the matrices of the sub DataSet. Returns: DataSet
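The wrap-around case for mix=False can be sketched with plain NumPy. This is only an illustration under stated assumptions: next_batch_rows is a hypothetical helper (not part of loader), it works on a single matrix rather than a DataSet, it assumes batch_size is no larger than the number of rows, and it assumes the wrapped batch is completed with rows taken from the start of the matrix.

```python
import numpy as np

def next_batch_rows(data, index_in_epoch, batch_size):
    """Hypothetical helper: return (batch, new_index) with wrap-around, no shuffling."""
    num_examples = data.shape[0]
    end = index_in_epoch + batch_size
    if end <= num_examples:
        # Enough rows remain in the epoch: take a contiguous slice.
        return data[index_in_epoch:end], end % num_examples
    # Not enough rows left: take the tail, then wrap to the beginning.
    overflow = end - num_examples
    batch = np.vstack([data[index_in_epoch:], data[:overflow]])
    return batch, overflow

data = np.arange(10).reshape(5, 2)              # 5 examples, 2 columns
batch, idx = next_batch_rows(data, 0, 3)        # rows 0, 1, 2
wrapped, idx2 = next_batch_rows(data, idx, 3)   # rows 3, 4, then row 0
```

The second call shows the wrap: only two rows remain in the epoch, so the third row of the batch comes from the start of the matrix.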
-
reset_index_to_zero
() Method: Sets index_in_epoch
to 0.
-
class
loader.
DataSets
(datasets_map={}, mix=False) A record of
DataSet
objects. Parameters: - datasets_map – (dict) A dictionary with string keys and DataSet objects as values.
- mix – (boolean) Whether or not to enable shuffling for mini-batching.
Attributes: (
DataSet
) There is an attribute for each key-value pair in the datasets_map argument.
Examples:
>>> import numpy as np
>>> from antk.core.loader import DataSets
>>> from antk.core.loader import DataSet
>>> d = DataSets({'train': DataSet({'id': np.eye(5)}, labels={'one': np.ones((5,6))}),
...               'dev': DataSet({'id': 5*np.eye(2)}, labels={'one': 5*np.ones((2,6))})})
>>> d.show()
dev:
    features:
        id: (2, 2) <type 'numpy.ndarray'>
    labels:
        one: (2, 6) <type 'numpy.ndarray'>
train:
    features:
        id: (5, 5) <type 'numpy.ndarray'>
    labels:
        one: (5, 6) <type 'numpy.ndarray'>
>>> d.showmore()
dev:
    features:
        id:
        First 2 rows:
        [[ 5. 0.]
         [ 0. 5.]]
    labels:
        one:
        First 2 rows:
        [[ 5. 5. 5. 5. 5. 5.]
         [ 5. 5. 5. 5. 5. 5.]]
train:
    features:
        id:
        First 5 rows:
        [[ 1. 0. 0. 0. 0.]
         [ 0. 1. 0. 0. 0.]
         [ 0. 0. 1. 0. 0.]
         [ 0. 0. 0. 1. 0.]
         [ 0. 0. 0. 0. 1.]]
    labels:
        one:
        First 5 rows:
        [[ 1. 1. 1. 1. 1. 1.]
         [ 1. 1. 1. 1. 1. 1.]
         [ 1. 1. 1. 1. 1. 1.]
         [ 1. 1. 1. 1. 1. 1.]
         [ 1. 1. 1. 1. 1. 1.]]
-
class
loader.
HotIndex
(matrix, dimension=None) Same data structure as
IndexVector
. This is the legacy name.
-
class
loader.
IndexVector
(matrix, dimension=None) Index vector representation of a one hot matrix.
Parameters: - matrix – (scipy.sparse.csr_matrix or numpy array) A one hot matrix, or a vector of the hot indices of a one hot matrix. If matrix is a vector of indices and no dimension argument is supplied, then dimension is set to the maximum index value + 1.
- dimension – (int) The number of columns in the one hot matrix to be represented.
Note
IndexVector objects implement the python sequence protocol, so slicing, indexing and iteration behave as you might expect. Slices of an IndexVector return another IndexVector. Indexing returns an integer. Iteration will loop over all the elements in the
vec
attribute.
Examples:
>>> import numpy as np
>>> from antk.core import loader
>>> xhot = np.array([[1,0,0], [1,0,0], [0,1,0], [0,0,1]])
>>> xindex = loader.IndexVector(xhot)
>>> xindex.vec
array([0, 0, 1, 2])
>>> xindex.dim
3
>>> xindex.hot()
<4x3 sparse matrix of type '<type 'numpy.float64'>'
    with 4 stored elements in Compressed Sparse Row format>
>>> xindex.hot().toarray()
array([[ 1., 0., 0.],
       [ 1., 0., 0.],
       [ 0., 1., 0.],
       [ 0., 0., 1.]])
>>> xindex.shape
(4, 3)
>>> xindex
<class 'antk.core.loader.IndexVector'>(shape=(4, 3))
vec=[0, 0, 1, 2]
dim=3
>>> xindex[0]
0
>>> xindex[1:3]
<class 'antk.core.loader.IndexVector'>(shape=(2, 3))
vec=[0, 1]
dim=3
>>> [index+2 for index in xindex]
[2, 2, 3, 4]
-
dim
Attribute: (int) The feature dimension (number of columns) of the one hot matrix.
-
shape
Attribute: (tuple) The shape of the one hot matrix encoded.
-
vec
Attribute: (numpy 1d array) The vector of hot indices.
-
exception
loader.
MatFormatError
Raised if the .mat file being read does not contain a variable named data.
-
exception
loader.
SparseFormatError
Raised when reading a plain text file with .sparsetxt extension and there are not three entries per line.
-
exception
loader.
UnsupportedFormatError
Raised when a file is requested to be loaded or saved without one of the supported file extensions.
-
loader.
center
(X, axis=None) Parameters: X – (numpy array or scipy.sparse.csr_matrix) A matrix to center about the mean (over columns axis=0, over rows axis=1, over all entries axis=None). Returns: A matrix with entries centered along the specified axis.
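As a rough illustration of what centering along an axis means, the sketch below subtracts the mean with NumPy. It is a dense-only stand-in under assumptions: center_dense is a hypothetical name, sparse input is not handled, and the axis argument follows NumPy's convention (axis=0 gives every column zero mean).

```python
import numpy as np

def center_dense(X, axis=None):
    """Hypothetical dense-only stand-in: subtract the mean along `axis`."""
    X = np.asarray(X, dtype=float)
    # keepdims=True lets the mean broadcast back against X for any axis.
    return X - X.mean(axis=axis, keepdims=True)

X = np.array([[1.0, 2.0], [3.0, 6.0]])
by_column = center_dense(X, axis=0)   # each column now has mean 0
by_row = center_dense(X, axis=1)      # each row now has mean 0
overall = center_dense(X)             # the whole matrix has mean 0
```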
-
loader.
export_data
(filename, data) Decides how to save data by file extension. Raises
UnsupportedFormatError
if extension is not one of the supported extensions (mat, sparse, binary, dense, index). Data contained in .mat files should be saved in a matrix named data. Parameters: - filename – A file of an accepted format representing a matrix.
- data – A numpy array, scipy sparse matrix, or
IndexVector
object.
-
loader.
import_data
(filename) Decides how to load data into Python matrices by file extension. Raises
UnsupportedFormatError
if extension is not one of the supported extensions (mat, sparse, binary, dense, sparsetxt, densetxt, index). Parameters: filename – (str) A file of an accepted format representing a matrix. Returns: A numpy matrix, scipy sparse csr_matrix, or IndexVector.
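Dispatching on the file extension can be sketched as follows. This is not the module's code: extension_of and UnsupportedFormatErrorSketch are hypothetical names used for illustration, and only the extension check is shown, not the actual readers for each format.

```python
import os

# Extensions listed in the import_data description above.
SUPPORTED_EXTENSIONS = {"mat", "sparse", "binary", "dense",
                        "sparsetxt", "densetxt", "index"}

class UnsupportedFormatErrorSketch(Exception):
    """Hypothetical stand-in for loader.UnsupportedFormatError."""

def extension_of(filename):
    """Return the extension without its dot, or raise if it is unsupported."""
    ext = os.path.splitext(filename)[1].lstrip(".")
    if ext not in SUPPORTED_EXTENSIONS:
        raise UnsupportedFormatErrorSketch("unsupported extension: %r" % ext)
    return ext

kind = extension_of("features_id.dense")
```

A loader built this way would then branch on the returned string to pick the matching reader.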
-
loader.
is_one_hot
(A) Parameters: A – A 2-d numpy array or scipy sparse matrix
Returns: True if matrix is a sparse matrix of one hot vectors, False otherwise
Examples:
>>> import numpy as np
>>> from antk.core import loader
>>> x = np.eye(3)
>>> loader.is_one_hot(x)
True
>>> x *= 5
>>> loader.is_one_hot(x)
False
>>> x = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0]])
>>> loader.is_one_hot(x)
True
>>> x[0,1] = 2
>>> loader.is_one_hot(x)
False
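The examples above are consistent with a simple rule: every entry is 0 or 1 and every row contains exactly one 1. A minimal dense sketch of that rule, assuming this reading is right, is shown below; looks_one_hot is a hypothetical name and sparse input is not handled.

```python
import numpy as np

def looks_one_hot(A):
    """Hypothetical dense check: all entries are 0 or 1 and each row sums to 1."""
    A = np.asarray(A)
    binary = np.isin(A, (0, 1)).all()
    return bool(binary and (A.sum(axis=1) == 1).all())

identity_ok = looks_one_hot(np.eye(3))        # one 1 per row
scaled_ok = looks_one_hot(5 * np.eye(3))      # entries are 5, not 1
```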
-
loader.
l1normalize
(X, axis=1) axis=1 normalizes each row of X by the L1 norm of that row. \(l1normalize(X)_{ij} = \frac{X_{ij}}{\sum_k |X_{ik}|}\)
axis=0 normalizes each column of X by the L1 norm of that column. \(l1normalize(X)_{ij} = \frac{X_{ij}}{\sum_k |X_{kj}|}\)
Parameters: - X – A scipy sparse csr_matrix or numpy array.
- axis – The dimension to normalize over.
Returns: A normalized matrix.
Raise: ValueError
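The two formulas can be checked numerically with a small dense sketch. The name l1normalize_dense is hypothetical, sparse input is not handled, rows or columns that are entirely zero are not guarded against, and raising ValueError for an axis other than 0 or 1 is an assumption about when the documented exception occurs.

```python
import numpy as np

def l1normalize_dense(X, axis=1):
    """Hypothetical dense-only version of the formulas above."""
    X = np.asarray(X, dtype=float)
    if axis not in (0, 1):
        raise ValueError("axis must be 0 or 1")   # assumed trigger
    # Sum of absolute values along the chosen axis, kept 2-d for broadcasting.
    norms = np.abs(X).sum(axis=axis, keepdims=True)
    return X / norms

X = np.array([[1.0, -3.0], [2.0, 2.0]])
rows = l1normalize_dense(X, axis=1)   # each row's absolute values sum to 1
cols = l1normalize_dense(X, axis=0)   # each column's absolute values sum to 1
```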
-
loader.
l2normalize
(X, axis=1) axis=1 normalizes each row of X by the L2 norm of that row. \(l2normalize(X)_{ij} = \frac{X_{ij}}{\sqrt{\sum_k X_{ik}^2}}\)
axis=0 normalizes each column of X by the L2 norm of that column. \(l2normalize(X)_{ij} = \frac{X_{ij}}{\sqrt{\sum_k X_{kj}^2}}\)
Parameters: - X – A scipy sparse csr_matrix or numpy array.
- axis – The dimension to normalize over.
Returns: A normalized matrix.
Raise: ValueError
-
loader.
load
(filename) Calls
import_data
. Decides how to load data into Python matrices by file extension. Raises UnsupportedFormatError
if extension is not one of the supported extensions (mat, sparse, binary, dense, sparsetxt, densetxt, index). Parameters: filename – (str) A file of an accepted format representing a matrix. Returns: A numpy matrix, scipy sparse csr_matrix, or IndexVector.
-
loader.
makedirs
(datadirectory, sub_directory_list=('train', 'dev', 'test')) Parameters: - datadirectory – Name of the directory you want to create containing the subdirectory folders. If the directory already exists it will be populated with the subdirectory folders.
- sub_directory_list – The list of subdirectories you want to create
Returns: None
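The described behavior (create the directory and its subfolders, and populate the directory if it already exists) maps naturally onto os.makedirs. The sketch below is an illustration only: make_data_dirs is a hypothetical name, and it uses the Python 3 exist_ok flag, which may differ from how the module itself does it.

```python
import os
import tempfile

def make_data_dirs(datadirectory, sub_directory_list=("train", "dev", "test")):
    """Hypothetical stand-in: create datadirectory and its subfolders if missing."""
    for sub in sub_directory_list:
        # exist_ok=True makes repeated calls harmless.
        os.makedirs(os.path.join(datadirectory, sub), exist_ok=True)

root = os.path.join(tempfile.mkdtemp(), "test_data")
make_data_dirs(root)
make_data_dirs(root)                      # second call is a no-op
created = sorted(os.listdir(root))
```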
-
loader.
maxnormalize
(X, axis=1) axis=1 normalizes each row of X by the maximum entry of that row. \(maxnormalize(X)_{ij} = \frac{X_{ij}}{max(X_{i:})}\)
axis=0 normalizes each column of X by the maximum entry of that column. \(maxnormalize(X)_{ij} = \frac{X_{ij}}{max(X_{:j})}\)
Parameters: - X – A scipy sparse csr_matrix or numpy array.
- axis – The dimension to normalize over.
Returns: A normalized matrix.
Raise: ValueError
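A short numerical check of the max formulas, under assumptions: maxnormalize_dense is a hypothetical dense-only helper, the divisor is the plain maximum as written in the formulas (not the maximum absolute value), and rows or columns whose maximum is zero are not guarded against.

```python
import numpy as np

def maxnormalize_dense(X, axis=1):
    """Hypothetical dense-only version: divide by the max along `axis`."""
    X = np.asarray(X, dtype=float)
    if axis not in (0, 1):
        raise ValueError("axis must be 0 or 1")   # assumed trigger
    return X / X.max(axis=axis, keepdims=True)

X = np.array([[1.0, 4.0], [2.0, 8.0]])
rows = maxnormalize_dense(X)           # largest entry of each row becomes 1
cols = maxnormalize_dense(X, axis=0)   # largest entry of each column becomes 1
```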
-
loader.
maybe_download
(filename, directory, source_url) Download the data from source url, unless it’s already here. From https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/datasets/base.py
Parameters: - filename – string, name of the file in the directory.
- directory – string, path to working directory.
- source_url – url to download from if file doesn’t exist.
Returns: Path to resulting file.
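The download-unless-present pattern can be sketched with the standard library. This is a hypothetical stand-in, not the module's implementation: maybe_download_sketch and the example URL are made up for illustration, and the demo only exercises the branch where the file already exists, so no network access happens.

```python
import os
import tempfile
import urllib.request

def maybe_download_sketch(filename, directory, source_url):
    """Hypothetical stand-in: fetch source_url only when the file is missing."""
    os.makedirs(directory, exist_ok=True)
    filepath = os.path.join(directory, filename)
    if not os.path.exists(filepath):
        urllib.request.urlretrieve(source_url, filepath)
    return filepath

# Demonstrate the "already here" path without touching the network.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "data.dense"), "w") as handle:
    handle.write("placeholder")
path = maybe_download_sketch("data.dense", workdir,
                             "http://example.invalid/data.dense")
```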
-
loader.
read_data_sets
(directory, folders=('train', 'dev', 'test'), hashlist=(), mix=False) Parameters: - directory – (str) Root directory containing data to load.
- folders – (list or tuple) The subfolders of directory to read data from. By default these are train, dev, and test. To read other folders, pass an explicit list.
- hashlist – (list or tuple) If you provide a hashlist, these files and
only these files will be added to your
DataSet
objects. If you do not provide a hashlist, then anything with the privileged prefixes labels_ or features_ will be loaded. - mix – (boolean) Whether to shuffle during mini-batching.
Returns: A
DataSets
object.
Examples:
>>> import antk.core.loader as loader
>>> import numpy as np
>>> loader.makedirs('/tmp/test_data/')
>>> loader.save('/tmp/test_data/test/features_id.dense', np.eye(5))
>>> loader.save('/tmp/test_data/test/features_ones.dense', np.ones((5, 2)))
>>> loader.save('/tmp/test_data/test/labels_id.dense', np.eye(5))
>>> loader.save('/tmp/test_data/dev/features_id.dense', np.eye(5))
>>> loader.save('/tmp/test_data/dev/features_ones.dense', np.ones((5, 2)))
>>> loader.save('/tmp/test_data/dev/labels_id.dense', np.eye(5))
>>> loader.save('/tmp/test_data/train/features_id.dense', np.eye(5))
>>> loader.save('/tmp/test_data/train/features_ones.dense', np.ones((5, 2)))
>>> loader.save('/tmp/test_data/train/labels_id.dense', np.eye(5))
>>> loader.read_data_sets('/tmp/test_data').show()
reading train...
reading dev...
reading test...
dev:
    features:
        ones: (5, 2) <type 'numpy.ndarray'>
        id: (5, 5) <type 'numpy.ndarray'>
    labels:
        id: (5, 5) <type 'numpy.ndarray'>
test:
    features:
        ones: (5, 2) <type 'numpy.ndarray'>
        id: (5, 5) <type 'numpy.ndarray'>
    labels:
        id: (5, 5) <type 'numpy.ndarray'>
train:
    features:
        ones: (5, 2) <type 'numpy.ndarray'>
        id: (5, 5) <type 'numpy.ndarray'>
    labels:
        id: (5, 5) <type 'numpy.ndarray'>
>>> loader.read_data_sets('/tmp/test_data',
...                       folders=['train', 'dev'],
...                       hashlist=['ones']).show()
reading train...
reading dev...
dev:
    features:
        ones: (5, 2) <type 'numpy.ndarray'>
    labels:
train:
    features:
        ones: (5, 2) <type 'numpy.ndarray'>
    labels:
-
loader.
save
(filename, data) Calls export_data. Decides how to save data by file extension. Raises
UnsupportedFormatError
if extension is not one of the supported extensions (mat, sparse, binary, dense, index). Data contained in .mat files should be saved in a matrix named data. Parameters: - filename – (str) A filename with extension of an accepted format for representing a matrix. - data – A numpy array, scipy sparse matrix, or
IndexVector
object.
-
loader.
tfidf
(X, norm='l2') Parameters: - X – (numpy array or scipy.sparse.csr_matrix) A document-term matrix with term counts.
- norm – Normalization strategy: l2row normalizes the scores of rows by the length of rows after basic tfidf (each document vector is a unit vector); count normalizes the scores of rows by the total word count of a document; max normalizes the scores of rows by the maximum count for a single word in a document.
Returns: The tfidf of document-term matrix X with optional normalization.
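The exact weighting the module uses is not stated here, so the sketch below shows one common formulation as an assumption: term count times log(N / document frequency), followed by the row-length normalization that makes each document vector a unit vector. tfidf_dense is a hypothetical dense-only helper.

```python
import numpy as np

def tfidf_dense(X):
    """Hypothetical dense sketch: tf * log(N / df), then L2-normalize rows."""
    X = np.asarray(X, dtype=float)
    num_docs = X.shape[0]
    doc_freq = (X > 0).sum(axis=0)        # documents containing each term
    idf = np.log(num_docs / doc_freq)     # assumes every term occurs somewhere
    scores = X * idf
    norms = np.sqrt((scores ** 2).sum(axis=1, keepdims=True))
    return scores / norms                 # assumes no all-zero score rows

# Rows are documents, columns are terms, entries are term counts.
counts = np.array([[2, 0, 1],
                   [0, 3, 1],
                   [1, 1, 0]])
weights = tfidf_dense(counts)
```

With this formulation a term that appears in every document gets weight 0, which is why the zero-row case would need extra handling in real use.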
-
loader.
toIndex
(A) Parameters: A – (numpy array or scipy.sparse.csr_matrix) A matrix of one hot row vectors.
Returns: The hot indices.
Examples:
>>> import numpy as np
>>> from antk.core import loader
>>> x = np.array([[1,0,0], [0,0,1], [1,0,0]])
>>> loader.toIndex(x)
array([0, 2, 0])
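For a dense one hot matrix, the hot index of each row is simply the column holding the 1, which NumPy's argmax recovers. This is an equivalent computation for the dense example above, not necessarily how toIndex is implemented.

```python
import numpy as np

# Same matrix as in the example above.
x = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 0]])
# argmax along each row returns the column of the single 1.
hot_indices = np.argmax(x, axis=1)
```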
-
loader.
toOnehot
(X, dim=None) Parameters: - X – (numpy array) Vector of indices or
IndexVector
object - dim – (int) Dimension of indexing
Returns: A sparse csr_matrix of one hots.
Examples:
>>> import numpy as np
>>> from antk.core import loader
>>> x = np.array([0, 1, 2, 3])
>>> loader.toOnehot(x)
<4x4 sparse matrix of type '<type 'numpy.float64'>'...
>>> loader.toOnehot(x).toarray()
array([[ 1., 0., 0., 0.],
       [ 0., 1., 0., 0.],
       [ 0., 0., 1., 0.],
       [ 0., 0., 0., 1.]])
>>> x = loader.IndexVector(x, dimension=8)
>>> loader.toOnehot(x).toarray()
array([[ 1., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 1., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 1., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 1., 0., 0., 0., 0.]])
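The expansion from indices to one hot rows can be sketched densely by indexing into an identity matrix. to_onehot_dense is a hypothetical helper: it returns a dense array rather than a sparse csr_matrix, and it assumes that a missing dim defaults to the maximum index + 1, as described for IndexVector.

```python
import numpy as np

def to_onehot_dense(indices, dim=None):
    """Hypothetical dense stand-in for expanding indices into one hot rows."""
    indices = np.asarray(indices)
    if dim is None:
        dim = int(indices.max()) + 1      # assumed default dimension
    # Row i of the identity matrix is the one hot vector for index i.
    return np.eye(dim)[indices]

square = to_onehot_dense([0, 1, 2, 3])        # 4 rows, 4 columns
wide = to_onehot_dense([0, 1, 2, 3], dim=8)   # 4 rows, 8 columns
```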
-
loader.
unit_variance
(X, axis=None) Parameters: - X – (numpy array or scipy.sparse.csr_matrix) A matrix to transform to have unit variance (over columns axis=0, over rows axis=1, over all entries axis=None)
- axis – The axis to perform the transform.
Returns: A matrix with unit variance along the specified axis.
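Scaling to unit variance amounts to dividing by the standard deviation along the chosen axis. The sketch below is a dense-only illustration under assumptions: unit_variance_dense is a hypothetical name, it uses the population standard deviation (NumPy's default, ddof=0), and it does not guard against zero variance.

```python
import numpy as np

def unit_variance_dense(X, axis=None):
    """Hypothetical dense-only stand-in: divide by the std along `axis`."""
    X = np.asarray(X, dtype=float)
    return X / X.std(axis=axis, keepdims=True)

X = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
scaled = unit_variance_dense(X, axis=0)   # each column now has variance 1
overall = unit_variance_dense(X)          # the whole matrix has variance 1
```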