loader

Implements a general-purpose data loader for Python machine learning tasks involving non-sequential data. Several common data transformations are provided in this module, e.g., tfidf and whitening.

API

Proposed Extensions

DataSet.split(scheme={devtraintest, crossvalidate, traintest}) returns DataSets (see the usage sketch after this list)

DataSets.join() returns DataSet (combines train/dev/test splits or cross-validation folds)

DataSet + DataSet returns DataSet

DataSets + DataSets returns DataSets

DataSets constructor from list of DataSet objects

DataSet for Online data

DataSet for Sequence data

Binary data formats for Streaming data
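
A hypothetical sketch of how the proposed split/join/+ API might read (none of these calls exist yet; names, arguments, and return values are illustrative only):
>>> import numpy as np
>>> from antk.core.loader import DataSet
>>> d = DataSet({'id': np.eye(10)}, labels={'y': np.ones((10, 1))})
>>> splits = d.split(scheme='devtraintest')  # would return a DataSets object
>>> whole = splits.join()                    # would recombine the splits into one DataSet
>>> doubled = d + d                          # would concatenate rows into a 20-example DataSet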

Loading, Saving, and Testing

save

load

is_one_hot

read_data_sets

untar

maybe_download

Exceptions

BadDirectoryStructureError

MatFormatError

SparseFormatError

UnsupportedFormatError

exception loader.BadDirectoryStructureError[source]

Raised when a specified data directory does not contain a subfolder named in the folders argument to read_data_sets.

class loader.DataSet(features, labels=None, mix=False)[source]

Data structure for mini-batch gradient descent training involving non-sequential data.

Parameters:
  • features – (dict) A dictionary of string feature names to data matrices. Matrices may be of types IndexVector, scipy sparse csr_matrix, or numpy array.
  • labels – (dict) A dictionary of string label names to data matrices. Matrices may be of types IndexVector, scipy sparse csr_matrix, or numpy array.
  • mix – (boolean) Whether or not to shuffle per epoch.
Examples:
>>> import numpy as np
>>> from antk.core.loader import DataSet
>>> d = DataSet({'id': np.eye(5)}, labels={'ones':np.ones((5, 2))})
>>> d 
antk.core.DataSet object with fields:
'_labels': {'ones': array([[ 1.,  1.],
                           [ 1.,  1.],
                           [ 1.,  1.],
                           [ 1.,  1.],
                           [ 1.,  1.]])}
'mix_after_epoch': False
'_num_examples': 5
'_index_in_epoch': 0
'_last_batch_size': 5
'_features': {'id': array([[ 1.,  0.,  0.,  0.,  0.],
                           [ 0.,  1.,  0.,  0.,  0.],
                           [ 0.,  0.,  1.,  0.,  0.],
                           [ 0.,  0.,  0.,  1.,  0.],
                           [ 0.,  0.,  0.,  0.,  1.]])}
>>> d.show() 
features:
     id: (5, 5) <type 'numpy.ndarray'>
labels:
     ones: (5, 2) <type 'numpy.ndarray'>
>>> d.next_batch(3) 
antk.core.DataSet object with fields:
    '_labels': {'ones': array([[ 1.,  1.],
                               [ 1.,  1.],
                               [ 1.,  1.]])}
    'mix_after_epoch': False
    '_num_examples': 3
    '_index_in_epoch': 0
    '_last_batch_size': 3
    '_features': {'id': array([[ 1.,  0.,  0.,  0.,  0.],
                               [ 0.,  1.,  0.,  0.,  0.],
                               [ 0.,  0.,  1.,  0.,  0.]])}
features
Attribute:(dict) A dictionary with string keys and feature matrix values.
index_in_epoch
Attribute:(int) The number of data points that have been trained on in a particular epoch.
labels
Attribute:(dict) A dictionary with string keys and label matrix values.
next_batch(batch_size)[source]
Method:
Return a sub-DataSet of the next batch_size examples.
If shuffling is disabled (mix=False):
If batch_size is greater than the number of examples left in the epoch, a batch_size DataSet that wraps past the beginning (rows [index_in_epoch:num_examples] followed by rows [0:batch_size - (num_examples - index_in_epoch)]) will be returned.
If shuffling is enabled (mix=True):
If batch_size is greater than the number of examples left in the epoch, the points are shuffled and a batch_size DataSet is returned starting from index 0.
Parameters:batch_size – (int) The number of rows in the matrices of the sub DataSet.
Returns:DataSet
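A brief added sketch of the wrapping behavior (given a fresh 5-example DataSet d like the one in the class example above; no output asserted):
>>> b1 = d.next_batch(4)  # rows 0-3; index_in_epoch advances to 4
>>> b2 = d.next_batch(4)  # only 1 example left: with mix=False the batch wraps, giving row 4 then rows 0-2
>>> d.reset_index_to_zero()  # start the next epoch from row 0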
num_examples
Attribute:(int) Number of rows (data points) of the matrices in this DataSet.
reset_index_to_zero()[source]
Method:Sets index_in_epoch to 0.
show()[source]
Method:Prints the data specs (dimensions, keys, type) in the DataSet object.
showmore()[source]
Method:Prints the data specs (dimensions, keys, type) in the DataSet object, along with a sample of up to the first twenty rows for matrices in the DataSet.

shuffle()[source]
Method:The same random permutation is applied to the rows of all the matrices in features and labels.
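A small added check that shuffling keeps rows aligned, since the same permutation is applied to every matrix in features and labels:
>>> import numpy as np
>>> from antk.core.loader import DataSet
>>> rows = np.arange(5, dtype=float).reshape((5, 1))
>>> d = DataSet({'x': rows}, labels={'y': rows.copy()})
>>> d.shuffle()
>>> np.array_equal(d.features['x'], d.labels['y'])  # permuted together, so still equal
True
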
class loader.DataSets(datasets_map={}, mix=False)[source]

A record of DataSet objects.

Parameters:
  • datasets_map – (dict) A dictionary with string keys and DataSet objects as values.
  • mix – (boolean) Whether or not to enable shuffling for mini-batching.
Attributes:

(DataSet) There is an attribute for each key-value pair in the datasets_map argument.

Examples:
>>> import numpy as np
>>> from antk.core.loader import DataSets
>>> from antk.core.loader import DataSet
>>> d = DataSets({'train': DataSet({'id': np.eye(5)}, labels={'one': np.ones((5,6))}),
...               'dev': DataSet({'id': 5*np.eye(2)}, labels={'one': 5*np.ones((2,6))})})
>>> d.show() 
dev:
features:
     id: (2, 2) <type 'numpy.ndarray'>
labels:
     one: (2, 6) <type 'numpy.ndarray'>
train:
features:
     id: (5, 5) <type 'numpy.ndarray'>
labels:
     one: (5, 6) <type 'numpy.ndarray'>
>>> d.showmore() 
dev:
features:
     id:
First 2 rows:
[[ 5.  0.]
 [ 0.  5.]]

labels:
     one:
First 2 rows:
[[ 5.  5.  5.  5.  5.  5.]
 [ 5.  5.  5.  5.  5.  5.]]

train:
features:
     id:
First 5 rows:
[[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]]

labels:
     one:
First 5 rows:
[[ 1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.]]
show()[source]
Method:Pretty print data attributes.
showmore()[source]
Method:Pretty print data attributes, and data.
class loader.HotIndex(matrix, dimension=None)[source]

Same data structure as IndexVector. This is the legacy name.

class loader.IndexVector(matrix, dimension=None)[source]

Index vector representation of a one hot matrix.

Parameters:
  • matrix – (scipy.sparse.csr_matrix or numpy array) A one hot matrix, or a vector of the hot indices of a one hot matrix. If matrix is a vector of indices and no dimension argument is supplied, then dimension is set to the maximum index value + 1.
  • dimension – (int) The number of columns in the one hot matrix to be represented.

Note

IndexVector objects implement the python sequence protocol, so slicing, indexing and iteration behave as you might expect. Slices of an IndexVector return another IndexVector. Indexing returns an integer. Iteration will loop over all the elements in the vec attribute.

Examples:
>>> import numpy as np
>>> from antk.core import loader
>>> xhot = np.array([[1,0,0], [1,0,0], [0,1,0], [0,0,1]])
>>> xindex = loader.IndexVector(xhot)
>>> xindex.vec
array([0, 0, 1, 2])
>>> xindex.dim
3
>>> xindex.hot() 
<4x3 sparse matrix of type '<type 'numpy.float64'>'
    with 4 stored elements in Compressed Sparse Row format>
>>> xindex.hot().toarray() 
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])
>>> xindex.shape
(4, 3)
>>> xindex
<class 'antk.core.loader.IndexVector'>(shape=(4, 3))
vec=[0, 0, 1, 2]
dim=3
>>> xindex[0]
0
>>> xindex[1:3]
<class 'antk.core.loader.IndexVector'>(shape=(2, 3))
vec=[0, 1]
dim=3
>>> [index+2 for index in xindex]
[2, 2, 3, 4]
dim
Attribute:(int) The feature dimension (number of columns) of the one hot matrix.
hot()[source]
Method:
Returns:A one hot scipy sparse csr_matrix
shape
Attribute:(tuple) The shape of the one hot matrix encoded.
vec
Attribute:(numpy 1d array) The vector of hot indices.
exception loader.MatFormatError[source]

Raised if the .mat file being read does not contain a variable named data.

exception loader.SparseFormatError[source]

Raised when reading a plain text file with the .sparsetxt extension and there are not three entries per line.
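For illustration, each line of a .sparsetxt file presumably encodes one nonzero matrix entry as a whitespace-separated row, column, value triple, e.g.:
0 2 1.5
3 0 2.0
4 4 0.5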

exception loader.UnsupportedFormatError[source]

Raised when a file is requested to be loaded or saved without one of the supported file extensions.

loader.center(X, axis=None)[source]
Parameters:X – (numpy array or scipy.sparse.csr_matrix) A matrix to center about the mean (over columns axis=0, over rows axis=1, over all entries axis=None).
Returns:A matrix with entries centered along the specified axis.
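Examples (an added sketch; assumes center subtracts the mean along the given axis, with array output formatted as elsewhere in these docs):
>>> import numpy as np
>>> from antk.core import loader
>>> X = np.array([[1., 2.], [3., 4.]])
>>> loader.center(X, axis=0)  # column means are [2., 3.]
array([[-1., -1.],
       [ 1.,  1.]])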
loader.export_data(filename, data)[source]

Decides how to save data by file extension. Raises UnsupportedFormatError if extension is not one of the supported extensions (mat, sparse, binary, dense, index). Data contained in .mat files should be saved in a matrix named data.

Parameters:
  • filename – (str) A filename with an extension of an accepted format for representing a matrix.
  • data – A numpy array, scipy sparse matrix, or IndexVector object.
loader.import_data(filename)[source]

Decides how to load data into python matrices by file extension. Raises UnsupportedFormatError if extension is not one of the supported extensions (mat, sparse, binary, dense, sparsetxt, densetxt, index).

Parameters:filename – (str) A file of an accepted format representing a matrix.
Returns:A numpy matrix, scipy sparse csr_matrix, or IndexVector.
loader.is_one_hot(A)[source]
Parameters:

A – A 2-d numpy array or scipy sparse matrix

Returns:

True if the matrix consists of one hot row vectors, False otherwise

Examples:
>>> import numpy as np
>>> from antk.core import loader
>>> x = np.eye(3)
>>> loader.is_one_hot(x)
True
>>> x *= 5
>>> loader.is_one_hot(x)
False
>>> x = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0]])
>>> loader.is_one_hot(x)
True
>>> x[0,1] = 2
>>> loader.is_one_hot(x)
False
loader.l1normalize(X, axis=1)[source]

axis=1 normalizes each row of X by the norm of said row. \(l1normalize(X)_{ij} = \frac{X_{ij}}{\sum_k |X_{ik}|}\)

axis=0 normalizes each column of X by the norm of said column. \(l1normalize(X)_{ij} = \frac{X_{ij}}{\sum_k |X_{kj}|}\)

Parameters:
  • X – A scipy sparse csr_matrix or numpy array.
  • axis – The dimension to normalize over.
Returns:

A normalized matrix.

Raise:

ValueError
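
A quick worked example added for illustration: each row is divided by its l1 norm (row sums here are 4 and 4).
>>> import numpy as np
>>> from antk.core import loader
>>> loader.l1normalize(np.array([[1., 3.], [2., 2.]]))
array([[ 0.25,  0.75],
       [ 0.5 ,  0.5 ]])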

loader.l2normalize(X, axis=1)[source]

axis=1 normalizes each row of X by the norm of said row. \(l2normalize(X)_{ij} = \frac{X_{ij}}{\sqrt{\sum_k X_{ik}^2}}\)

axis=0 normalizes each column of X by the norm of said column. \(l2normalize(X)_{ij} = \frac{X_{ij}}{\sqrt{\sum_k X_{kj}^2}}\)

Parameters:
  • X – A scipy sparse csr_matrix or numpy array.
  • axis – The dimension to normalize over.
Returns:

A normalized matrix.

Raise:

ValueError
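
And similarly for l2, added for illustration (rows [3, 4] and [0, 5] both have l2 norm 5):
>>> import numpy as np
>>> from antk.core import loader
>>> loader.l2normalize(np.array([[3., 4.], [0., 5.]]))
array([[ 0.6,  0.8],
       [ 0. ,  1. ]])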

loader.load(filename)[source]

Calls import_data. Decides how to load data into python matrices by file extension. Raises UnsupportedFormatError if extension is not one of the supported extensions (mat, sparse, binary, dense, sparsetxt, densetxt, index).

Parameters:filename – (str) A file of an accepted format representing a matrix.
Returns:A numpy matrix, scipy sparse csr_matrix, or IndexVector.
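Examples (an added round-trip sketch; the save call and .dense extension mirror the read_data_sets examples below):
>>> import numpy as np
>>> from antk.core import loader
>>> loader.save('/tmp/x.dense', np.eye(3))
>>> loader.load('/tmp/x.dense')
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])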
loader.makedirs(datadirectory, sub_directory_list=('train', 'dev', 'test'))[source]
Parameters:
  • datadirectory – Name of the directory you want to create containing the subdirectory folders. If the directory already exists it will be populated with the subdirectory folders.
  • sub_directory_list – The list of subdirectories you want to create.
Returns:

None

loader.maxnormalize(X, axis=1)[source]

axis=1 normalizes each row of X by the max of said row. \(maxnormalize(X)_{ij} = \frac{X_{ij}}{\max(X_{i:})}\)

axis=0 normalizes each column of X by the max of said column. \(maxnormalize(X)_{ij} = \frac{X_{ij}}{\max(X_{:j})}\)

Parameters:
  • X – A scipy sparse csr_matrix or numpy array.
  • axis – The dimension to normalize over.
Returns:

A normalized matrix.

Raise:

ValueError
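
And for max normalization, added for illustration (row maxima are 4 and 10):
>>> import numpy as np
>>> from antk.core import loader
>>> loader.maxnormalize(np.array([[2., 4.], [10., 5.]]))
array([[ 0.5,  1. ],
       [ 1. ,  0.5]])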

loader.maybe_download(filename, directory, source_url)[source]

Download the data from source url, unless it’s already here. From https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/datasets/base.py

Parameters:
  • filename – string, name of the file in the directory.
  • directory – string, path to working directory.
  • source_url – url to download from if file doesn’t exist.
Returns:

Path to resulting file.

loader.read_data_sets(directory, folders=('train', 'dev', 'test'), hashlist=(), mix=False)[source]
Parameters:
  • directory – (str) Root directory containing data to load.
  • folders – (list) The subfolders of directory to read data from. By default there are train, dev, and test folders. If you want others you have to make an explicit list.
  • hashlist – (list) If you provide a hashlist, these files and only these files will be added to your DataSet objects. If you do not provide a hashlist, then anything with the privileged prefixes labels_ or features_ will be loaded.
  • mix – (boolean) Whether to shuffle during mini-batching.
Returns:

A DataSets object.

Examples:
>>> import antk.core.loader as loader
>>> import numpy as np
>>> loader.makedirs('/tmp/test_data/')
>>> loader.save('/tmp/test_data/test/features_id.dense', np.eye(5))
>>> loader.save('/tmp/test_data/test/features_ones.dense', np.ones((5, 2)))
>>> loader.save('/tmp/test_data/test/labels_id.dense', np.eye(5))
>>> loader.save('/tmp/test_data/dev/features_id.dense', np.eye(5))
>>> loader.save('/tmp/test_data/dev/features_ones.dense', np.ones((5, 2)))
>>> loader.save('/tmp/test_data/dev/labels_id.dense', np.eye(5))
>>> loader.save('/tmp/test_data/train/features_id.dense', np.eye(5))
>>> loader.save('/tmp/test_data/train/features_ones.dense', np.ones((5, 2)))
>>> loader.save('/tmp/test_data/train/labels_id.dense', np.eye(5))
>>> loader.read_data_sets('/tmp/test_data').show() 
reading train...
reading dev...
reading test...
dev:
features:
     ones: (5, 2) <type 'numpy.ndarray'>
     id: (5, 5) <type 'numpy.ndarray'>
labels:
     id: (5, 5) <type 'numpy.ndarray'>
test:
features:
     ones: (5, 2) <type 'numpy.ndarray'>
     id: (5, 5) <type 'numpy.ndarray'>
labels:
     id: (5, 5) <type 'numpy.ndarray'>
train:
features:
     ones: (5, 2) <type 'numpy.ndarray'>
     id: (5, 5) <type 'numpy.ndarray'>
labels:
     id: (5, 5) <type 'numpy.ndarray'>
>>> loader.read_data_sets('/tmp/test_data',
...                       folders=['train', 'dev'],
...                       hashlist=['ones']).show() 
reading train...
reading dev...
dev:
features:
     ones: (5, 2) <type 'numpy.ndarray'>
labels:
train:
features:
     ones: (5, 2) <type 'numpy.ndarray'>
labels:
loader.save(filename, data)[source]

Calls export_data. Decides how to save data by file extension. Raises UnsupportedFormatError if extension is not one of the supported extensions (mat, sparse, binary, dense, index). Data contained in .mat files should be saved in a matrix named data.

Parameters:
  • filename – (str) A filename with an extension of an accepted format for representing a matrix.
  • data – A numpy array, scipy sparse matrix, or IndexVector object.

loader.tfidf(X, norm='l2')[source]
Parameters:
  • X – (numpy array or scipy.sparse.csr_matrix) A document-term matrix with term counts.
  • norm – Normalization strategy: l2row normalizes the scores of rows by the length of rows after basic tfidf (each document vector is a unit vector); count normalizes the scores of rows by the total word count of a document; max normalizes the scores of rows by the maximum count for a single word in a document.
Returns:

Returns tfidf of document-term matrix X with optional normalization.
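
Examples (an added usage sketch; exact scores depend on the idf weighting, so no output is asserted):
>>> import numpy as np
>>> from antk.core import loader
>>> counts = np.array([[2., 0., 1.], [0., 3., 0.]])  # term counts per document
>>> weighted = loader.tfidf(counts)  # reweight counts by tfidf with the default norm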

loader.toIndex(A)[source]
Parameters:

A – (numpy array or scipy.sparse.csr_matrix) A matrix of one hot row vectors.

Returns:

The hot indices.

Examples:
>>> import numpy as np
>>> from antk.core import loader
>>> x = np.array([[1,0,0], [0,0,1], [1,0,0]])
>>> loader.toIndex(x)
array([0, 2, 0])
loader.toOnehot(X, dim=None)[source]
Parameters:
  • X – (numpy array) Vector of indices or IndexVector object
  • dim – (int) Dimension of indexing
Returns:

A sparse csr_matrix of one hots.

Examples:
>>> import numpy as np
>>> from antk.core import loader
>>> x = np.array([0, 1, 2, 3])
>>> loader.toOnehot(x) 
<4x4 sparse matrix of type '<type 'numpy.float64'>'...
>>> loader.toOnehot(x).toarray()
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])
>>> x = loader.IndexVector(x, dimension=8)
>>> loader.toOnehot(x).toarray()
array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.]])
loader.unit_variance(X, axis=None)[source]
Parameters:
  • X – (numpy array or scipy.sparse.csr_matrix) A matrix to transform to have unit variance (over columns axis=0, over rows axis=1, over all entries axis=None)
  • axis – The axis along which to perform the transform.
Returns:

A matrix with unit variance along the specified axis.
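
Examples (an added sketch; by the documented contract the chosen axis ends up with unit variance):
>>> import numpy as np
>>> from antk.core import loader
>>> X = np.array([[1., 2.], [3., 6.]])
>>> Xu = loader.unit_variance(X, axis=0)  # each column of Xu now has unit variance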

loader.untar(fname)[source]

Untar and ungzip a file in the current directory.

Parameters:fname – (str) Name of the .tar.gz file.
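
A typical download-and-extract sequence pairing maybe_download with untar (the URL and filename are placeholders, not a real dataset):
>>> from antk.core import loader
>>> path = loader.maybe_download('data.tar.gz', '/tmp/mydata',
...                              'http://example.com/data.tar.gz')
>>> loader.untar(path)  # extracts into the current directory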