Data pipelines

Table data types

class tflon.data.tables.BlockReduceTable(data)

Specifies blocks of a tensor for segment operations.

Input dataframes must contain a single column of list-type elements containing the size of each reduction block.
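To illustrate what block sizes mean for a segment operation, the sketch below expands a list of block sizes into segment ids and sums each block in plain Python. The helpers `block_sizes_to_segment_ids` and `segment_sum` are hypothetical names and not part of tflon. How BlockReduceTable encodes blocks internally is not documented here.

```python
def block_sizes_to_segment_ids(block_sizes):
    """Expand block sizes, e.g. [2, 3], into segment ids [0, 0, 1, 1, 1]."""
    segment_ids = []
    for block_id, size in enumerate(block_sizes):
        segment_ids.extend([block_id] * size)
    return segment_ids


def segment_sum(values, segment_ids):
    """Sum the values that share a segment id (hypothetical helper)."""
    n_segments = max(segment_ids) + 1 if segment_ids else 0
    totals = [0] * n_segments
    for value, seg in zip(values, segment_ids):
        totals[seg] += value
    return totals


block_sizes = [2, 3]                      # one example with two reduction blocks
values = [1.0, 2.0, 10.0, 20.0, 30.0]     # one value per element across both blocks
segment_ids = block_sizes_to_segment_ids(block_sizes)
block_totals = segment_sum(values, segment_ids)
```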

class tflon.data.tables.DenseNestedTable(data)

Table which converts dataframes containing list-type elements into 2-D tensors by flattening lists within a column across rows.

All lists in a row must have the same length.

For example, for molecule atom data, the following form is used:

ID    col1               col2               col3
mol1  [atom1 atom2 …]    [atom1 atom2 …]    [atom1 atom2 …]
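A plain-Python sketch of the flattening idea, assuming that "flattening across rows" means concatenating the per-atom records of every molecule into one 2-D array with one column per dataframe column. The function `flatten_nested_rows` is a hypothetical helper, not tflon's implementation.

```python
def flatten_nested_rows(rows):
    """Flatten rows of equal-length lists into a 2-D list.

    `rows` is a list of tuples; each tuple holds one list per column.
    The result has one row per nested element (e.g. per atom).
    """
    flat = []
    for row in rows:
        lengths = {len(col) for col in row}
        if len(lengths) != 1:
            raise ValueError("all lists in a row must have the same length")
        # zip(*row) regroups the per-column lists into per-element records
        flat.extend(list(record) for record in zip(*row))
    return flat


# Two molecules with 2 and 3 atoms, three feature columns each
mol1 = ([1, 2], [0.1, 0.2], [10, 20])
mol2 = ([3, 4, 5], [0.3, 0.4, 0.5], [30, 40, 50])
dense = flatten_nested_rows([mol1, mol2])
```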

class tflon.data.tables.PaddedNestedTable(data, padded_width=None, dtype=<type 'numpy.float32'>)

Table which supports storing ragged arrays using a padding strategy. Converts to a sparse output.

Input dataframes must contain a single column of list-type entries.
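The padding strategy can be pictured with the following sketch. It shows only the padding step, not the conversion to a sparse output. The helper name `pad_ragged`, the fill value of 0.0, and the choice to use the longest row when `padded_width` is None are all assumptions for illustration.

```python
def pad_ragged(rows, padded_width=None, fill=0.0):
    """Right-pad each list to a common width (hypothetical helper)."""
    # Assumption: with no explicit width, pad to the longest row
    width = padded_width if padded_width is not None else max(len(r) for r in rows)
    padded = []
    for r in rows:
        if len(r) > width:
            raise ValueError("row is longer than padded_width")
        padded.append(list(r) + [fill] * (width - len(r)))
    return padded


ragged = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
padded = pad_ragged(ragged)                      # width taken from the longest row
padded_wide = pad_ragged(ragged, padded_width=4)  # explicit width
```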

class tflon.data.tables.RaggedTable(data, dtype=<type 'numpy.float32'>)

Table for managing ragged arrays without using a padding strategy. Values are converted to RaggedTensorValue for input via a RaggedTensor placeholder, which can be sliced by the ragged array index number to extract individual ragged elements.

Dataframes must contain a single column of list-type elements.

class tflon.data.tables.SparseIndexTable(data, code)

Used for indexing very large tensors, such as a corpus of words, or a list of diagnosis codes.

This can be used to gather slices from a very large embedding tensor.

Input dataframes must contain a single column of list-type entries containing integer indexes.

Converts to a sparse binary matrix.
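A sketch of how lists of integer indexes map onto a sparse binary matrix in coordinate form. The helper `indexes_to_sparse_binary` and its `code_size` argument (the number of columns) are hypothetical; the meaning of the `code` constructor argument is not documented here. Collapsing repeated indexes in a row to a single 1 is an assumption.

```python
def indexes_to_sparse_binary(index_lists, code_size):
    """Return (indices, values, dense_shape) for a binary matrix in COO form."""
    indices = []
    for row, codes in enumerate(index_lists):
        for col in sorted(set(codes)):       # duplicates collapse to one entry
            if not 0 <= col < code_size:
                raise ValueError("index %d out of range" % col)
            indices.append((row, col))
    values = [1] * len(indices)              # every stored entry is a 1
    dense_shape = (len(index_lists), code_size)
    return indices, values, dense_shape


index_lists = [[0, 3], [2], [1, 3, 3]]       # one list of codes per example
indices, values, dense_shape = indexes_to_sparse_binary(index_lists, code_size=5)
```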

class tflon.data.tables.Table(data)

The basic building block of the tflon data model. Tables are essentially a wrapper for pandas DataFrames that specify the transformation mapping the dataframe onto input tensors of a model. Examples should have at most a single row per table, with a single index value. Multiple extensions to Table are available to handle specific data types. For example, node features of a graph can be handled by DenseNestedTable, which represents each graph as a single row with nested lists for node features.

Parquet helper functions

tflon.data.parquet.build_dataframe(columns, dtypes, index_col='ID', index_type=<type 'int'>)

Build a dataframe with pre-specified column types.

Parameters:
  • columns (list) – String names of columns
  • dtypes – If list or tuple, dtypes for columns in the same order. If dict, dtypes for columns with column names as keys. Otherwise, dtypes is assumed to be a single type to assign to all columns.
Keyword Arguments:
 
  • index_col (str) – The name of the index
  • index_type (type) – The type of the index
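The three accepted forms of `dtypes` can be summarized by the sketch below, which resolves each form to a column-to-type mapping. `normalize_dtypes` is a hypothetical helper written for illustration; it is not a tflon function, and the error handling is an assumption.

```python
def normalize_dtypes(columns, dtypes):
    """Resolve list/tuple, dict, or single-type dtypes into a dict by column name."""
    if isinstance(dtypes, dict):
        # dict form: keyed by column name
        return {name: dtypes[name] for name in columns}
    if isinstance(dtypes, (list, tuple)):
        # sequence form: same ordering as columns
        if len(dtypes) != len(columns):
            raise ValueError("dtypes and columns must have the same length")
        return dict(zip(columns, dtypes))
    # anything else: one type applied to every column
    return {name: dtypes for name in columns}


columns = ["age", "weight"]
by_list = normalize_dtypes(columns, [int, float])
by_dict = normalize_dtypes(columns, {"weight": float, "age": int})
by_type = normalize_dtypes(columns, float)
```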
tflon.data.parquet.read_parquet(source, columns=None, nthreads=4, use_pandas_metadata=True)

Read data from a parquet file

Parameters:

source (str) – A filepath to read parquet data

Keyword Arguments:
 
  • columns (None or list) – A list of columns to read. If None, read all columns (default=None)
  • nthreads (int) – Number of threads for reading large parquet files (default=4)
  • use_pandas_metadata (bool) – (default=True)
Returns:

pandas.DataFrame

tflon.data.parquet.write_parquet(df, destination)

Write data from a pandas.DataFrame to a parquet file

Parameters:
  • df (pandas.DataFrame) – A pandas dataframe containing pyarrow parquet-compatible data
  • destination (str) – A filepath to write parquet data

Data feeds

class tflon.data.feeds.Fetchable

Interface for implementing data sources such as queues and loaders.

Fetchable.fetch returns a dictionary of data inputs.

class tflon.data.feeds.FetchableGroup(fetchables=[])

Interface for implementing grouped data sources. Calls Fetchable.fetch on each member fetchable to construct a dictionary of outputs.
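The grouping behaviour can be sketched as merging the dictionaries returned by each member. `ConstantFetchable` and `fetch_group` are hypothetical stand-ins that do not use tflon. How tflon resolves duplicate keys is not documented here; in this sketch later members overwrite earlier ones.

```python
class ConstantFetchable(object):
    """Toy data source whose fetch() always returns the same dictionary."""

    def __init__(self, data):
        self._data = dict(data)

    def fetch(self):
        return dict(self._data)


def fetch_group(fetchables):
    """Call fetch() on each member and merge the results into one dict."""
    merged = {}
    for member in fetchables:
        merged.update(member.fetch())    # assumption: later keys win on collision
    return merged


inputs = ConstantFetchable({"features": [1, 2, 3]})
targets = ConstantFetchable({"labels": [0, 1, 0]})
feed = fetch_group([inputs, targets])
```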

class tflon.data.feeds.IndexQueue(IDlist, epoch_limit=9223372036854775807, shuffle=True)

IndexQueue provides a thread-safe, optionally eternal, optionally randomized iterator over IDs in a dataset.

The number of epochs can be tracked with the IndexQueue.epochs attribute
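A minimal sketch of a thread-safe, optionally shuffled, epoch-counting iterator in the spirit of this description. `SimpleIndexQueue` is a hypothetical class for illustration, not tflon's implementation. The `seed` argument and the rule that `epochs` counts completed passes are assumptions.

```python
import collections
import random
import threading


class SimpleIndexQueue(object):
    """Thread-safe, optionally shuffled iterator over IDs (illustrative only)."""

    def __init__(self, ids, epoch_limit=None, shuffle=True, seed=None):
        self._ids = list(ids)
        self._epoch_limit = epoch_limit      # None means iterate forever
        self._shuffle = shuffle
        self._rng = random.Random(seed)
        self._lock = threading.Lock()
        self._pending = collections.deque()
        self.epochs = 0                      # completed passes over the IDs

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:                     # only one thread refills or pops at a time
            if not self._pending:
                out_of_epochs = (self._epoch_limit is not None
                                 and self.epochs >= self._epoch_limit)
                if not self._ids or out_of_epochs:
                    raise StopIteration
                order = list(self._ids)
                if self._shuffle:
                    self._rng.shuffle(order)
                self._pending.extend(order)
            value = self._pending.popleft()
            if not self._pending:
                self.epochs += 1             # the last ID of this pass was handed out
            return value

    next = __next__                          # Python 2 spelling


queue = SimpleIndexQueue(range(4), epoch_limit=2, seed=0)
drawn = list(queue)                          # two shuffled passes over 0..3
```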

class tflon.data.feeds.PersistentTensorManager(feed)

Deletes persistent tensors loaded into a session after they have expired.

class tflon.data.feeds.QueueCoordinator(model, source, queue, processes, timeout=1, limit=10)

Construct a coordinator that orchestrates subprocess featurizers

Parameters:
  • model (tflon.Model) – Source for featurize preprocessing
  • source (iterator) – Iterable which returns tensor dictionaries
  • queue (TensorQueue) – A tensorflow queue for loading session feed_dict
  • processes (int) – Number of processes
  • timeout (float) – Timeout for queue ops, in seconds (default=1)
  • limit (int) – Maximum number of queue elements per subprocess
class tflon.data.feeds.RaggedTensor(values, lengths, dense_shape)

Implementation of a ragged tensor input placeholder. This is an alternative to SparseTensor that stores a tensor of ragged lengths, which is used for slicing out variable-length vectors.

This is not currently designed to be used directly in any ops, but should be used by selecting vectors with RaggedTensor.slice

Parameters:
  • values (tensor-like) – Values of the sparse tensor of dimension V
  • dense_shape (tensor-like) – The dense shape of the tensor
  • ragged_shapes (tensor-like) – A tensor of ragged shapes with number of dimensions N-1
class tflon.data.feeds.RaggedTensorValue(values, lengths, dense_shape)

Value wrapper used to pass data to RaggedTensor model inputs.

Parameters:
  • values (np.array) – A 1-D numpy array containing the values of the ragged tensor
  • lengths (np.array) – A 1-D array specifying the length of each vector in the ragged tensor
  • dense_shape (tuple) – The dense shape of the ragged tensor, usually equal to (max(lengths), len(lengths))
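The values/lengths layout described above can be sketched in plain Python as follows. `to_ragged` and `ragged_slice` are hypothetical helpers that use lists instead of numpy arrays; the `dense_shape` convention follows the parameter description above.

```python
def to_ragged(rows):
    """Pack variable-length rows into (values, lengths, dense_shape)."""
    values = [v for row in rows for v in row]      # all rows concatenated, 1-D
    lengths = [len(row) for row in rows]           # one length per ragged vector
    dense_shape = (max(lengths) if lengths else 0, len(lengths))
    return values, lengths, dense_shape


def ragged_slice(values, lengths, i):
    """Extract the i-th ragged vector from the packed representation."""
    start = sum(lengths[:i])                       # offset of vector i in values
    return values[start:start + lengths[i]]


rows = [[1, 2, 3], [4], [5, 6]]
values, lengths, dense_shape = to_ragged(rows)
third = ragged_slice(values, lengths, 2)
```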
class tflon.data.feeds.SparseTensor(indices, values, dense_shape, infered_shape)

This is an extension to tf.SparseTensor which enables shape inference, e.g., for setting weight shapes in matmul ops.

Parameters:
  • indices (tensor-like) – 2-D tensor of int64 indices, locations of non-zero values
  • values (tensor-like) – 1-D tensor of non-zero values
  • dense_shape (tensor-like) – The final dense shape of the tensor at feed time. Differs from infered_shape
  • infered_shape (tensor-like) – The inferred shape of the sparse tensor, returned by SparseTensor.get_shape()
get_shape()

Get the TensorShape representing the shape of the dense tensor.

Returns: A TensorShape object.
class tflon.data.feeds.TableFeed(table_map, master_table=None, aligned=False)

The standard high level data preprocessor for tflon. Use this to define input data to Model.fit and Model.infer

Parameters: table_map (dict) – Dictionary of str -> tflon.data.Table objects defining the raw data tables
Keyword Arguments:
 master_table (str) – The key for the table defining the master index
align()

Align all the examples by the master table index.

Skip if already aligned.

batch(indices)

Get a subset of examples as a new TableFeed

Parameters: indices (iterable) – The indices of examples to include in the batch
drop(drop_set, errors='ignore')

Drop a specified set of examples from this feed

Parameters: drop_set (iterable) – Indices of examples to drop
epoch()

Get the current data epoch as reached by TableFeed.shuffle

holdout(holdout_set)

Return a specified holdout set of examples as a new TableFeed and drop those examples from this feed

Parameters: holdout_set (iterable) – Indices of holdout examples
index

Get the master index of this table feed

iterate(batch_size)

Construct an in-order iterator over mini-batches of examples. Limited to one epoch.

Parameters: batch_size (int) – The number of examples per batch
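In-order, single-epoch batching over an index can be sketched as below. `iterate_batches` is a hypothetical helper operating on a plain list of IDs rather than a TableFeed. Whether tflon yields a final, smaller batch when the index does not divide evenly is not documented here; this sketch yields it.

```python
def iterate_batches(index, batch_size):
    """Yield consecutive slices of `index`, each with at most batch_size IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(index), batch_size):
        yield index[start:start + batch_size]


index = ["a", "b", "c", "d", "e"]
batches = list(iterate_batches(index, 2))   # one pass, original order preserved
```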
merge(*feeds)

Merge this table feed with other feeds

sample()

Return a function which can be used to sample batches of varying size from this feed

shuffle(batch_size)

Construct an eternal iterator over shuffled mini-batches of examples.

Parameters: batch_size (int) – The number of examples per batch
xval(folds=None, groups=None, shuffle=True)

Generate datasets for cross-validation (xval)

Keyword Arguments:
 
  • folds (int) – The number of folds, if None, then perform leave-one-out (default=None)
  • groups (DataFrame) – A dataframe containing a single column specifying holdout groups for each fold (folds are numbered from 0 to n-1)
  • shuffle (boolean) – Whether to shuffle the data or use the index ordering (default=True)
Returns
iterable: A generator yielding tuples of (holdout, remainder)
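The fold logic can be sketched on a plain list of IDs. `kfold_indices` is a hypothetical helper: it yields lists of IDs rather than TableFeed objects, assumes unique hashable IDs, adds a `seed` argument for reproducibility, and does not cover the `groups` option. The assignment of IDs to folds by striding is one possible choice, not necessarily tflon's.

```python
import random


def kfold_indices(index, folds=None, shuffle=True, seed=None):
    """Yield (holdout, remainder) ID lists; folds=None gives leave-one-out."""
    ids = list(index)
    if shuffle:
        random.Random(seed).shuffle(ids)
    n_folds = len(ids) if folds is None else folds
    for k in range(n_folds):
        holdout = ids[k::n_folds]            # every n_folds-th ID, starting at k
        held = set(holdout)
        remainder = [i for i in ids if i not in held]
        yield holdout, remainder


splits = list(kfold_indices(range(6), folds=3, shuffle=False))
leave_one_out = list(kfold_indices(range(4), shuffle=False))
```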
class tflon.data.feeds.TensorLoader(tower, device=None, name='TensorLoader')

Loads data tensors into the session, avoiding costly data copying in feed_dict.

This class is used for training with global optimizers, or other data operations. TensorLoader is flexible, in that a subset of inputs to the model may be passed to TensorLoader.load

See TensorQueue and QueueCoordinator for training models with stochastic minibatch optimizers.

>>> TL = TensorLoader( tower, name="TensorLoader_%s" % (tower.name) )
>>> B = next(datapipeline)  # one batch of (name, data) pairs
>>> TL.load(B)
>>> S.run( tower.get_outputs(), feed_dict=TL.feed() )
fetch()

Get the last generated feed dictionary

load(tensor_map)

Loads a tensor dict into the tensorflow session and returns a feed_dict with tensor handles.

TensorLoader keeps a handle to loaded tensors to prevent garbage collection between session evaluations.

Parameters: tensor_map (dict) – (name, tensor) pairs containing input/target names and compatible data (e.g., numpy arrays)
Returns: A dictionary of (name, persistent tensor handle) key-value pairs
class tflon.data.feeds.TensorQueue(tower, capacity=10, staging=False, timeout=None, device=None, name='TensorQueue')
class tflon.data.feeds.ThreadCoordinator(model, source, queue, timeout=1)

Construct a coordinator that performs featurization on a thread

Parameters:
  • model (tflon.Model) – Source for featurize preprocessing
  • source (iterator) – Iterable which returns tensor dictionaries
  • queue (TensorQueue) – A tensorflow queue for loading session feed_dict
tflon.data.feeds.convert_sparse_matrix_to_sparse_tensor(X)

Convert sparse scipy matrices to tf.SparseTensorValue. csr and csc matrices are converted to coo first.
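The CSR-to-COO step mentioned above can be sketched without scipy. `csr_to_coo` is a hypothetical helper that takes the three CSR arrays as plain lists and returns the (indices, values, dense_shape) triple, which matches the fields of tf.SparseTensorValue. It illustrates the conversion only and is not the tflon function.

```python
def csr_to_coo(indptr, col_indices, data, shape):
    """Convert CSR arrays to COO-style (indices, values, dense_shape)."""
    indices, values = [], []
    for row in range(shape[0]):
        # indptr[row]:indptr[row + 1] delimits the stored entries of this row
        for k in range(indptr[row], indptr[row + 1]):
            indices.append((row, col_indices[k]))
            values.append(data[k])
    return indices, values, tuple(shape)


# CSR encoding of the matrix [[0, 5, 0],
#                             [7, 0, 9]]
indptr = [0, 1, 3]
col_indices = [1, 0, 2]
data = [5, 7, 9]
coo_indices, coo_values, coo_shape = csr_to_coo(indptr, col_indices, data, (2, 3))
```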

tflon.data.feeds.convert_to_feedable(X)

Converts input to a type valid for input to the tensorflow.Session.run feed_dict argument

Parameters: X – A feedable type (valid types: tflon.data.Table, numpy.ndarray, scipy.sparse.csr_matrix, scipy.sparse.coo.coo_matrix, tf.Tensor, tf.SparseTensorValue)
Returns: np.array, tf.Tensor, or tf.SparseTensorValue
Raises: ValueError – If X cannot be converted to a feedable type

Schema definitions

class tflon.data.schema.Schema(schema)

Wrapper for schema specifications returned by tflon.model.Model.schema

Parameters: schema (dict) – dictionary of str -> (filenames, type[, reader]). Keys are string table names corresponding to model inputs, filenames is a string or list of strings specifying the names of files to be loaded from the shard directory, type is a class inheriting from Table, reader (optional) is a function with signature reader(filename) -> pandas.DataFrame (default = tflon.data.read_parquet).
load(*directories)

Load tables from a shard using this schema.

Parameters: *directories (str) – paths to the shard directories containing serialized tables
Keyword Arguments:
 reader (callable) – A function used for loading serialized tables. Signature: reader(filename), default: tflon.data.read_parquet
Returns: Map of name -> Table, the loaded table data
Return type: dict

Output transformations

class tflon.data.output.CopyIndex(template_table, columns=None)

Add an index to a 2-D tensor

Parameters: template_table (str) – The name of the input table used to set the indexes
class tflon.data.output.DenseToNested(template_table, columns=None)

Convert a 2-D tensor to a nested tensor. The number of entries in each nested list is determined by a template table.

Parameters: template_table (str) – The name of the input table used to set nest sizes and indexes
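The nesting step can be sketched as splitting the rows of a 2-D array into consecutive groups. `dense_to_nested` is a hypothetical helper; here `nest_sizes` stands in for the sizes that would be read from the template table, and plain lists stand in for tensors.

```python
def dense_to_nested(rows, nest_sizes):
    """Split a 2-D list into consecutive groups of rows, one group per example."""
    if sum(nest_sizes) != len(rows):
        raise ValueError("nest sizes must add up to the number of rows")
    nested, start = [], 0
    for size in nest_sizes:
        nested.append(rows[start:start + size])
        start += size
    return nested


# Five per-atom output rows belonging to molecules with 2 and 3 atoms
outputs = [[0.1], [0.2], [0.3], [0.4], [0.5]]
nested = dense_to_nested(outputs, nest_sizes=[2, 3])
```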
tflon.data.output.IdentityTransform

alias of tflon.data.output.CopyIndex

class tflon.data.output.TensorTransform

Interface class for output tensor transformations.