Data pipelines¶

Table data types¶

class tflon.data.tables.BlockReduceTable(data)¶

Specifies blocks of a tensor for segment operations.

Input dataframes must contain a single column of list-type elements containing the size of each reduction block.
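As an illustration of what block sizes mean for a segment operation, the following is a minimal pure-Python sketch, not tflon's implementation. The function name `block_reduce_sum` and the choice of a sum reduction are assumptions made for this example.

```python
def block_reduce_sum(values, block_sizes):
    """Sum consecutive blocks of `values`; block_sizes[i] is the length of block i."""
    if sum(block_sizes) != len(values):
        raise ValueError("block sizes must cover every value exactly once")
    totals, start = [], 0
    for size in block_sizes:
        # Each block is a contiguous slice of the flat value list.
        totals.append(sum(values[start:start + size]))
        start += size
    return totals


# Hypothetical data: five per-atom values split into blocks of two and three.
atom_values = [1.0, 2.0, 3.0, 4.0, 5.0]
block_sizes = [2, 3]
block_totals = block_reduce_sum(atom_values, block_sizes)
```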

class tflon.data.tables.DenseNestedTable(data)¶

Table which converts dataframes containing list type elements into 2-d tensors by flattening lists within a column across rows.

All lists in a row must have the same length.

For example, for molecule atom data, the following form is used:

ID      col1               col2               col3
mol1    [atom1 atom2 …]    [atom1 atom2 …]    [atom1 atom2 …]
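The flattening described for DenseNestedTable can be sketched in plain Python. This is only an illustration under assumptions: rows are modelled as dictionaries of lists rather than a pandas DataFrame, and `flatten_nested_rows` is a hypothetical helper, not part of tflon.

```python
def flatten_nested_rows(rows, columns):
    """Turn rows of equal-length lists into one 2-D list with a row per nested element."""
    flat = []
    for row in rows:
        lengths = {len(row[c]) for c in columns}
        if len(lengths) != 1:
            raise ValueError("all lists in a row must have the same length")
        # zip pairs the i-th entry of every column, giving one output row per atom.
        flat.extend(list(values) for values in zip(*(row[c] for c in columns)))
    return flat


# Hypothetical data: the first molecule has two atoms, the second has one.
rows = [
    {"col1": [1, 2], "col2": [10, 20], "col3": [100, 200]},
    {"col1": [3], "col2": [30], "col3": [300]},
]
dense = flatten_nested_rows(rows, ["col1", "col2", "col3"])
```

The result has one row per atom and one column per input column, so the number of output rows equals the total number of atoms across all molecules.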

class tflon.data.tables.PaddedNestedTable(data, padded_width=None, dtype=<type 'numpy.float32'>)¶

Table which supports storing ragged arrays using a padding strategy. Converts to a sparse output.

Input dataframes must contain a single column of list-type entries.
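A padding strategy for ragged lists might look like the sketch below. It is an assumption-laden illustration rather than tflon's code: the helper `pad_rows`, the zero fill value, and the dense list-of-lists output are all choices made for this example (the table itself converts to a sparse output).

```python
def pad_rows(rows, padded_width=None, fill=0.0):
    """Pad each list to a common width; defaults to the longest row."""
    if padded_width is None:
        width = max((len(r) for r in rows), default=0)
    else:
        width = padded_width
    padded = []
    for r in rows:
        if len(r) > width:
            raise ValueError("row is longer than padded_width")
        # Append the fill value until the row reaches the target width.
        padded.append([float(v) for v in r] + [fill] * (width - len(r)))
    return padded


padded = pad_rows([[1, 2, 3], [4]])
wider = pad_rows([[1], [2, 3]], padded_width=4)
```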

class tflon.data.tables.RaggedTable(data, dtype=<type 'numpy.float32'>)¶

Table for managing ragged arrays without using a padding strategy. Values are converted to RaggedTensorValue for input via a RaggedTensor placeholder, which can be sliced by the ragged array index number to extract individual ragged elements.

Dataframes must contain a single column of list-type elements.

class tflon.data.tables.SparseIndexTable(data, code)¶

Used for indexing very large tensors, such as a corpus of words, or a list of diagnosis codes.

This can be used to gather slices from a very large embedding tensor.

Input dataframes must contain a single column of list-type entries containing integer indexes.

Converts to a sparse binary matrix.
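To make the sparse binary conversion concrete, here is a small sketch in coordinate form. The helper `to_sparse_binary` and its `num_codes` argument are hypothetical names for this example, and the (indices, values, dense_shape) triple is only assumed to resemble what is fed to the model.

```python
def to_sparse_binary(index_lists, num_codes):
    """Build coordinate-format (indices, values, dense_shape) from lists of integer codes."""
    indices = []
    for row, codes in enumerate(index_lists):
        # Duplicates collapse to a single 1 because the matrix is binary.
        for col in sorted(set(codes)):
            if not 0 <= col < num_codes:
                raise ValueError("index %d is outside [0, %d)" % (col, num_codes))
            indices.append((row, col))
    values = [1] * len(indices)
    dense_shape = (len(index_lists), num_codes)
    return indices, values, dense_shape


# Hypothetical data: three examples drawing on a vocabulary of four codes.
sp_indices, sp_values, sp_shape = to_sparse_binary([[0, 2], [1, 1], []], num_codes=4)
```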

class tflon.data.tables.Table(data)¶: The basic building block of the tflon data model. A Table is essentially a wrapper for a pandas DataFrame that specifies the transformation mapping the dataframe onto the input tensors of a model. Each example should have at most a single row per table, with a single index value. Multiple extensions to Table are available to handle specific data types. For example, node features of a graph can be handled by DenseNestedTable, which represents each graph as a single row with nested lists for node features.

Parquet helper functions¶

tflon.data.parquet.build_dataframe(columns, dtypes, index_col='ID', index_type=<type 'int'>)¶

Build a dataframe with pre-specified column types.

Parameters:	columns (list) – String names of columns
	dtypes – If a list or tuple, dtypes for columns in the same order. If a dict, dtypes for columns with column names as keys. Otherwise, dtypes is assumed to be a single type to assign to all columns
Keyword Arguments:	index_col (str) – The name of the index
	index_type (type) – The type of the index

tflon.data.parquet.read_parquet(source, columns=None, nthreads=4, use_pandas_metadata=True)¶

Read data from a parquet file

Parameters:	source (str) – A filepath to read parquet data
Keyword Arguments:	columns (None or list) – A list of columns to read. If None, read all columns (default=None)
	nthreads (int) – Number of threads for reading large parquet files (default=4)
	use_pandas_metadata (bool) – (default=True)
Returns:	pandas.DataFrame

tflon.data.parquet.write_parquet(df, destination)¶

Write data from a pandas.DataFrame to parquet

Parameters:	df (pandas.DataFrame) – A pandas dataframe containing pyarrow parquet-compatible data
	destination (str) – A filepath to write parquet data

Data feeds¶

class tflon.data.feeds.Fetchable¶

Interface for implementing data sources such as queues and loaders.

Fetchable.fetch returns a dictionary of data inputs.

class tflon.data.feeds.FetchableGroup(fetchables=[])¶: Interface for implementing grouped data sources, calls Fetchable.fetch on each member fetchable to construct a dictionary of outputs

class tflon.data.feeds.IndexQueue(IDlist, epoch_limit=9223372036854775807, shuffle=True)¶

IndexQueue provides a thread-safe, optionally eternal, optionally randomized iterator over IDs in a dataset.

The number of epochs can be tracked with the IndexQueue.epochs attribute
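The behaviour described for IndexQueue can be approximated with the standard library. The class below is a hedged sketch, not the tflon implementation: `ShuffledIndexQueue`, its `seed` argument, and the convention that `epochs` counts epochs started are all assumptions for illustration.

```python
import random
import threading


class ShuffledIndexQueue:
    """Thread-safe iterator over IDs, reshuffled at the start of every epoch."""

    def __init__(self, ids, epoch_limit=None, shuffle=True, seed=None):
        self._ids = list(ids)
        self._epoch_limit = epoch_limit  # None means iterate forever
        self._shuffle = shuffle
        self._rng = random.Random(seed)
        self._lock = threading.Lock()
        self._pending = []
        self.epochs = 0  # number of epochs started so far

    def __iter__(self):
        return self

    def __next__(self):
        # The lock keeps refill-and-pop atomic when several threads pull IDs.
        with self._lock:
            if not self._pending:
                limit_hit = self._epoch_limit is not None and self.epochs >= self._epoch_limit
                if not self._ids or limit_hit:
                    raise StopIteration
                # Store reversed so pop() from the end yields the original order.
                self._pending = self._ids[::-1]
                if self._shuffle:
                    self._rng.shuffle(self._pending)
                self.epochs += 1
            return self._pending.pop()


queue = ShuffledIndexQueue(["a", "b", "c"], epoch_limit=2, shuffle=False)
drawn = list(queue)
```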

class tflon.data.feeds.PersistentTensorManager(feed)¶: Deletes persistent tensors loaded into a session after they have expired.

class tflon.data.feeds.QueueCoordinator(model, source, queue, processes, timeout=1, limit=10)¶

Construct a coordinator that orchestrates subprocess featurizers

Parameters:

model (tflon.Model) – Source for featurize preprocessing
source (iterator) – Iterable which returns tensor dictionaries
queue (TensorQueue) – A tensorflow queue for loading session feed_dict
processes (int) – Number of processes
timeout (float) – Timeout for queue ops, in seconds (default=1)
limit (int) – Maximum number of queue elements per subprocess

class tflon.data.feeds.RaggedTensor(values, lengths, dense_shape)¶

Implementation of a ragged tensor input placeholder. This is an alternative to SparseTensor that uses a tensor of ragged lengths to slice variable-length vectors.

This is not currently designed to be used directly in any ops, but should be used by selecting vectors with RaggedTensor.slice

Parameters:	values (tensor-like) – Values of the sparse tensor of dimension V
	dense_shape (tensor-like) – The dense shape of the tensor
	ragged_shapes (tensor-like) – A tensor of ragged shapes with number of dimensions N-1

class tflon.data.feeds.RaggedTensorValue(values, lengths, dense_shape)¶

Value wrapper used to pass data to RaggedTensor model inputs.

Parameters:	values (np.array) – A 1-D numpy array containing the values of the ragged tensor
	lengths (np.array) – A 1-D array specifying the length of each vector in the ragged tensor
	dense_shape (tuple) – The dense shape of the ragged tensor, usually equal to (max(lengths), len(lengths))
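The relationship between values, lengths and dense_shape can be shown with a short sketch. It uses plain Python lists instead of numpy arrays to stay dependency-free, and the helpers `to_ragged` and `ragged_slice` are hypothetical names, not tflon functions.

```python
def to_ragged(rows):
    """Flatten ragged rows into (values, lengths, dense_shape)."""
    values = [v for row in rows for v in row]
    lengths = [len(row) for row in rows]
    # Follows the convention stated above: (max(lengths), len(lengths)).
    dense_shape = (max(lengths, default=0), len(lengths))
    return values, lengths, dense_shape


def ragged_slice(values, lengths, i):
    """Recover the i-th ragged element from the flat values."""
    start = sum(lengths[:i])  # offset of element i in the flat list
    return values[start:start + lengths[i]]


rg_values, rg_lengths, rg_shape = to_ragged([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]])
third = ragged_slice(rg_values, rg_lengths, 2)
```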

class tflon.data.feeds.SparseTensor(indices, values, dense_shape, infered_shape)¶

This is an extension to tf.SparseTensor which enables shape inference, e.g. for setting weight shapes in matmul ops.

Parameters:	indices (tensor-like) – 2-D tensor of int64 indices, locations of non-zero values
	values (tensor-like) – 1-D tensor of non-zero values
	dense_shape (tensor-like) – The final dense shape of the tensor at feed time. Differs from infered_shape
	infered_shape (tensor-like) – The inferred shape of the sparse tensor, returned by SparseTensor.get_shape()

get_shape()¶

Get the TensorShape representing the shape of the dense tensor.

Returns:	A TensorShape object.

class tflon.data.feeds.TableFeed(table_map, master_table=None, aligned=False)¶

The standard high level data preprocessor for tflon. Use this to define input data to Model.fit and Model.infer

Parameters:	table_map (dict) – Dictionary of str -> tflon.data.Table objects defining the raw data tables
Keyword Arguments:	master_table (str) – The key for the table defining the master index

align()¶

Align all the examples by the master table index.

Skip if already aligned.

batch(indices)¶

Get a subset of examples as a new TableFeed

Parameters:	indices (iterable) – The indices of examples to include in the batch

drop(drop_set, errors='ignore')¶

Drop a specified set of examples from this feed

Parameters:	drop_set (iterable) – Indices of examples to drop

epoch()¶: Get the current data epoch as reached by TableFeed.shuffle

holdout(holdout_set)¶

Return a specified holdout set of examples as a new TableFeed and drop those examples from this feed

Parameters:	holdout_set (iterable) – Indices of holdout examples

index¶: Get the master index of this table feed

iterate(batch_size)¶

Construct an in-order iterator over mini-batches of examples. Limited to one epoch.

Parameters:	batch_size (int) – The number of examples per batch
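In-order mini-batching over a single epoch can be sketched as a generator over the index. This is an illustration only: `iterate_batches` is a hypothetical helper working on a plain list, and yielding a shorter final batch is an assumption about edge-case behaviour.

```python
def iterate_batches(index, batch_size):
    """Yield consecutive slices of `index`, stopping after one pass."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(index), batch_size):
        # The last slice may be shorter when len(index) is not a multiple of batch_size.
        yield index[start:start + batch_size]


batches = list(iterate_batches(["m1", "m2", "m3", "m4", "m5"], batch_size=2))
```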

merge(*feeds)¶: Merge this table feed with other feeds

sample()¶: Return a function which can be used to sample batches of varying size from this feed

shuffle(batch_size)¶

Construct an eternal iterator over shuffled mini-batches of examples.

Parameters:	batch_size (int) – The number of examples per batch

xval(folds=None, groups=None, shuffle=True)¶

Generate datasets for cross-validation

Keyword Arguments:	folds (int) – The number of folds. If None, perform leave-one-out (default=None)
	groups (DataFrame) – A dataframe containing a single column specifying holdout groups for each fold (folds are numbered from 0 to n-1)
	shuffle (boolean) – Whether to shuffle the data or use the index ordering (default=True)

Returns: iterable: A generator yielding tuples of (holdout, remainder)
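The (holdout, remainder) pairs can be illustrated with a small fold generator over example IDs. This sketch makes several assumptions: it works on plain lists rather than TableFeed objects, ignores the groups option, and assigns folds round-robin. `kfold_splits` and `seed` are hypothetical names.

```python
import random


def kfold_splits(index, folds=None, shuffle=True, seed=None):
    """Yield (holdout, remainder) ID lists; folds=None gives leave-one-out."""
    ids = list(index)
    if shuffle:
        random.Random(seed).shuffle(ids)
    n_folds = len(ids) if folds is None else folds
    if not 1 <= n_folds <= len(ids):
        raise ValueError("folds must be between 1 and the number of examples")
    for k in range(n_folds):
        # Every n_folds-th example, starting at offset k, forms the holdout set.
        holdout = ids[k::n_folds]
        held = set(holdout)
        remainder = [i for i in ids if i not in held]
        yield holdout, remainder


splits = list(kfold_splits(range(6), folds=3, shuffle=False))
leave_one_out = list(kfold_splits(range(4), shuffle=False))
```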

class tflon.data.feeds.TensorLoader(tower, device=None, name='TensorLoader')¶

Loads data tensors into the session, avoiding costly data copying in feed_dict.

This class is used for training with global optimizers, or other data operations. TensorLoader is flexible, in that a subset of inputs to the model may be passed to TensorLoader.load

See TensorQueue and QueueCoordinator for training models with stochastic minibatch optimizers.

>>> TL = TensorLoader( tower, name="TensorLoader_%s" % (tower.name) )
>>> B = datapipeline.next()
>>> TL.load(B)
>>> S.run( tower.get_outputs(), feed_dict=TL.feed() )

fetch()¶: Get the last generated feed dictionary

load(tensor_map)¶

Loads a tensor dict into the tensorflow session and returns a feed_dict with tensor handles.

TensorLoader keeps a handle to loaded tensors to prevent garbage collection between session evaluations.

Parameters:	tensor_map (dict) – (name, tensor) pairs containing input/target names and compatible data (e.g. numpy arrays)
Returns:	A dictionary of (name, persistent tensor handle) key value pairs

class tflon.data.feeds.TensorQueue(tower, capacity=10, staging=False, timeout=None, device=None, name='TensorQueue')¶

class tflon.data.feeds.ThreadCoordinator(model, source, queue, timeout=1)¶

Construct a coordinator that performs featurization on a thread

Parameters:	model (tflon.Model) – Source for featurize preprocessing
	source (iterator) – Iterable which returns tensor dictionaries
	queue (TensorQueue) – A tensorflow queue for loading session feed_dict

tflon.data.feeds.convert_sparse_matrix_to_sparse_tensor(X)¶: Convert sparse scipy matrices to tf.SparseTensorValue. csr and csc matrices are converted to coo first.
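The csr-to-coo step mentioned above can be shown without scipy. The sketch below takes the three arrays that define a CSR matrix (data, column indices, row pointers) and expands them into explicit coordinates; `csr_to_coo` is a hypothetical helper written for this example, not the function tflon calls.

```python
def csr_to_coo(data, indices, indptr):
    """Expand CSR arrays into a list of (row, col) coordinates plus values."""
    coords, values = [], []
    for row in range(len(indptr) - 1):
        # indptr[row]:indptr[row + 1] delimits the stored entries of this row.
        for k in range(indptr[row], indptr[row + 1]):
            coords.append((row, indices[k]))
            values.append(data[k])
    return coords, values


# CSR encoding of the 2 x 3 matrix [[0, 5, 0], [7, 0, 9]].
coo_coords, coo_values = csr_to_coo(data=[5, 7, 9], indices=[1, 0, 2], indptr=[0, 1, 3])
```

The coordinate list and value list correspond to the indices and values fields of a sparse tensor value; the dense shape has to be carried along separately.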

tflon.data.feeds.convert_to_feedable(X)¶

Converts input to a type valid for input to the tensorflow.Session.run feed_dict argument

Parameters:	X – A feedable type (valid types: tflon.data.Table, numpy.ndarray, scipy.sparse.csr_matrix, scipy.sparse.coo.coo_matrix, tf.Tensor, tf.SparseTensorValue)
Returns:	np.array, tf.Tensor, or tf.SparseTensorValue
Raises:	`ValueError` – If X cannot be converted to a feedable type

Schema definitions¶

class tflon.data.schema.Schema(schema)¶

Wrapper for schema specifications returned by tflon.model.Model.schema

Parameters: schema (dict) – Dictionary of str -> (filenames, type[, reader]). Keys are string table names corresponding to model inputs; filenames is a string or list of strings naming the files to load from the shard directory; type is a class inheriting from Table; reader (optional) is a function with signature reader(filename) -> pandas.DataFrame (default = tflon.data.read_parquet).

load(*directories)¶

Load tables from a shard using this schema.

Parameters:	directories (str*) – Paths to the shard directories containing serialized tables
Keyword Arguments:	reader (callable) – A function used for loading serialized tables. Signature: reader(filename), default: tflon.data.read_parquet
Returns:	Map of name -> Table, the loaded table data
Return type:	dict

Output transformations¶

class tflon.data.output.CopyIndex(template_table, columns=None)¶

Add an index to a 2-D tensor

Parameters:	template_table (str) – The name of the input table used to set the indexes

class tflon.data.output.DenseToNested(template_table, columns=None)¶

Convert a 2-D tensor to a nested tensor. The number of entries in each nested list is determined by a template table.

Parameters:	template_table (str) – The name of the input table used to set nest sizes and indexes
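Conceptually, this is the inverse of flattening nested lists into a 2-D tensor. A minimal sketch follows, assuming the nest sizes have already been read from the template table; `dense_to_nested` is a hypothetical helper, not the tflon transform itself.

```python
def dense_to_nested(rows, nest_sizes):
    """Regroup the rows of a 2-D list into nested lists of the given sizes."""
    if sum(nest_sizes) != len(rows):
        raise ValueError("nest sizes must account for every row")
    nested, start = [], 0
    for size in nest_sizes:
        # Consecutive rows belong to the same example.
        nested.append(rows[start:start + size])
        start += size
    return nested


# Hypothetical output: three per-atom predictions for molecules of two and one atoms.
predictions = [[0.1], [0.2], [0.9]]
nested = dense_to_nested(predictions, nest_sizes=[2, 1])
```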

tflon.data.output.IdentityTransform¶: alias of tflon.data.output.CopyIndex

class tflon.data.output.TensorTransform¶: Interface class for output tensor transformations.