Distributed API
- class tflon.distributed.distributed.DistributedTable(table)
  Table wrapper implementing MPI Reduce ops for table data.
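The Reduce semantics can be illustrated without MPI: conceptually, a Reduce over table data combines each rank's copy of the table elementwise with some operation (summation here). The sketch below is purely illustrative; the dict-of-lists table layout and the `reduce_tables` helper are assumptions for this example, not part of tflon.

```python
def reduce_tables(tables, op=sum):
    """Elementwise Reduce across per-rank copies of a table.

    `tables` is a list (one entry per rank) of dicts mapping
    column name -> list of numeric values. Hypothetical helper,
    shown only to illustrate what an MPI Reduce over table data
    computes.
    """
    columns = tables[0].keys()
    reduced = {}
    for col in columns:
        # zip aligns row i of every rank's copy; op combines them
        reduced[col] = [op(vals) for vals in zip(*(t[col] for t in tables))]
    return reduced

# Two "ranks" holding the same table layout with different values
rank0 = {'loss': [1.0, 2.0]}
rank1 = {'loss': [3.0, 4.0]}
print(reduce_tables([rank0, rank1]))  # {'loss': [4.0, 6.0]}
```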
- class tflon.distributed.distributed.DistributedTrainer(optimizer, iterations, **kwargs)
  Trainer that wraps a TensorFlow optimizer in a Horovod DistributedOptimizer and handles broadcasting of initialized model state.
- tflon.distributed.distributed.get_rank()
  Returns the rank (the index of the current distributed process).
- tflon.distributed.distributed.init_distributed_resources()
  Initialize MPI and Horovod for distributed training. This should be called before any tflon or TensorFlow calls.
- tflon.distributed.distributed.is_master()
  Check whether the current process is the rank-0 MPI process.
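A common pattern is to gate logging and checkpointing on the master process. The standalone sketch below mimics the documented semantics (`is_master()` is true exactly when the rank is 0) without requiring Horovod; reading the Open MPI `OMPI_COMM_WORLD_RANK` environment variable, with a single-process fallback of 0, is an assumption for illustration only and is not how tflon obtains the rank.

```python
import os

def get_rank():
    # Stand-in for tflon.distributed.distributed.get_rank(); the real
    # function queries the initialized Horovod context. Here we read
    # the Open MPI rank environment variable, defaulting to 0 for a
    # single-process run (an assumption for this sketch).
    return int(os.environ.get('OMPI_COMM_WORLD_RANK', 0))

def is_master():
    # Rank 0 is the master process, matching the documented semantics.
    return get_rank() == 0

if is_master():
    # Only one process should write logs or checkpoints.
    print("master process: handle logging/checkpointing here")
```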
- tflon.distributed.distributed.make_distributed_table_feed(root, schema, master_table=None, partition_strategy='mod')
  Load data shards for distributed training.

  Parameters:
  - root (str) – The root directory containing shards. Each shard should be a subdirectory of this directory.
  - schema (tflon.data.Schema) – Schema mapping files to named tables, as returned by tflon.model.Model.schema

  Keyword Arguments:
  - partition_strategy (str) – Strategy for dividing shards among processes. Supported values:
    - 'mod': for process rank r out of n processes, assign every n-th shard starting at shard r (shards r, r+n, r+2n, …)
    - 'all': replicate all shards on every process
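The two partition strategies can be sketched in plain Python. The `partition_shards` helper below is an illustrative reimplementation of the documented behavior, not tflon's actual code: under 'mod', rank r with n total processes takes shards r, r+n, r+2n, …; under 'all', every process sees every shard.

```python
def partition_shards(shards, rank, size, strategy='mod'):
    """Assign a subset of `shards` to one process.

    Illustrative reimplementation of the documented partition
    strategies (hypothetical helper, not part of tflon).
    """
    if strategy == 'mod':
        # Rank r takes shards r, r+size, r+2*size, ...
        return shards[rank::size]
    elif strategy == 'all':
        # Replicate every shard on every process
        return list(shards)
    raise ValueError("unknown partition_strategy: %s" % strategy)

shards = ['shard%d' % i for i in range(7)]
print(partition_shards(shards, rank=1, size=3))  # ['shard1', 'shard4']
```

With 7 shards and 3 processes, 'mod' gives ranks 0, 1, and 2 three, two, and two shards respectively, so the division is as even as the counts allow.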