Coverage for /pythoncovmergedfiles/medio/medio/usr/local/lib/python3.8/site-packages/tensorflow/_api/v2/distribute/__init__.py: 100% (31 statements)
coverage.py v7.4.0, created at 2024-01-03 07:57 +0000
# This file is MACHINE GENERATED! Do not edit.
# Generated by: tensorflow/python/tools/api/generator/create_python_api.py script.
"""Library for running a computation across multiple devices.

The intent of this library is that you can write an algorithm in a stylized way
and it will be usable with a variety of different `tf.distribute.Strategy`
implementations. Each descendant will implement a different strategy for
distributing the algorithm across multiple devices/machines. Furthermore, these
changes can be hidden inside the specific layers and other library classes that
need special treatment to run in a distributed setting, so that most users'
model definition code can run unchanged. The `tf.distribute.Strategy` API works
the same way with eager and graph execution.

*Guides*

* [TensorFlow v2.x](https://www.tensorflow.org/guide/distributed_training)
* [TensorFlow v1.x](https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/distribute_strategy.ipynb)

*Tutorials*

* [Distributed Training Tutorials](https://www.tensorflow.org/tutorials/distribute/)

  The tutorials cover how to use `tf.distribute.Strategy` to do distributed
  training with native Keras APIs, custom training loops,
  and Estimator APIs. They also cover how to save/load models when using
  `tf.distribute.Strategy`.

*Glossary*

* _Data parallelism_ is where we run multiple copies of the model
  on different slices of the input data. This is in contrast to
  _model parallelism_ where we divide up a single copy of a model
  across multiple devices.
  Note: we only support data parallelism for now, but
  hope to add support for model parallelism in the future.
* A _device_ is a CPU or accelerator (e.g. GPUs, TPUs) on some machine that
  TensorFlow can run operations on (see e.g. `tf.device`). You may have multiple
  devices on a single machine, or be connected to devices on multiple
  machines. Devices used to run computations are called _worker devices_.
  Devices used to store variables are _parameter devices_. For some strategies,
  such as `tf.distribute.MirroredStrategy`, the worker and parameter devices
  will be the same (see mirrored variables below). For others they will be
  different. For example, `tf.distribute.experimental.CentralStorageStrategy`
  puts the variables on a single device (which may be a worker device or may be
  the CPU), and `tf.distribute.experimental.ParameterServerStrategy` puts the
  variables on separate machines called _parameter servers_ (see below).
* A _replica_ is one copy of the model, running on one slice of the
  input data. Right now each replica is executed on its own
  worker device, but once we add support for model parallelism
  a replica may span multiple worker devices.
* A _host_ is the CPU device on a machine with worker devices, typically
  used for running input pipelines.
* A _worker_ is defined to be the physical machine(s) containing the physical
  devices (e.g. GPUs, TPUs) on which the replicated computation is executed. A
  worker may contain one or more replicas, but contains at least one
  replica. Typically one worker will correspond to one machine, but in the case
  of very large models with model parallelism, one worker may span multiple
  machines. We typically run one input pipeline per worker, feeding all the
  replicas on that worker.
* _Synchronous_, or more commonly _sync_, training is where the updates from
  each replica are aggregated together before updating the model variables. This
  is in contrast to _asynchronous_, or _async_ training, where each replica
  updates the model variables independently. You may also have replicas
  partitioned into groups which are in sync within each group but async between
  groups.
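
  The sync/async distinction can be sketched in plain Python (a conceptual
  illustration with made-up numbers, not TensorFlow API):

```python
# Plain-Python sketch (not TensorFlow API) of sync vs. async updates.
replica_grads = {0: 1.0, 1: 3.0}  # gradient computed on each replica

# Sync training: aggregate gradients from all replicas first, then apply
# one combined update so every copy of the variable stays identical.
var_sync = 10.0
mean_grad = sum(replica_grads.values()) / len(replica_grads)
var_sync -= 0.1 * mean_grad  # single update built from all replicas

# Async training: each replica applies its own gradient to the shared
# variable independently, with no aggregation step.
var_async = 10.0
for grad in replica_grads.values():
    var_async -= 0.1 * grad
```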
* _Parameter servers_: These are machines that hold a single copy of
  parameters/variables, used by some strategies (right now just
  `tf.distribute.experimental.ParameterServerStrategy`). All replicas that want
  to operate on a variable retrieve it at the beginning of a step and send an
  update to be applied at the end of the step. These can in principle support
  either sync or async training, but right now we only have support for async
  training with parameter servers. Compare to
  `tf.distribute.experimental.CentralStorageStrategy`, which puts all variables
  on a single device on the same machine (and does sync training), and
  `tf.distribute.MirroredStrategy`, which mirrors variables to multiple devices
  (see below).
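
  The fetch-at-step-start / apply-at-step-end cycle above can be sketched in
  plain Python (a conceptual illustration, not the TensorFlow implementation):

```python
# Plain-Python sketch (not TensorFlow API) of the parameter-server step:
# replicas retrieve the variable at the start of a step and send back
# updates that the server applies at the end of the step.
server = {"w": 1.0}  # the single copy of the variable, held by the server

def run_step(replica_inputs, lr=0.1):
    w = server["w"]  # each replica retrieves the variable at step start
    # Each replica computes its own update from its slice of the data;
    # the "gradient" x * w here is purely illustrative.
    updates = [lr * x * w for x in replica_inputs]
    for u in updates:  # the server applies the updates at step end (async)
        server["w"] = server["w"] - u

run_step([2.0, 4.0])
```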

* _Replica context_ vs. _Cross-replica context_ vs. _Update context_

  A _replica context_ applies
  when you execute the computation function that was called with `strategy.run`.
  Conceptually, you're in replica context when executing the computation
  function that is being replicated.

  An _update context_ is entered in a `tf.distribute.StrategyExtended.update`
  call.

  A _cross-replica context_ is entered when you enter a `strategy.scope`. This
  is useful for calling `tf.distribute.Strategy` methods which operate across
  the replicas (like `reduce_to()`). By default you start in a _replica context_
  (the "default single _replica context_") and then some methods can switch you
  back and forth.

* _Distributed value_: A distributed value is represented by the base class
  `tf.distribute.DistributedValues`. `tf.distribute.DistributedValues` is useful
  to represent values on multiple devices, and it contains a map from replica id
  to values. Two representative types of `tf.distribute.DistributedValues`
  are `tf.types.experimental.PerReplica` and `tf.types.experimental.Mirrored`
  values.

  `PerReplica` values exist on the worker devices, with a different value for
  each replica. They are produced by iterating through a distributed dataset
  returned by `tf.distribute.Strategy.experimental_distribute_dataset` and
  `tf.distribute.Strategy.distribute_datasets_from_function`. They are also the
  typical result returned by `tf.distribute.Strategy.run`.

  `Mirrored` values are like `PerReplica` values, except we know that the value
  on all replicas is the same. `Mirrored` values are kept synchronized by the
  distribution strategy in use, while `PerReplica` values are left
  unsynchronized. `Mirrored` values typically represent model weights. We can
  safely read a `Mirrored` value in a cross-replica context by using the value
  on any replica, while `PerReplica` values can only be read within a replica
  context.
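
  Modeling both kinds as a plain map from replica id to value makes the
  difference concrete (a conceptual sketch, not the TensorFlow classes):

```python
# Plain-Python sketch (not TensorFlow API) of the two distributed-value
# kinds, modeled as maps from replica id to value.
per_replica = {0: [1, 2], 1: [3, 4]}  # a different value on each replica
mirrored = {0: 0.5, 1: 0.5}           # the same value on every replica

# A Mirrored value can be read safely from any replica's copy...
assert len(set(mirrored.values())) == 1
w = mirrored[0]

# ...while a PerReplica value only makes sense per replica: reading it
# "globally" would have to pick or combine the per-replica components.
batch_on_replica_1 = per_replica[1]
```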

* _Unwrapping_ and _merging_: Consider calling a function `fn` on multiple
  replicas, like `strategy.run(fn, args=[w])` with an
  argument `w` that is a `tf.distribute.DistributedValues`. This means `w` will
  have a map taking replica id `0` to `w0`, replica id `1` to `w1`, etc.
  `strategy.run()` unwraps `w` before calling `fn`, so it calls `fn(w0)` on
  device `d0`, `fn(w1)` on device `d1`, etc. It then merges the return
  values from `fn()`, which leads to one common object if the returned values
  are the same object from every replica, or a `DistributedValues` object
  otherwise.
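
  The unwrap/merge behavior can be sketched in plain Python (a conceptual
  illustration, not the TensorFlow implementation):

```python
# Plain-Python sketch (not TensorFlow API) of unwrapping and merging for
# strategy.run(fn, args=[w]).
w = {0: 2.0, 1: 3.0}  # a distributed value: replica id -> component

def fn(x):
    return x * x

# Unwrap: call fn once per replica, with that replica's component.
results = {rid: fn(component) for rid, component in w.items()}

# Merge: if every replica returned the same object, the result is one
# common value; otherwise it stays a distributed (per-replica) value.
values = list(results.values())
merged = values[0] if all(v is values[0] for v in values) else results
```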

* _Reductions_ and _all-reduce_: A _reduction_ is a method of aggregating
  multiple values into one value, like "sum" or "mean". If a strategy is doing
  sync training, we will perform a reduction on the gradients to a parameter
  from all replicas before applying the update. _All-reduce_ is an algorithm for
  performing a reduction on values from multiple devices and making the result
  available on all of those devices.
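
  The "reduce, then make the result available everywhere" behavior can be
  sketched in plain Python (not a real collective implementation):

```python
# Plain-Python sketch of all-reduce semantics: reduce values from all
# devices, then give every device the same reduced result.
def all_reduce(per_device, op="sum"):
    total = sum(per_device.values())
    reduced = total if op == "sum" else total / len(per_device)
    # Every device receives the same reduced result.
    return {device: reduced for device in per_device}

grads = {"gpu:0": 1.0, "gpu:1": 3.0}
summed = all_reduce(grads)
averaged = all_reduce(grads, op="mean")
```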

* _Mirrored variables_: These are variables that are created on multiple
  devices, where we keep the variables in sync by applying the same
  updates to every copy. Mirrored variables are created with
  `tf.Variable(...synchronization=tf.VariableSynchronization.ON_WRITE...)`.
  Normally they are only used in synchronous training.

* _SyncOnRead variables_

  _SyncOnRead variables_ are created by
  `tf.Variable(...synchronization=tf.VariableSynchronization.ON_READ...)`, and
  they are created on multiple devices. In replica context, each
  component variable on the local replica can perform reads and writes without
  synchronization with each other. When the
  _SyncOnRead variable_ is read in cross-replica context, the values from
  component variables are aggregated and returned.

  _SyncOnRead variables_ bring a lot of custom configuration difficulty to the
  underlying logic, so we do not encourage users to instantiate and use
  _SyncOnRead variables_ on their own. We have mainly used _SyncOnRead
  variables_ for use cases such as batch norm and metrics. For performance
  reasons, we often don't need to keep these statistics in sync every step and
  they can be accumulated on each replica independently. The only time we want
  to sync them is when reporting or checkpointing, which typically happens in
  cross-replica context. _SyncOnRead variables_ are also often used by advanced
  users who want to control when variable values are aggregated. For example,
  users sometimes want to maintain gradients independently on each replica for a
  couple of steps without aggregation.

* _Distribute-aware layers_

  Layers are generally called in a replica context, except when defining a
  Keras functional model. `tf.distribute.in_cross_replica_context` will let you
  determine which case you are in. If in a replica context,
  the `tf.distribute.get_replica_context` function will return the default
  replica context outside a strategy scope, `None` within a strategy scope, and
  a `tf.distribute.ReplicaContext` object inside a strategy scope and within a
  `tf.distribute.Strategy.run` function. The `ReplicaContext` object has an
  `all_reduce` method for aggregating across all replicas.

Note that we provide a default version of `tf.distribute.Strategy` that is
used when no other strategy is in scope; it provides the same API with
reasonable default behavior.

"""

import sys as _sys

from . import cluster_resolver
from . import coordinator
from . import experimental
from tensorflow.python.distribute.collective_all_reduce_strategy import CollectiveAllReduceStrategy as MultiWorkerMirroredStrategy
from tensorflow.python.distribute.cross_device_ops import CrossDeviceOps
from tensorflow.python.distribute.cross_device_ops import HierarchicalCopyAllReduce
from tensorflow.python.distribute.cross_device_ops import NcclAllReduce
from tensorflow.python.distribute.cross_device_ops import ReductionToOneDevice
from tensorflow.python.distribute.distribute_lib import InputContext
from tensorflow.python.distribute.distribute_lib import InputOptions
from tensorflow.python.distribute.distribute_lib import InputReplicationMode
from tensorflow.python.distribute.distribute_lib import ReplicaContext
from tensorflow.python.distribute.distribute_lib import RunOptions
from tensorflow.python.distribute.distribute_lib import Strategy
from tensorflow.python.distribute.distribute_lib import StrategyExtendedV2 as StrategyExtended
from tensorflow.python.distribute.distribute_lib import experimental_set_strategy
from tensorflow.python.distribute.distribute_lib import get_replica_context
from tensorflow.python.distribute.distribute_lib import get_strategy
from tensorflow.python.distribute.distribute_lib import has_strategy
from tensorflow.python.distribute.distribute_lib import in_cross_replica_context
from tensorflow.python.distribute.mirrored_strategy import MirroredStrategy
from tensorflow.python.distribute.one_device_strategy import OneDeviceStrategy
from tensorflow.python.distribute.parameter_server_strategy_v2 import ParameterServerStrategyV2 as ParameterServerStrategy
from tensorflow.python.distribute.reduce_util import ReduceOp
from tensorflow.python.distribute.tpu_strategy import TPUStrategyV2 as TPUStrategy
from tensorflow.python.training.server_lib import Server
from tensorflow.python.types.distribute import DistributedDatasetInterface as DistributedDataset
from tensorflow.python.types.distribute import DistributedIteratorInterface as DistributedIterator
from tensorflow.python.types.distribute import DistributedValues