Coverage for /pythoncovmergedfiles/medio/medio/usr/local/lib/python3.8/site-packages/tensorflow/_api/v2/distribute/__init__.py: 100% (31 statements)
coverage.py v7.4.0, created at 2024-01-03 07:57 +0000
# This file is MACHINE GENERATED! Do not edit.
# Generated by: tensorflow/python/tools/api/generator/create_python_api.py script.
"""Library for running a computation across multiple devices.

The intent of this library is that you can write an algorithm in a stylized way
and it will be usable with a variety of different `tf.distribute.Strategy`
implementations. Each descendant will implement a different strategy for
distributing the algorithm across multiple devices/machines. Furthermore, these
changes can be hidden inside the specific layers and other library classes that
need special treatment to run in a distributed setting, so that most users'
model definition code can run unchanged. The `tf.distribute.Strategy` API works
the same way with eager and graph execution.

*Guides*

* [TensorFlow v2.x](https://www.tensorflow.org/guide/distributed_training)
* [TensorFlow v1.x](https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/distribute_strategy.ipynb)

*Tutorials*

* [Distributed Training Tutorials](https://www.tensorflow.org/tutorials/distribute/)

  The tutorials cover how to use `tf.distribute.Strategy` to do distributed
  training with native Keras APIs, custom training loops,
  and Estimator APIs. They also cover how to save/load models when using
  `tf.distribute.Strategy`.

*Glossary*

* _Data parallelism_ is where we run multiple copies of the model
  on different slices of the input data. This is in contrast to
  _model parallelism_ where we divide up a single copy of a model
  across multiple devices.
  Note: we only support data parallelism for now, but
  hope to add support for model parallelism in the future.
* A _device_ is a CPU or accelerator (e.g. GPUs, TPUs) on some machine that
  TensorFlow can run operations on (see e.g. `tf.device`). You may have multiple
  devices on a single machine, or be connected to devices on multiple
  machines. Devices used to run computations are called _worker devices_.
  Devices used to store variables are _parameter devices_. For some strategies,
  such as `tf.distribute.MirroredStrategy`, the worker and parameter devices
  will be the same (see mirrored variables below). For others they will be
  different. For example, `tf.distribute.experimental.CentralStorageStrategy`
  puts the variables on a single device (which may be a worker device or may be
  the CPU), and `tf.distribute.experimental.ParameterServerStrategy` puts the
  variables on separate machines called _parameter servers_ (see below).
* A _replica_ is one copy of the model, running on one slice of the
  input data. Right now each replica is executed on its own
  worker device, but once we add support for model parallelism
  a replica may span multiple worker devices.
* A _host_ is the CPU device on a machine with worker devices, typically
  used for running input pipelines.
* A _worker_ is defined to be the physical machine(s) containing the physical
  devices (e.g. GPUs, TPUs) on which the replicated computation is executed. A
  worker may contain one or more replicas, but contains at least one
  replica. Typically one worker will correspond to one machine, but in the case
  of very large models with model parallelism, one worker may span multiple
  machines. We typically run one input pipeline per worker, feeding all the
  replicas on that worker.
* _Synchronous_, or more commonly _sync_, training is where the updates from
  each replica are aggregated together before updating the model variables. This
  is in contrast to _asynchronous_, or _async_ training, where each replica
  updates the model variables independently. You may also have replicas
  partitioned into groups which are in sync within each group but async between
  groups.
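
  The sync/async distinction can be sketched in plain Python (a conceptual
  illustration with made-up numbers, not TensorFlow API):

```python
# Plain-Python sketch (not TensorFlow API) of sync vs. async updates.
replica_grads = {0: 1.0, 1: 3.0}  # gradient computed on each replica

# Sync training: aggregate gradients from all replicas first, then apply
# one combined update so every copy of the variable stays identical.
var_sync = 10.0
mean_grad = sum(replica_grads.values()) / len(replica_grads)
var_sync -= 0.1 * mean_grad  # single update built from all replicas

# Async training: each replica applies its own gradient to the shared
# variable independently, with no aggregation step.
var_async = 10.0
for grad in replica_grads.values():
    var_async -= 0.1 * grad
```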
* _Parameter servers_: These are machines that hold a single copy of
  parameters/variables, used by some strategies (right now just
  `tf.distribute.experimental.ParameterServerStrategy`). All replicas that want
  to operate on a variable retrieve it at the beginning of a step and send an
  update to be applied at the end of the step. These can in principle support
  either sync or async training, but right now we only have support for async
  training with parameter servers. Compare to
  `tf.distribute.experimental.CentralStorageStrategy`, which puts all variables
  on a single device on the same machine (and does sync training), and
  `tf.distribute.MirroredStrategy`, which mirrors variables to multiple devices
  (see below).
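
  The fetch-at-step-start / apply-at-step-end cycle above can be sketched in
  plain Python (a conceptual illustration, not the TensorFlow implementation):

```python
# Plain-Python sketch (not TensorFlow API) of the parameter-server step:
# replicas retrieve the variable at the start of a step and send back
# updates that the server applies at the end of the step.
server = {"w": 1.0}  # the single copy of the variable, held by the server

def run_step(replica_inputs, lr=0.1):
    w = server["w"]  # each replica retrieves the variable at step start
    # Each replica computes its own update from its slice of the data;
    # the "gradient" x * w here is purely illustrative.
    updates = [lr * x * w for x in replica_inputs]
    for u in updates:  # the server applies the updates at step end (async)
        server["w"] = server["w"] - u

run_step([2.0, 4.0])
```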

* _Replica context_ vs. _Cross-replica context_ vs. _Update context_

  A _replica context_ applies
  when you execute the computation function that was called with `strategy.run`.
  Conceptually, you're in replica context when executing the computation
  function that is being replicated.

  An _update context_ is entered in a `tf.distribute.StrategyExtended.update`
  call.

  A _cross-replica context_ is entered when you enter a `strategy.scope`. This
  is useful for calling `tf.distribute.Strategy` methods which operate across
  the replicas (like `reduce_to()`). By default you start in a _replica context_
  (the "default single _replica context_") and then some methods can switch you
  back and forth.

* _Distributed value_: A distributed value is represented by the base class
  `tf.distribute.DistributedValues`. `tf.distribute.DistributedValues` is useful
  to represent values on multiple devices, and it contains a map from replica id
  to values. Two representative types of `tf.distribute.DistributedValues`
  are `tf.types.experimental.PerReplica` and `tf.types.experimental.Mirrored`
  values.

  `PerReplica` values exist on the worker devices, with a different value for
  each replica. They are produced by iterating through a distributed dataset
  returned by `tf.distribute.Strategy.experimental_distribute_dataset` and
  `tf.distribute.Strategy.distribute_datasets_from_function`. They are also the
  typical result returned by `tf.distribute.Strategy.run`.

  `Mirrored` values are like `PerReplica` values, except we know that the value
  on all replicas is the same. `Mirrored` values are kept synchronized by the
  distribution strategy in use, while `PerReplica` values are left
  unsynchronized. `Mirrored` values typically represent model weights. We can
  safely read a `Mirrored` value in a cross-replica context by using the value
  on any replica, while `PerReplica` values can only be read within a replica
  context.
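
  Modeling both kinds as a plain map from replica id to value makes the
  difference concrete (a conceptual sketch, not the TensorFlow classes):

```python
# Plain-Python sketch (not TensorFlow API) of the two distributed-value
# kinds, modeled as maps from replica id to value.
per_replica = {0: [1, 2], 1: [3, 4]}  # a different value on each replica
mirrored = {0: 0.5, 1: 0.5}           # the same value on every replica

# A Mirrored value can be read safely from any replica's copy...
assert len(set(mirrored.values())) == 1
w = mirrored[0]

# ...while a PerReplica value only makes sense per replica: reading it
# "globally" would have to pick or combine the per-replica components.
batch_on_replica_1 = per_replica[1]
```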

* _Unwrapping_ and _merging_: Consider calling a function `fn` on multiple
  replicas, like `strategy.run(fn, args=[w])` with an
  argument `w` that is a `tf.distribute.DistributedValues`. This means `w` will
  have a map taking replica id `0` to `w0`, replica id `1` to `w1`, etc.
  `strategy.run()` unwraps `w` before calling `fn`, so it calls `fn(w0)` on
  device `d0`, `fn(w1)` on device `d1`, etc. It then merges the return
  values from `fn()`, which leads to one common object if the returned values
  are the same object from every replica, or a `DistributedValues` object
  otherwise.
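
  The unwrap/merge behavior can be sketched in plain Python (a conceptual
  illustration, not the TensorFlow implementation):

```python
# Plain-Python sketch (not TensorFlow API) of unwrapping and merging for
# strategy.run(fn, args=[w]).
w = {0: 2.0, 1: 3.0}  # a distributed value: replica id -> component

def fn(x):
    return x * x

# Unwrap: call fn once per replica, with that replica's component.
results = {rid: fn(component) for rid, component in w.items()}

# Merge: if every replica returned the same object, the result is one
# common value; otherwise it stays a distributed (per-replica) value.
values = list(results.values())
merged = values[0] if all(v is values[0] for v in values) else results
```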

* _Reductions_ and _all-reduce_: A _reduction_ is a method of aggregating
  multiple values into one value, like "sum" or "mean". If a strategy is doing
  sync training, we will perform a reduction on the gradients to a parameter
  from all replicas before applying the update. _All-reduce_ is an algorithm for
  performing a reduction on values from multiple devices and making the result
  available on all of those devices.
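
  The "reduce, then make the result available everywhere" behavior can be
  sketched in plain Python (not a real collective implementation):

```python
# Plain-Python sketch of all-reduce semantics: reduce values from all
# devices, then give every device the same reduced result.
def all_reduce(per_device, op="sum"):
    total = sum(per_device.values())
    reduced = total if op == "sum" else total / len(per_device)
    # Every device receives the same reduced result.
    return {device: reduced for device in per_device}

grads = {"gpu:0": 1.0, "gpu:1": 3.0}
summed = all_reduce(grads)
averaged = all_reduce(grads, op="mean")
```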

* _Mirrored variables_: These are variables that are created on multiple
  devices, where we keep the variables in sync by applying the same
  updates to every copy. Mirrored variables are created with
  `tf.Variable(...synchronization=tf.VariableSynchronization.ON_WRITE...)`.
  Normally they are only used in synchronous training.

* _SyncOnRead variables_

  _SyncOnRead variables_ are created by
  `tf.Variable(...synchronization=tf.VariableSynchronization.ON_READ...)`, and
  they are created on multiple devices. In replica context, each
  component variable on the local replica can perform reads and writes without
  synchronization with each other. When the
  _SyncOnRead variable_ is read in cross-replica context, the values from
  component variables are aggregated and returned.

  _SyncOnRead variables_ bring a lot of custom configuration difficulty to the
  underlying logic, so we do not encourage users to instantiate and use
  _SyncOnRead variables_ on their own. We have mainly used _SyncOnRead
  variables_ for use cases such as batch norm and metrics. For performance
  reasons, we often don't need to keep these statistics in sync every step and
  they can be accumulated on each replica independently. The only time we want
  to sync them is when reporting or checkpointing, which typically happens in
  cross-replica context. _SyncOnRead variables_ are also often used by advanced
  users who want to control when variable values are aggregated. For example,
  users sometimes want to maintain gradients independently on each replica for a
  couple of steps without aggregation.

* _Distribute-aware layers_

  Layers are generally called in a replica context, except when defining a
  Keras functional model. `tf.distribute.in_cross_replica_context` will let you
  determine which case you are in. If in a replica context,
  the `tf.distribute.get_replica_context` function will return the default
  replica context outside a strategy scope, `None` within a strategy scope, and
  a `tf.distribute.ReplicaContext` object inside a strategy scope and within a
  `tf.distribute.Strategy.run` function. The `ReplicaContext` object has an
  `all_reduce` method for aggregating across all replicas.

Note that we provide a default version of `tf.distribute.Strategy` that is
used when no other strategy is in scope; it provides the same API with
reasonable default behavior.

"""

import sys as _sys

from . import cluster_resolver
from . import coordinator
from . import experimental
from tensorflow.python.distribute.collective_all_reduce_strategy import CollectiveAllReduceStrategy as MultiWorkerMirroredStrategy
from tensorflow.python.distribute.cross_device_ops import CrossDeviceOps
from tensorflow.python.distribute.cross_device_ops import HierarchicalCopyAllReduce
from tensorflow.python.distribute.cross_device_ops import NcclAllReduce
from tensorflow.python.distribute.cross_device_ops import ReductionToOneDevice
from tensorflow.python.distribute.distribute_lib import InputContext
from tensorflow.python.distribute.distribute_lib import InputOptions
from tensorflow.python.distribute.distribute_lib import InputReplicationMode
from tensorflow.python.distribute.distribute_lib import ReplicaContext
from tensorflow.python.distribute.distribute_lib import RunOptions
from tensorflow.python.distribute.distribute_lib import Strategy
from tensorflow.python.distribute.distribute_lib import StrategyExtendedV2 as StrategyExtended
from tensorflow.python.distribute.distribute_lib import experimental_set_strategy
from tensorflow.python.distribute.distribute_lib import get_replica_context
from tensorflow.python.distribute.distribute_lib import get_strategy
from tensorflow.python.distribute.distribute_lib import has_strategy
from tensorflow.python.distribute.distribute_lib import in_cross_replica_context
from tensorflow.python.distribute.mirrored_strategy import MirroredStrategy
from tensorflow.python.distribute.one_device_strategy import OneDeviceStrategy
from tensorflow.python.distribute.parameter_server_strategy_v2 import ParameterServerStrategyV2 as ParameterServerStrategy
from tensorflow.python.distribute.reduce_util import ReduceOp
from tensorflow.python.distribute.tpu_strategy import TPUStrategyV2 as TPUStrategy
from tensorflow.python.training.server_lib import Server
from tensorflow.python.types.distribute import DistributedDatasetInterface as DistributedDataset
from tensorflow.python.types.distribute import DistributedIteratorInterface as DistributedIterator
from tensorflow.python.types.distribute import DistributedValues