Coverage for /pythoncovmergedfiles/medio/medio/usr/local/lib/python3.8/site-packages/tensorflow/_api/v2/compat/v2/distribute/__init__.py: 100%

31 statements

coverage.py v7.4.0, created at 2024-01-03 07:57 +0000

# This file is MACHINE GENERATED! Do not edit.
# Generated by: tensorflow/python/tools/api/generator/create_python_api.py script.
"""Library for running a computation across multiple devices.

The intent of this library is that you can write an algorithm in a stylized way
and it will be usable with a variety of different `tf.distribute.Strategy`
implementations. Each descendant will implement a different strategy for
distributing the algorithm across multiple devices/machines. Furthermore, these
changes can be hidden inside the specific layers and other library classes that
need special treatment to run in a distributed setting, so that most users'
model definition code can run unchanged. The `tf.distribute.Strategy` API works
the same way with eager and graph execution.

*Guides*

* [TensorFlow v2.x](https://www.tensorflow.org/guide/distributed_training)
* [TensorFlow v1.x](https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/distribute_strategy.ipynb)

*Tutorials*

* [Distributed Training Tutorials](https://www.tensorflow.org/tutorials/distribute/)

  The tutorials cover how to use `tf.distribute.Strategy` to do distributed
  training with native Keras APIs, custom training loops,
  and Estimator APIs. They also cover how to save/load a model when using
  `tf.distribute.Strategy`.

*Glossary*

* _Data parallelism_ is where we run multiple copies of the model
  on different slices of the input data. This is in contrast to
  _model parallelism_, where we divide up a single copy of a model
  across multiple devices.
  Note: we only support data parallelism for now, but
  hope to add support for model parallelism in the future.
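
  A sync data-parallel step can be sketched in a few lines of plain Python
  (a toy illustration of the idea, not TensorFlow's implementation): each
  replica holds the same weight, sees a different slice of the batch, and the
  per-replica gradients are averaged before the update.

```python
# Toy sketch of sync data parallelism (plain Python, not TensorFlow).

def grad(w, x):
    # Gradient of the loss 0.5 * (w * x)**2 with respect to w.
    return (w * x) * x

def data_parallel_step(w, batch, num_replicas, lr=0.1):
    # Split the batch into one slice per replica.
    slices = [batch[i::num_replicas] for i in range(num_replicas)]
    # Each replica computes a gradient on its own slice of the data.
    replica_grads = [sum(grad(w, x) for x in s) / len(s) for s in slices]
    # Sync training: aggregate (here, average) before updating the weight.
    g = sum(replica_grads) / num_replicas
    return w - lr * g

w = data_parallel_step(1.0, [1.0, 2.0, 3.0, 4.0], num_replicas=2)
```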

* A _device_ is a CPU or accelerator (e.g. GPUs, TPUs) on some machine that
  TensorFlow can run operations on (see e.g. `tf.device`). You may have multiple
  devices on a single machine, or be connected to devices on multiple
  machines. Devices used to run computations are called _worker devices_.
  Devices used to store variables are _parameter devices_. For some strategies,
  such as `tf.distribute.MirroredStrategy`, the worker and parameter devices
  will be the same (see mirrored variables below). For others they will be
  different. For example, `tf.distribute.experimental.CentralStorageStrategy`
  puts the variables on a single device (which may be a worker device or may be
  the CPU), and `tf.distribute.experimental.ParameterServerStrategy` puts the
  variables on separate machines called _parameter servers_ (see below).
* A _replica_ is one copy of the model, running on one slice of the
  input data. Right now each replica is executed on its own
  worker device, but once we add support for model parallelism
  a replica may span multiple worker devices.
* A _host_ is the CPU device on a machine with worker devices, typically
  used for running input pipelines.
* A _worker_ is defined to be the physical machine(s) containing the physical
  devices (e.g. GPUs, TPUs) on which the replicated computation is executed. A
  worker may contain one or more replicas, but contains at least one
  replica. Typically one worker will correspond to one machine, but in the case
  of very large models with model parallelism, one worker may span multiple
  machines. We typically run one input pipeline per worker, feeding all the
  replicas on that worker.
* _Synchronous_, or more commonly _sync_, training is where the updates from
  each replica are aggregated together before updating the model variables. This
  is in contrast to _asynchronous_, or _async_, training, where each replica
  updates the model variables independently. You may also have replicas
  partitioned into groups which are in sync within each group but async between
  groups.
* _Parameter servers_: These are machines that hold a single copy of
  parameters/variables, used by some strategies (right now just
  `tf.distribute.experimental.ParameterServerStrategy`). All replicas that want
  to operate on a variable retrieve it at the beginning of a step and send an
  update to be applied at the end of the step. These can in principle support
  either sync or async training, but right now we only have support for async
  training with parameter servers. Compare to
  `tf.distribute.experimental.CentralStorageStrategy`, which puts all variables
  on a single device on the same machine (and does sync training), and
  `tf.distribute.MirroredStrategy`, which mirrors variables to multiple devices
  (see below).
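
  The fetch-then-update round trip can be sketched as a plain-Python toy
  (conceptual only; the real `ParameterServerStrategy` protocol involves RPCs
  and is more involved):

```python
# Toy async parameter-server step (conceptual, not TensorFlow's protocol).

class ParameterServer:
    """Holds the single authoritative copy of a variable."""

    def __init__(self, value):
        self.value = value

    def fetch(self):
        # Replicas retrieve the variable at the beginning of a step.
        return self.value

    def apply_update(self, delta):
        # Async: updates are applied as they arrive, with no barrier.
        self.value += delta

def replica_step(server, grad, lr=0.1):
    w = server.fetch()            # read current parameters
    server.apply_update(-lr * grad)  # send an update at the end of the step
    return w

ps = ParameterServer(1.0)
replica_step(ps, grad=2.0)  # replica 0
replica_step(ps, grad=4.0)  # replica 1 (may run at any time relative to 0)
```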


* _Replica context_ vs. _Cross-replica context_ vs. _Update context_

  A _replica context_ applies
  when you execute the computation function that was called with `strategy.run`.
  Conceptually, you're in replica context when executing the computation
  function that is being replicated.

  An _update context_ is entered in a `tf.distribute.StrategyExtended.update`
  call.

  A _cross-replica context_ is entered when you enter a `strategy.scope`. This
  is useful for calling `tf.distribute.Strategy` methods which operate across
  the replicas (like `reduce_to()`). By default you start in a _replica context_
  (the "default single _replica context_") and then some methods can switch you
  back and forth.

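  How `scope` and `run` switch between the two main contexts can be modeled
  with a hypothetical `ToyStrategy` class (a plain-Python sketch of the
  nesting behavior, not TensorFlow's API):

```python
# Toy model of replica vs. cross-replica context (not TensorFlow's code).
from contextlib import contextmanager

class ToyStrategy:
    def __init__(self):
        self.in_cross_replica = False

    @contextmanager
    def scope(self):
        # Entering strategy.scope() puts you in cross-replica context.
        self.in_cross_replica = True
        try:
            yield self
        finally:
            self.in_cross_replica = False

    def run(self, fn, args=()):
        # strategy.run switches back to replica context to execute fn.
        prev, self.in_cross_replica = self.in_cross_replica, False
        try:
            return fn(*args)
        finally:
            self.in_cross_replica = prev

strategy = ToyStrategy()
contexts = []
with strategy.scope():
    contexts.append(strategy.in_cross_replica)  # cross-replica context
    strategy.run(lambda: contexts.append(strategy.in_cross_replica))
```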

* _Distributed value_: A distributed value is represented by the base class
  `tf.distribute.DistributedValues`. `tf.distribute.DistributedValues` is useful
  for representing values on multiple devices, and it contains a map from replica
  id to values. Two representative types of `tf.distribute.DistributedValues`
  are `tf.types.experimental.PerReplica` and `tf.types.experimental.Mirrored`
  values.

  `PerReplica` values exist on the worker devices, with a different value for
  each replica. They are produced by iterating through a distributed dataset
  returned by `tf.distribute.Strategy.experimental_distribute_dataset` and
  `tf.distribute.Strategy.distribute_datasets_from_function`. They are also the
  typical result returned by `tf.distribute.Strategy.run`.

  `Mirrored` values are like `PerReplica` values, except we know that the
  values on all replicas are the same. `Mirrored` values are kept synchronized
  by the distribution strategy in use, while `PerReplica` values are left
  unsynchronized. `Mirrored` values typically represent model weights. We can
  safely read a `Mirrored` value in a cross-replica context by using the value
  on any replica, while `PerReplica` values can only be read within a replica
  context.

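  The replica-id-to-value map can be pictured with minimal stand-in classes
  (illustrative only; the real types live in `tf.types.experimental` and do
  much more):

```python
# Minimal stand-ins for PerReplica / Mirrored values (illustrative only).

class PerReplica:
    """A different value on each replica: a map from replica id to value."""

    def __init__(self, values):
        self.values = dict(enumerate(values))

    def read(self, replica_id):
        # PerReplica values are only meaningful within a replica context.
        return self.values[replica_id]

class Mirrored(PerReplica):
    """The same value on every replica, kept in sync by the strategy."""

    def __init__(self, value, num_replicas):
        super().__init__([value] * num_replicas)

    def read_cross_replica(self):
        # Safe to read from any replica, since all copies are identical.
        return self.values[0]

batch = PerReplica([10, 20])        # e.g. per-replica slices of a dataset
weights = Mirrored(0.5, num_replicas=2)  # e.g. a model weight
```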

* _Unwrapping_ and _merging_: Consider calling a function `fn` on multiple
  replicas, like `strategy.run(fn, args=[w])` with an
  argument `w` that is a `tf.distribute.DistributedValues`. This means `w` will
  have a map taking replica id `0` to `w0`, replica id `1` to `w1`, etc.
  `strategy.run()` unwraps `w` before calling `fn`, so it calls `fn(w0)` on
  device `d0`, `fn(w1)` on device `d1`, etc. It then merges the return
  values from `fn()`, which leads to one common object if the returned values
  are the same object from every replica, or a `DistributedValues` object
  otherwise.

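  This unwrap-then-merge behavior can be approximated in plain Python
  (a rough sketch where a "distributed value" is just a dict from replica id
  to value, not the real `strategy.run` implementation):

```python
# Toy unwrap/merge, with per-replica values modeled as {replica_id: value}.

def toy_run(fn, per_replica_args, num_replicas):
    # Unwrap: call fn with replica i's component of each argument.
    results = [
        fn(*[arg[i] for arg in per_replica_args]) for i in range(num_replicas)
    ]
    # Merge: one common object if every replica returned the same value,
    # otherwise a per-replica map of the results.
    if all(r == results[0] for r in results):
        return results[0]
    return dict(enumerate(results))

w = {0: 1, 1: 2}                 # a "distributed value" for argument w
doubled = toy_run(lambda x: 2 * x, [w], num_replicas=2)  # differs per replica
zeroed = toy_run(lambda x: x * 0, [w], num_replicas=2)   # same on all replicas
```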

* _Reductions_ and _all-reduce_: A _reduction_ is a method of aggregating
  multiple values into one value, like "sum" or "mean". If a strategy is doing
  sync training, we will perform a reduction on the gradients to a parameter
  from all replicas before applying the update. _All-reduce_ is an algorithm for
  performing a reduction on values from multiple devices and making the result
  available on all of those devices.

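  For example, a naive all-reduce over per-device values fits in a few lines
  of plain Python (real implementations, such as NCCL's ring all-reduce, avoid
  gathering everything in one place):

```python
# Naive all-reduce: every device ends up with the reduction of all inputs.

def all_reduce(per_device_values, op="sum"):
    total = sum(per_device_values)
    reduced = total if op == "sum" else total / len(per_device_values)
    # Make the result available on every device.
    return [reduced] * len(per_device_values)

# Gradients for the same parameter computed on three devices:
grads = [0.1, 0.3, 0.2]
synced = all_reduce(grads, op="mean")
```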

* _Mirrored variables_: These are variables that are created on multiple
  devices, where we keep the variables in sync by applying the same
  updates to every copy. Mirrored variables are created with
  `tf.Variable(...synchronization=tf.VariableSynchronization.ON_WRITE...)`.
  Normally they are only used in synchronous training.

* _SyncOnRead variables_

  _SyncOnRead variables_ are created by
  `tf.Variable(...synchronization=tf.VariableSynchronization.ON_READ...)`, and
  they are created on multiple devices. In replica context, the
  component variables on the local replicas can perform reads and writes without
  synchronization with each other. When a
  _SyncOnRead variable_ is read in cross-replica context, the values from the
  component variables are aggregated and returned.

  _SyncOnRead variables_ add considerable complexity to the
  underlying logic, so we do not encourage users to instantiate and use
  _SyncOnRead variables_ on their own. We have mainly used _SyncOnRead
  variables_ for use cases such as batch norm and metrics. For performance
  reasons, we often don't need to keep these statistics in sync every step and
  they can be accumulated on each replica independently. The only time we want
  to sync them is when reporting or checkpointing, which typically happens in
  cross-replica context. _SyncOnRead variables_ are also often used by advanced
  users who want to control when variable values are aggregated. For example,
  users sometimes want to maintain gradients independently on each replica for a
  couple of steps without aggregation.

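  The behavior can be sketched with a toy class (illustrative only, using a
  sum aggregation; not TensorFlow's implementation):

```python
# Toy SyncOnRead variable: per-replica writes, aggregated cross-replica reads.

class ToySyncOnRead:
    def __init__(self, num_replicas, aggregation=sum):
        # One unsynchronized component variable per replica.
        self.components = [0.0] * num_replicas
        self.aggregation = aggregation

    def assign_add(self, replica_id, value):
        # In replica context, each replica updates its own component freely.
        self.components[replica_id] += value

    def read_cross_replica(self):
        # In cross-replica context (e.g. reporting or checkpointing),
        # the component values are aggregated and returned.
        return self.aggregation(self.components)

# e.g. a metric accumulated independently on two replicas:
counter = ToySyncOnRead(num_replicas=2)
counter.assign_add(0, 3.0)
counter.assign_add(1, 4.0)
total = counter.read_cross_replica()
```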

* _Distribute-aware layers_

  Layers are generally called in a replica context, except when defining a
  Keras functional model. `tf.distribute.in_cross_replica_context` will let you
  determine which case you are in. If in a replica context,
  the `tf.distribute.get_replica_context` function will return the default
  replica context outside a strategy scope, `None` within a strategy scope, and
  a `tf.distribute.ReplicaContext` object inside a strategy scope and within a
  `tf.distribute.Strategy.run` function. The `ReplicaContext` object has an
  `all_reduce` method for aggregating across all replicas.

Note that we provide a default version of `tf.distribute.Strategy` that is
used when no other strategy is in scope, which provides the same API with
reasonable default behavior.
"""


import sys as _sys

from . import cluster_resolver
from . import coordinator
from . import experimental
from tensorflow.python.distribute.collective_all_reduce_strategy import CollectiveAllReduceStrategy as MultiWorkerMirroredStrategy
from tensorflow.python.distribute.cross_device_ops import CrossDeviceOps
from tensorflow.python.distribute.cross_device_ops import HierarchicalCopyAllReduce
from tensorflow.python.distribute.cross_device_ops import NcclAllReduce
from tensorflow.python.distribute.cross_device_ops import ReductionToOneDevice
from tensorflow.python.distribute.distribute_lib import InputContext
from tensorflow.python.distribute.distribute_lib import InputOptions
from tensorflow.python.distribute.distribute_lib import InputReplicationMode
from tensorflow.python.distribute.distribute_lib import ReplicaContext
from tensorflow.python.distribute.distribute_lib import RunOptions
from tensorflow.python.distribute.distribute_lib import Strategy
from tensorflow.python.distribute.distribute_lib import StrategyExtendedV2 as StrategyExtended
from tensorflow.python.distribute.distribute_lib import experimental_set_strategy
from tensorflow.python.distribute.distribute_lib import get_replica_context
from tensorflow.python.distribute.distribute_lib import get_strategy
from tensorflow.python.distribute.distribute_lib import has_strategy
from tensorflow.python.distribute.distribute_lib import in_cross_replica_context
from tensorflow.python.distribute.mirrored_strategy import MirroredStrategy
from tensorflow.python.distribute.one_device_strategy import OneDeviceStrategy
from tensorflow.python.distribute.parameter_server_strategy_v2 import ParameterServerStrategyV2 as ParameterServerStrategy
from tensorflow.python.distribute.reduce_util import ReduceOp
from tensorflow.python.distribute.tpu_strategy import TPUStrategyV2 as TPUStrategy
from tensorflow.python.training.server_lib import Server
from tensorflow.python.types.distribute import DistributedDatasetInterface as DistributedDataset
from tensorflow.python.types.distribute import DistributedIteratorInterface as DistributedIterator
from tensorflow.python.types.distribute import DistributedValues