:orphan:

.. _remote-reference-protocol:

Remote Reference Protocol
=========================

This note describes the design details of the Remote Reference protocol and walks
through message flows in different scenarios. Make sure you're familiar with the
:ref:`distributed-rpc-framework` before proceeding.

Background
^^^^^^^^^^

RRef stands for Remote REFerence. It is a reference to an object which is
located on the local or remote worker, and transparently handles reference
counting under the hood. Conceptually, it can be considered as a distributed
shared pointer. Applications can create an RRef by calling
:meth:`~torch.distributed.rpc.remote`. Each RRef is owned by the callee worker
of the :meth:`~torch.distributed.rpc.remote` call (i.e., owner) and can be used
by multiple users. The owner stores the real data and keeps track of the global
reference count.
Every RRef can be uniquely identified by a global ``RRefId``,
which is assigned at the time of creation on the caller of the
:meth:`~torch.distributed.rpc.remote` call.

On the owner worker, there is only one ``OwnerRRef`` instance, which contains
the real data, while on user workers, there can be as many ``UserRRefs`` as
necessary, and ``UserRRef`` does not hold the data. All usage on the owner will
retrieve the unique ``OwnerRRef`` instance using the globally unique ``RRefId``.
A ``UserRRef`` will be created when it is used as an argument or return value in
:meth:`~torch.distributed.rpc.rpc_sync`,
:meth:`~torch.distributed.rpc.rpc_async` or
:meth:`~torch.distributed.rpc.remote` invocation, and the owner will be notified
accordingly to update the reference count. An ``OwnerRRef`` and its data will be
deleted when there are no ``UserRRef`` instances globally and there are no
references to the ``OwnerRRef`` on the owner either.


Assumptions
^^^^^^^^^^^

RRef protocol is designed with the following assumptions.

- **Transient Network Failures**: The RRef design handles transient
  network failures by retrying messages. It cannot handle node crashes or
  permanent network partitions. When those incidents occur, the application
  should take down all workers, revert to the previous checkpoint, and resume
  training.
- **Non-idempotent UDFs**: We assume the user functions (UDF) provided to
  :meth:`~torch.distributed.rpc.rpc_sync`,
  :meth:`~torch.distributed.rpc.rpc_async` or
  :meth:`~torch.distributed.rpc.remote` are not idempotent and therefore
  cannot be retried. However, internal RRef control messages are idempotent and
  retried upon message failure.
- **Out of Order Message Delivery**: We do not assume message delivery order
  between any pair of nodes, because both sender and receiver are using multiple
  threads. There is no guarantee on which message will be processed first.


RRef Lifetime
^^^^^^^^^^^^^

The goal of the protocol is to delete an ``OwnerRRef`` at an appropriate time.
The right time to delete an ``OwnerRRef`` is when there are no living
``UserRRef`` instances and user code is not holding references to the
``OwnerRRef`` either. The tricky part is to determine if there are any living
``UserRRef`` instances.

Design Reasoning
----------------

A user can get a ``UserRRef`` in three situations:

1) Receiving a ``UserRRef`` from the owner.
2) Receiving a ``UserRRef`` from another user.
3) Creating a new ``UserRRef`` owned by another worker.


Case 1 is the simplest: the owner passes its RRef to a user by calling
:meth:`~torch.distributed.rpc.rpc_sync`,
:meth:`~torch.distributed.rpc.rpc_async`, or
:meth:`~torch.distributed.rpc.remote` and using its RRef as an argument. In this
case a new ``UserRRef`` will be created on the user. As the owner is the caller,
it can easily update its local reference count on the ``OwnerRRef``.

The only requirement is that any
``UserRRef`` must notify the owner upon destruction. Hence, we need the first
guarantee:

**G1. The owner will be notified when any UserRRef is deleted.**

As messages might come delayed or out-of-order, we need one more guarantee to
make sure the delete message is not processed too soon. If A sends a message to
B that involves an RRef, we call the RRef on A the parent RRef and the RRef on B
the child RRef.

**G2. Parent RRef will NOT be deleted until the child RRef is confirmed by the
owner.**

In cases 2 and 3, it is possible that the owner has only partial or no knowledge
at all about the RRef fork graph. For example, an RRef could be
constructed on a user, and before the owner receives any RPC call, the
creator user might have already shared the RRef with other users, and those
users could further share the RRef.
One invariant is that the fork graph of
any RRef is always a tree, because forking an RRef always
creates a new ``UserRRef`` instance on the callee (except if the callee is the
owner), and hence every RRef has a single parent.

The owner's view on any ``UserRRef`` in the tree has three stages:

.. code::

  1) unknown -> 2) known -> 3) deleted.

The owner's view of the entire tree keeps changing. The owner deletes its
``OwnerRRef`` instance when it thinks there are no living ``UserRRef``
instances, i.e.,
when ``OwnerRRef`` is deleted, all ``UserRRef`` instances could be either indeed
deleted or unknown. The dangerous case is when some forks are unknown and others
are deleted.

**G2** trivially guarantees that no parent ``UserRRef`` can be deleted before
the owner knows all of its children ``UserRRef`` instances. However, it is
possible that the child ``UserRRef`` may be deleted before the owner knows its
parent ``UserRRef``.

Consider the following example, where the ``OwnerRRef`` forks to A, then A forks
to Y, and Y forks to Z:

.. code::

  OwnerRRef -> A -> Y -> Z

If all of Z's messages, including the delete message, are processed by the
owner before Y's messages, the owner will learn of Z's deletion before
knowing Y exists. Nevertheless, this does not cause any problem, because at least
one of Y's ancestors (A) will be alive, and it will
prevent the owner from deleting the ``OwnerRRef``. More specifically, if the
owner does not know Y, A cannot be deleted due to **G2**, and the owner knows A
since it is A's parent.

Things get a little trickier if the RRef is created on a user:


.. code::

  OwnerRRef
      ^
      |
      A -> Y -> Z


If Z calls :meth:`~torch.distributed.rpc.RRef.to_here` on the ``UserRRef``, the
owner at least knows A when Z is deleted, because otherwise,
:meth:`~torch.distributed.rpc.RRef.to_here` wouldn't finish. If Z does not call
:meth:`~torch.distributed.rpc.RRef.to_here`, it is possible that the owner
receives all messages from Z before any message from A and Y. In this case, as
the real data of the ``OwnerRRef`` has not been created yet, there is nothing to
be deleted either. It is the same as if Z did not exist at all. Hence, it's still
OK.

Implementation
--------------

**G1** is implemented by sending out a delete message in the ``UserRRef``
destructor. To provide **G2**, the parent ``UserRRef`` is put into a context
whenever it is forked, indexed by the new ``ForkId``.
The parent ``UserRRef`` is
only removed from the context when it receives an acknowledgement message (ACK)
from the child, and the child will only send out the ACK when it is confirmed by
the owner.


Protocol Scenarios
^^^^^^^^^^^^^^^^^^

Let's now discuss how the above designs translate to the protocol in four
scenarios.

User Share RRef with Owner as Return Value
------------------------------------------


.. code::

  import torch
  import torch.distributed.rpc as rpc

  # on worker A
  rref = rpc.remote('B', torch.add, args=(torch.ones(2), 1))
  # say the rref has RRefId 100 and ForkId 1
  rref.to_here()


In this case, the ``UserRRef`` is created on the user worker A, then it is
passed to the owner worker B together with the remote message, and then B
creates the ``OwnerRRef``. The method :meth:`~torch.distributed.rpc.remote`
returns immediately, meaning that the ``UserRRef`` can be forked/used before
the owner knows about it.

On the owner, when receiving the :meth:`~torch.distributed.rpc.remote` call, it
will create the ``OwnerRRef``, and return an ACK to acknowledge ``{100, 1}``
(``RRefId``, ``ForkId``). Only after receiving this ACK can A delete its
``UserRRef``. This involves both **G1** and **G2**. **G1** is obvious. For
**G2**, the ``OwnerRRef`` is a child of the ``UserRRef``, and the ``UserRRef``
is not deleted until it receives the ACK from the owner.

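
The gating rule described above can be sketched as a small toy model. This is
not the PyTorch implementation: ``ToyUserRRef``, its flags, and the ``outbox``
list are hypothetical names chosen only to illustrate that the delete message
leaves A only once the owner has acknowledged the fork and local code has
released the reference.

```python
from dataclasses import dataclass, field


@dataclass
class ToyUserRRef:
    """Toy stand-in for a UserRRef on worker A (hypothetical, not the real class)."""

    rref_id: int
    fork_id: int
    acked_by_owner: bool = False    # set when the owner's ACK arrives (G2)
    released_locally: bool = False  # set when user code drops the reference
    outbox: list = field(default_factory=list)  # messages "sent" to the owner

    def on_owner_ack(self):
        self.acked_by_owner = True
        self._maybe_send_delete()

    def on_local_release(self):
        self.released_locally = True
        self._maybe_send_delete()

    def _maybe_send_delete(self):
        # Send the delete message (G1) at most once, and only when both
        # conditions hold.
        if self.acked_by_owner and self.released_locally and not self.outbox:
            self.outbox.append(("delete", self.rref_id, self.fork_id))


user_rref = ToyUserRRef(rref_id=100, fork_id=1)
user_rref.on_local_release()  # released before the ACK: nothing is sent yet
user_rref.on_owner_ack()      # the ACK arrives: the delete message goes out
```

In this sketch the two events may happen in either order, and the delete
message is emitted by whichever one arrives last.
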

.. image:: https://user-images.githubusercontent.com/16999635/69164772-98181300-0abe-11ea-93a7-9ad9f757cd94.png
    :alt: user_to_owner_ret.png
    :width: 500 px

The diagram above shows the message flow, where solid arrows contain user
functions and dashed arrows are builtin messages. Note that the first two messages
from A to B (:meth:`~torch.distributed.rpc.remote` and
:meth:`~torch.distributed.rpc.RRef.to_here`) may
arrive at B in any order, but the final delete message will only be sent out
when:

- B acknowledges ``UserRRef {100, 1}`` (G2), and
- Python GC agrees to delete the local ``UserRRef`` instance. This occurs when
  the RRef is no longer in scope and is eligible for garbage collection.



User Share RRef with Owner as Argument
--------------------------------------

.. code::

  import torch
  import torch.distributed.rpc as rpc

  # on worker A and worker B
  def func(rref):
    pass

  # on worker A
  rref = rpc.remote('B', torch.add, args=(torch.ones(2), 1))
  # say the rref has RRefId 100 and ForkId 1
  rpc.rpc_async('B', func, args=(rref, ))


In this case, after creating the ``UserRRef`` on A, A uses it as an argument in
a followup RPC call to B. A will keep ``UserRRef {100, 1}`` alive until it
receives the acknowledgement from B (**G2**, not the return value of the RPC call).
This is necessary because A should not send out the delete message until all
previous messages are received; otherwise, the ``OwnerRRef`` could be
deleted before usage, as we do not guarantee message delivery order. This is done
by creating a child ``ForkId`` of the RRef and holding it in a map until the
owner confirms the child ``ForkId``. The figure below shows the message flow.

.. image:: https://user-images.githubusercontent.com/16999635/69164845-b67e0e80-0abe-11ea-93fa-d24674e75a2b.png
    :alt: user_to_owner_arg.png
    :width: 500 px


Note that the ``UserRRef`` could be deleted on B before ``func`` finishes or even
starts. However, this is OK, as at the time B sends out the ACK for the child
``ForkId``, it has already acquired the ``OwnerRRef`` instance, which would prevent
it from being deleted too soon.


Owner Share RRef with User
--------------------------

Owner to user is the simplest case, where the owner can update reference
counting locally, and does not need any additional control message to notify
others. Regarding **G2**, it is the same as if the parent received the ACK from the
owner immediately, as the parent is the owner.

.. code::

  import torch
  import torch.distributed.rpc as rpc
  from torch.distributed.rpc import RRef

  # on worker B and worker C
  def func(rref):
    pass

  # on worker B, creating a local RRef
  rref = RRef("data")
  # say the rref has RRefId 100
  rpc.rpc_async('C', func, args=(rref, ))


.. image:: https://user-images.githubusercontent.com/16999635/69164921-c990de80-0abe-11ea-9250-d32ad00cf4ae.png
    :alt: owner_to_user.png
    :width: 500 px

The figure above shows the message flow. Note that when the ``OwnerRRef`` exits
scope after the ``rpc_async`` call, it will not be deleted, because internally
there is a map to hold it alive if there are any known forks, which in this case is
``UserRRef {100, 1}``.
(**G2**)


User Share RRef with User
-------------------------

This is the most complicated case where caller user (parent ``UserRRef``),
callee user (child ``UserRRef``), and the owner all need to get involved.

.. code::

  import torch
  import torch.distributed.rpc as rpc

  # on worker A and worker C
  def func(rref):
    pass

  # on worker A
  rref = rpc.remote('B', torch.add, args=(torch.ones(2), 1))
  # say the rref has RRefId 100 and ForkId 1
  rpc.rpc_async('C', func, args=(rref, ))

.. image:: https://user-images.githubusercontent.com/16999635/69164971-d6adcd80-0abe-11ea-971d-6b7af131f0fd.png
    :alt: user_to_user.png
    :width: 500 px

When C receives the child ``UserRRef`` from A, it sends out a fork request to
the owner B. Later, when B confirms the ``UserRRef`` on C, C will perform
two actions in parallel: 1) send out the child ACK to A, and 2) run the
user-provided function. During this time, the parent (A) will hold its
``UserRRef {100, 1}`` alive to achieve **G2**.
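
To make the three-party bookkeeping concrete, here is a minimal pure-Python
sketch of the user-to-user flow. It is a toy model under stated assumptions,
not the PyTorch implementation: ``ToyOwner``, ``ToyUser``, the
``pending_children`` set, and the child ``ForkId`` value 2 are all hypothetical,
and messages are delivered in order by direct method calls rather than over RPC.

```python
class ToyOwner:
    """Toy model of owner B: tracks the forks it has confirmed."""

    def __init__(self):
        self.known_forks = set()

    def confirm_fork(self, fork_id):
        self.known_forks.add(fork_id)

    def handle_delete(self, fork_id):
        self.known_forks.discard(fork_id)

    def can_delete_owner_rref(self):
        # In this toy model the OwnerRRef may go once no known fork is alive.
        return not self.known_forks


class ToyUser:
    """Toy model of a user holding one fork of the RRef."""

    def __init__(self, name, fork_id):
        self.name = name
        self.fork_id = fork_id
        self.pending_children = set()  # G2 context, keyed by child ForkId

    def fork_to(self, child_name, child_fork_id):
        # Forking pins the parent until the child is confirmed by the owner.
        self.pending_children.add(child_fork_id)
        return ToyUser(child_name, child_fork_id)

    def on_child_ack(self, child_fork_id):
        self.pending_children.discard(child_fork_id)

    def try_delete(self, owner):
        if self.pending_children:
            return False  # G2: a child fork is still unconfirmed
        owner.handle_delete(self.fork_id)  # G1: notify the owner
        return True


def run_user_to_user():
    owner = ToyOwner()
    a = ToyUser("A", fork_id=1)
    owner.confirm_fork(a.fork_id)        # B already acknowledged {100, 1}
    c = a.fork_to("C", child_fork_id=2)  # A passes the RRef to C
    early = a.try_delete(owner)          # blocked while fork 2 is unconfirmed
    owner.confirm_fork(c.fork_id)        # C's fork request is confirmed by B
    a.on_child_ack(c.fork_id)            # C sends the child ACK to A
    late = a.try_delete(owner)           # A may now send its delete message
    alive_after_a = not owner.can_delete_owner_rref()
    c.try_delete(owner)
    return early, late, alive_after_a, owner.can_delete_owner_rref()
```

Calling ``run_user_to_user()`` returns ``(False, True, True, True)``: A's early
delete is blocked, it succeeds after C's ACK, the ``OwnerRRef`` stays alive
while C's fork is known, and it becomes deletable once C deletes its fork.
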