:orphan:

.. _remote-reference-protocol:

Remote Reference Protocol
=========================

This note describes the design details of the Remote Reference protocol and
walks through message flows in different scenarios. Make sure you're familiar
with the :ref:`distributed-rpc-framework` before proceeding.

Background
^^^^^^^^^^

RRef stands for Remote REFerence. It is a reference to an object that is
located on the local or a remote worker, and it transparently handles reference
counting under the hood. Conceptually, it can be considered a distributed
shared pointer. Applications can create an RRef by calling
:meth:`~torch.distributed.rpc.remote`. Each RRef is owned by the callee worker
of the :meth:`~torch.distributed.rpc.remote` call (i.e., the owner) and can be
used by multiple users. The owner stores the real data and keeps track of the
global reference count. Every RRef can be uniquely identified by a global
``RRefId``, which is assigned at the time of creation on the caller of the
:meth:`~torch.distributed.rpc.remote` call.

On the owner worker, there is only one ``OwnerRRef`` instance, which contains
the real data, while on user workers, there can be as many ``UserRRef``
instances as necessary, and a ``UserRRef`` does not hold the data. All usage on
the owner retrieves the unique ``OwnerRRef`` instance using the globally unique
``RRefId``. A ``UserRRef`` is created when it is used as an argument or return
value in a :meth:`~torch.distributed.rpc.rpc_sync`,
:meth:`~torch.distributed.rpc.rpc_async` or
:meth:`~torch.distributed.rpc.remote` invocation, and the owner is notified
accordingly to update the reference count. An ``OwnerRRef`` and its data are
deleted when there are no ``UserRRef`` instances globally and there are no
references to the ``OwnerRRef`` on the owner either.

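The bookkeeping described above can be pictured with a small, self-contained
model. The sketch below is only an illustration in plain Python and is not the
PyTorch implementation: ``ToyOwnerRRef``, ``owner_table`` and ``lookup`` are
hypothetical names, and the real reference counting lives inside the RPC
framework.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOwnerRRef:
    """Hypothetical stand-in for an OwnerRRef (not the PyTorch class)."""

    rref_id: int
    value: object
    # ForkIds of the UserRRef instances the owner currently knows about.
    user_forks: set = field(default_factory=set)
    # Number of references held by code running on the owner itself.
    local_refs: int = 1

    def is_deletable(self) -> bool:
        # Both conditions from the text must hold: no UserRRef instances
        # anywhere, and no reference left on the owner.
        return not self.user_forks and self.local_refs == 0


# Assumed per-owner table: one ToyOwnerRRef per globally unique RRefId.
owner_table = {}


def lookup(rref_id: int) -> ToyOwnerRRef:
    """Every use on the owner resolves to the same instance."""
    return owner_table[rref_id]


owner_table[100] = ToyOwnerRRef(rref_id=100, value="data")
lookup(100).user_forks.add(1)                  # a UserRRef with ForkId 1 appears
lookup(100).local_refs -= 1                    # owner-side code drops its reference
still_needed = not lookup(100).is_deletable()  # a user is still alive
lookup(100).user_forks.discard(1)              # that user deletes its RRef
now_deletable = lookup(100).is_deletable()
```

In this toy model the data stays reachable as long as either a known fork or an
owner-side reference remains, which mirrors the deletion condition stated above.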

Assumptions
^^^^^^^^^^^

The RRef protocol is designed with the following assumptions.

- **Transient Network Failures**: The RRef design handles transient
  network failures by retrying messages. It cannot handle node crashes or
  permanent network partitions. When those incidents occur, the application
  should take down all workers, revert to the previous checkpoint, and resume
  training.
- **Non-idempotent UDFs**: We assume the user-defined functions (UDFs) provided
  to :meth:`~torch.distributed.rpc.rpc_sync`,
  :meth:`~torch.distributed.rpc.rpc_async` or
  :meth:`~torch.distributed.rpc.remote` are not idempotent and therefore
  cannot be retried. However, internal RRef control messages are idempotent and
  are retried upon message failure.
- **Out of Order Message Delivery**: We do not assume any message delivery
  order between any pair of nodes, because both sender and receiver use
  multiple threads. There is no guarantee on which message will be processed
  first.

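To see why idempotent control messages are safe to retry, consider the
following minimal sketch. It is a plain-Python illustration under stated
assumptions, not the framework's code: ``ToyOwner``, ``send_with_retry`` and
the ``"add"``/``"delete"`` message names are hypothetical, and the ``deleted``
set used to ignore late repeats is a design choice made for this example only.

```python
class ToyOwner:
    """Hypothetical owner-side handler for RRef control messages."""

    def __init__(self):
        self.known = set()    # ForkIds currently considered alive
        self.deleted = set()  # ForkIds whose delete message was seen

    def handle(self, kind, fork_id):
        if kind == "add":
            # Ignore a late or repeated add for a fork that is already gone.
            if fork_id not in self.deleted:
                self.known.add(fork_id)
        elif kind == "delete":
            self.known.discard(fork_id)
            self.deleted.add(fork_id)
        else:
            raise ValueError(f"unknown control message: {kind}")


def send_with_retry(owner, message, lost_replies, max_attempts=5):
    """Deliver a message until a reply gets through.

    lost_replies simulates a transient failure: the owner processes the
    message, but the reply is dropped that many times, so the sender retries.
    Returns the number of attempts made.
    """
    for attempt in range(1, max_attempts + 1):
        owner.handle(*message)
        if attempt > lost_replies:
            return attempt
    raise RuntimeError("giving up: the failure does not look transient")


owner = ToyOwner()
attempts_add = send_with_retry(owner, ("add", 1), lost_replies=2)
state_after_add = set(owner.known)
attempts_del = send_with_retry(owner, ("delete", 1), lost_replies=1)
```

Because the handler works on sets keyed by ``ForkId``, processing the same
message several times leaves the owner in the same state as processing it once.
A user function offers no such guarantee, which is why it is not retried.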

RRef Lifetime
^^^^^^^^^^^^^

The goal of the protocol is to delete an ``OwnerRRef`` at an appropriate time.
The right time to delete an ``OwnerRRef`` is when there are no living
``UserRRef`` instances and user code is not holding references to the
``OwnerRRef`` either. The tricky part is to determine if there are any living
``UserRRef`` instances.

Design Reasoning
----------------

A user can get a ``UserRRef`` in three situations:

1) Receiving a ``UserRRef`` from the owner.
2) Receiving a ``UserRRef`` from another user.
3) Creating a new ``UserRRef`` owned by another worker.


Case 1 is the simplest: the owner passes its RRef to a user by calling
:meth:`~torch.distributed.rpc.rpc_sync`,
:meth:`~torch.distributed.rpc.rpc_async`, or
:meth:`~torch.distributed.rpc.remote` with its RRef as an argument. In this
case a new ``UserRRef`` will be created on the user. As the owner is the caller,
it can easily update its local reference count on the ``OwnerRRef``.

The only requirement is that any ``UserRRef`` must notify the owner upon
destruction. Hence, we need the first guarantee:

**G1. The owner will be notified when any UserRRef is deleted.**

As messages might come delayed or out-of-order, we need one more guarantee to
make sure the delete message is not processed too soon. If A sends a message to
B that involves an RRef, we call the RRef on A the parent RRef and the RRef on
B the child RRef.

**G2. Parent RRef will NOT be deleted until the child RRef is confirmed by the
owner.**

In cases 2 and 3, it is possible that the owner has only partial or no knowledge
at all about the RRef fork graph. For example, an RRef could be
constructed on a user, and before the owner receives any RPC call, the
creator user might have already shared the RRef with other users, and those
users could further share the RRef. One invariant is that the fork graph of
any RRef is always a tree, because forking an RRef always
creates a new ``UserRRef`` instance on the callee (except if the callee is the
owner), and hence every RRef has a single parent.

The owner's view on any ``UserRRef`` in the tree has three stages:

.. code::

  1) unknown -> 2) known -> 3) deleted.

The owner's view of the entire tree keeps changing. The owner deletes its
``OwnerRRef`` instance when it thinks there are no living ``UserRRef``
instances, i.e., at the time the ``OwnerRRef`` is deleted, every ``UserRRef``
instance is either indeed deleted or unknown to the owner. The dangerous case
is when some forks are unknown and others are deleted.

**G2** trivially guarantees that no parent ``UserRRef`` can be deleted before
the owner knows all of its child ``UserRRef`` instances. However, it is
possible that a child ``UserRRef`` is deleted before the owner knows its
parent ``UserRRef``.

Consider the following example, where the ``OwnerRRef`` forks to A, then A forks
to Y, and Y forks to Z:

.. code::

  OwnerRRef -> A -> Y -> Z

If all of Z's messages, including the delete message, are processed by the
owner before Y's messages, the owner will learn of Z's deletion before
knowing that Y exists. Nevertheless, this does not cause any problem, because
at least one of Y's ancestors (A) will be alive, and it will prevent the owner
from deleting the ``OwnerRRef``. More specifically, if the owner does not know
Y, A cannot be deleted due to **G2**, and the owner knows A since the owner is
A's parent.

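The argument above can be replayed with a small state machine. This is a
hedged sketch in plain Python rather than the actual protocol code:
``OwnerView``, ``replay`` and the ``confirm``/``delete`` event names are
hypothetical, and the model only tracks the three stages (unknown, known,
deleted) for the forks A, Y and Z of the example.

```python
# Fork tree from the example: OwnerRRef -> A -> Y -> Z (child -> parent).
PARENT = {"A": "owner", "Y": "A", "Z": "Y"}


class OwnerView:
    """Hypothetical model of what the owner knows about each fork."""

    def __init__(self):
        self.state = {fork: "unknown" for fork in PARENT}
        self.state["A"] = "known"  # the owner created this fork itself

    def confirm(self, fork):
        if self.state[fork] == "unknown":
            self.state[fork] = "known"

    def delete(self, fork):
        # G2: a parent may only go away once the owner has confirmed
        # every one of its children.
        children = [c for c, p in PARENT.items() if p == fork]
        if any(self.state[c] == "unknown" for c in children):
            raise RuntimeError(f"G2 violated: {fork} has an unconfirmed child")
        self.state[fork] = "deleted"

    def owner_may_free(self):
        # The owner frees the OwnerRRef when it sees no living UserRRef.
        return "known" not in self.state.values()


def replay(events):
    """Apply (action, fork) events and record owner_may_free() after each."""
    view = OwnerView()
    trace = []
    for action, fork in events:
        getattr(view, action)(fork)
        trace.append(view.owner_may_free())
    return trace


# All of Z's messages are processed before any of Y's.
trace = replay([
    ("confirm", "Z"), ("delete", "Z"),
    ("confirm", "Y"), ("delete", "Y"),
    ("delete", "A"),
])
```
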
Things get a little trickier if the RRef is created on a user:


.. code::

  OwnerRRef
      ^
      |
      A -> Y -> Z


If Z calls :meth:`~torch.distributed.rpc.RRef.to_here` on the ``UserRRef``, the
owner at least knows A when Z is deleted, because otherwise,
:meth:`~torch.distributed.rpc.RRef.to_here` wouldn't finish. If Z does not call
:meth:`~torch.distributed.rpc.RRef.to_here`, it is possible that the owner
receives all messages from Z before any message from A and Y. In this case, as
the real data of the ``OwnerRRef`` has not been created yet, there is nothing to
be deleted either. It is the same as if Z did not exist at all. Hence, it's
still OK.

Implementation
--------------

**G1** is implemented by sending out a delete message in the ``UserRRef``
destructor. To provide **G2**, the parent ``UserRRef`` is put into a context
whenever it is forked, indexed by the new ``ForkId``. The parent ``UserRRef`` is
only removed from the context when it receives an acknowledgement message (ACK)
from the child, and the child will only send out the ACK when it is confirmed by
the owner.

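The interplay of the two guarantees on a user worker can be sketched as
follows. This is an illustrative model only, assuming a single table per
worker: ``ToyUserRRef``, ``ForkContext`` and its method names are hypothetical
and do not correspond to classes in the RPC framework.

```python
from dataclasses import dataclass


@dataclass
class ToyUserRRef:
    """Hypothetical stand-in for a UserRRef (not the PyTorch class)."""

    rref_id: int
    fork_id: int
    released: bool = False     # local code no longer references it
    delete_sent: bool = False  # delete message already sent to the owner


class ForkContext:
    """Assumed per-worker table that keeps forked parents alive (G2)."""

    def __init__(self):
        self.pending = {}  # child ForkId -> parent ToyUserRRef
        self.outbox = []   # messages this worker sends to the owner

    def fork(self, parent, child_fork_id):
        # Forking puts the parent into the context under the new ForkId.
        self.pending[child_fork_id] = parent

    def on_child_ack(self, child_fork_id):
        # The child sends this ACK only after the owner confirmed it.
        parent = self.pending.pop(child_fork_id, None)
        if parent is not None:
            self._maybe_delete(parent)

    def release(self, parent):
        # Models the destructor running once local code drops the RRef.
        parent.released = True
        self._maybe_delete(parent)

    def _maybe_delete(self, parent):
        has_pending_child = any(p is parent for p in self.pending.values())
        if parent.released and not has_pending_child and not parent.delete_sent:
            parent.delete_sent = True
            # G1: tell the owner that this UserRRef is gone.
            self.outbox.append(("delete", parent.rref_id, parent.fork_id))


ctx = ForkContext()
parent = ToyUserRRef(rref_id=100, fork_id=1)
ctx.fork(parent, child_fork_id=2)
ctx.release(parent)               # too early: child 2 is not confirmed yet
sent_before_ack = list(ctx.outbox)
ctx.on_child_ack(2)               # now the delete message may go out
```

In this sketch the delete message is held back until both conditions are met:
local code has released the parent, and every fork created from it has been
acknowledged.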

Protocol Scenarios
^^^^^^^^^^^^^^^^^^

Let's now discuss how the above designs translate to the protocol in four
scenarios.

User Share RRef with Owner as Return Value
------------------------------------------


.. code::

  import torch
  import torch.distributed.rpc as rpc

  # on worker A
  rref = rpc.remote('B', torch.add, args=(torch.ones(2), 1))
  # say the rref has RRefId 100 and ForkId 1
  rref.to_here()


In this case, the ``UserRRef`` is created on the user worker A, then it is
passed to the owner worker B together with the remote message, and then B
creates the ``OwnerRRef``. The method :meth:`~torch.distributed.rpc.remote`
returns immediately, meaning that the ``UserRRef`` can be forked/used before
the owner knows about it.

When the owner receives the :meth:`~torch.distributed.rpc.remote` call, it
creates the ``OwnerRRef`` and returns an ACK to acknowledge ``{100, 1}``
(``RRefId``, ``ForkId``). Only after receiving this ACK can A delete its
``UserRRef``. This involves both **G1** and **G2**. **G1** is obvious. For
**G2**, the ``OwnerRRef`` is a child of the ``UserRRef``, and the ``UserRRef``
is not deleted until it receives the ACK from the owner.

.. image:: https://user-images\.githubusercontent\.com/16999635/69164772-98181300-0abe-11ea-93a7-9ad9f757cd94.png
    :alt: user_to_owner_ret.png
    :width: 500 px

The diagram above shows the message flow, where solid arrows carry user
functions and dashed arrows are builtin messages. Note that the first two
messages from A to B (:meth:`~torch.distributed.rpc.remote` and
:meth:`~torch.distributed.rpc.RRef.to_here`) may
arrive at B in any order, but the final delete message will only be sent out
when:

- B acknowledges ``UserRRef {100, 1}`` (G2), and
- Python GC agrees to delete the local ``UserRRef`` instance. This occurs when
  the RRef is no longer in scope and is eligible for garbage collection.

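One way to picture how an owner can tolerate either arrival order of the first
two messages is a placeholder that is created by whichever message arrives
first. The sketch below uses a standard-library ``Future`` for that
placeholder. ``ToyOwnerWorker``, ``on_remote`` and ``on_to_here`` are
hypothetical names, and this is an assumption made for illustration, not a
description of the framework's internals.

```python
import operator
from concurrent.futures import Future


class ToyOwnerWorker:
    """Hypothetical owner that tolerates either arrival order."""

    def __init__(self):
        self.slots = {}  # RRefId -> Future that will hold the value

    def _slot(self, rref_id):
        # Whichever message arrives first creates the placeholder.
        return self.slots.setdefault(rref_id, Future())

    def on_remote(self, rref_id, func, args):
        # Run the user function and publish its result.
        self._slot(rref_id).set_result(func(*args))

    def on_to_here(self, rref_id):
        # The caller waits on this future; it resolves once on_remote ran.
        return self._slot(rref_id)


worker_b = ToyOwnerWorker()
pending = worker_b.on_to_here(100)             # to_here overtakes remote
was_ready = pending.done()
worker_b.on_remote(100, operator.add, (1, 2))  # the remote message arrives
value = pending.result(timeout=1)
```

Here the early ``to_here`` request simply waits until the value exists, so the
order in which the two messages are processed does not change the result.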


User Share RRef with Owner as Argument
--------------------------------------

.. code::

  import torch
  import torch.distributed.rpc as rpc

  # on worker A and worker B
  def func(rref):
    pass

  # on worker A
  rref = rpc.remote('B', torch.add, args=(torch.ones(2), 1))
  # say the rref has RRefId 100 and ForkId 1
  rpc.rpc_async('B', func, args=(rref, ))


In this case, after creating the ``UserRRef`` on A, A uses it as an argument in
a followup RPC call to B. A will keep ``UserRRef {100, 1}`` alive until it
receives the acknowledgement from B (**G2**, not the return value of the RPC
call). This is necessary because A should not send out the delete message until
all previous messages are received; otherwise, the ``OwnerRRef`` could be
deleted before usage, as we do not guarantee message delivery order. This is
done by creating a child ``ForkId`` of the RRef and holding the RRef in a map
until the owner confirms the child ``ForkId``. The figure below shows the
message flow.

.. image:: https://user-images.githubusercontent.com/16999635/69164845-b67e0e80-0abe-11ea-93fa-d24674e75a2b.png
    :alt: user_to_owner_arg.png
    :width: 500 px


Note that the ``UserRRef`` could be deleted on B before func finishes or even
starts. However, this is OK, as at the time B sends out the ACK for the child
``ForkId``, it has already acquired the ``OwnerRRef`` instance, which prevents
the ``OwnerRRef`` from being deleted too soon.


Owner Share RRef with User
--------------------------

Owner to user is the simplest case, where the owner can update reference
counting locally and does not need any additional control message to notify
others. Regarding **G2**, it is the same as if the parent received the ACK from
the owner immediately, as the parent is the owner.

.. code::

  import torch
  import torch.distributed.rpc as rpc
  from torch.distributed.rpc import RRef

  # on worker B and worker C
  def func(rref):
    pass

  # on worker B, creating a local RRef
  rref = RRef("data")
  # say the rref has RRefId 100
  rpc.rpc_async('C', func, args=(rref, ))


.. image:: https://user-images.githubusercontent.com/16999635/69164921-c990de80-0abe-11ea-9250-d32ad00cf4ae.png
    :alt: owner_to_user.png
    :width: 500 px

The figure above shows the message flow. Note that when the ``OwnerRRef`` exits
scope after the rpc_async call, it will not be deleted, because internally
there is a map that keeps it alive if there are any known forks, which in this
case is ``UserRRef {100, 1}``. (**G2**)

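The effect of such a map on object lifetime can be observed in plain Python.
The sketch below is a hypothetical illustration, not the framework's data
structure: ``ToyOwnerRRef``, ``forks``, ``share_with_user`` and
``on_user_delete`` are made-up names, and a weak reference is used only to
observe whether the object is still alive.

```python
import gc
import weakref


class ToyOwnerRRef:
    """Hypothetical stand-in for an OwnerRRef (not the PyTorch class)."""

    def __init__(self, value):
        self.value = value


# Assumed owner-side map: (RRefId, ForkId) -> the OwnerRRef kept alive.
forks = {}


def share_with_user(owner_rref, rref_id, fork_id):
    # The owner is the caller, so it records the new fork locally.
    forks[(rref_id, fork_id)] = owner_rref


def on_user_delete(rref_id, fork_id):
    # Delete message from the user; popping is safe to repeat.
    forks.pop((rref_id, fork_id), None)


rref = ToyOwnerRRef("data")
probe = weakref.ref(rref)        # lets us observe the object's lifetime
share_with_user(rref, 100, 1)
del rref                         # the local name goes out of scope
gc.collect()
alive_while_forked = probe() is not None
on_user_delete(100, 1)           # the only known fork is deleted
gc.collect()
alive_after_delete = probe() is not None
```

As long as the map holds an entry for a known fork, dropping the local name does
not free the object. Once the fork's delete message removes the entry, nothing
else refers to it and it can be collected.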

User Share RRef with User
-------------------------

This is the most complicated case, where the caller user (parent ``UserRRef``),
the callee user (child ``UserRRef``), and the owner all need to get involved.

.. code::

  import torch
  import torch.distributed.rpc as rpc

  # on worker A and worker C
  def func(rref):
    pass

  # on worker A
  rref = rpc.remote('B', torch.add, args=(torch.ones(2), 1))
  # say the rref has RRefId 100 and ForkId 1
  rpc.rpc_async('C', func, args=(rref, ))

.. image:: https://user-images.githubusercontent.com/16999635/69164971-d6adcd80-0abe-11ea-971d-6b7af131f0fd.png
    :alt: user_to_user.png
    :width: 500 px

When C receives the child ``UserRRef`` from A, it sends out a fork request to
the owner B. Later, when B confirms the ``UserRRef`` on C, C will perform
two actions in parallel: 1) send out the child ACK to A, and 2) run the
user-provided function. During this time, the parent (A) will hold its
``UserRRef {100, 1}`` alive to achieve **G2**.
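
The message flow of this scenario can be replayed with a tiny queue-driven
simulation. It is a simplified sketch under stated assumptions: the message
names (``rpc``, ``fork_request``, ``fork_ack``, ``child_ack``) and the child
``ForkId`` 2 are hypothetical, the owner is assumed to already know
``UserRRef {100, 1}``, and the two parallel actions on C are modelled
sequentially.

```python
from collections import deque


def simulate_user_to_user():
    """Replay the A -> C fork of RRef 100 with hypothetical message names."""
    owner_forks = {(100, 1)}       # B already knows A's UserRRef (assumed)
    parent_pending = {(100, 2)}    # A keeps {100, 1} alive for this child
    ran_user_function = False
    trace = []
    queue = deque([("A", "C", "rpc", (100, 2))])
    while queue:
        src, dst, kind, fork = queue.popleft()
        trace.append(f"{src}->{dst}: {kind}")
        if kind == "rpc":
            # C received the child UserRRef and asks the owner to confirm it.
            queue.append(("C", "B", "fork_request", fork))
        elif kind == "fork_request":
            owner_forks.add(fork)
            queue.append(("B", "C", "fork_ack", fork))
        elif kind == "fork_ack":
            # Confirmed by the owner: ACK the parent and run the function.
            queue.append(("C", "A", "child_ack", fork))
            ran_user_function = True
        elif kind == "child_ack":
            # A may now drop the entry that kept its UserRRef alive.
            parent_pending.discard(fork)
    return trace, owner_forks, parent_pending, ran_user_function


trace, owner_forks, parent_pending, ran_user_function = simulate_user_to_user()
```

In the simulation the owner learns about the child fork before A releases the
entry that keeps ``UserRRef {100, 1}`` alive, which is the ordering that **G2**
requires.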