z=x*y+3,
what if there is another function that does:
w=x+2*y
and then both functions do a backward pass (simultaneously, perhaps in different threads)? Then it seems dangerous to collect the results of the backward pass (the partial derivatives) in the shared variables x and y and expose them through x.get_grad() and y.get_grad(). Imho, a better design would let you ask z.get_grad(x) and z.get_grad(y), and w.get_grad(x) and w.get_grad(y), so the partial derivatives stay with the output they belong to.
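Roughly what I have in mind, as a toy sketch (Var and get_grad are made-up names here, not any real library's API):

    # toy reverse-mode autodiff where gradients live on the output node,
    # not on the shared leaves -- purely illustrative
    class Var:
        def __init__(self, value, parents=(), local_grads=()):
            self.value = value
            self._parents = parents          # upstream Vars
            self._local_grads = local_grads  # d(self)/d(parent) for each parent
            self._grads = {}                 # filled by backward(), keyed by Var

        def backward(self):
            # reverse topological order of the subgraph feeding this output
            topo, seen = [], set()
            def visit(node):
                if node not in seen:
                    seen.add(node)
                    for p in node._parents:
                        visit(p)
                    topo.append(node)
            visit(self)
            self._grads = {self: 1.0}
            for node in reversed(topo):
                g = self._grads.get(node, 0.0)
                for parent, local in zip(node._parents, node._local_grads):
                    self._grads[parent] = self._grads.get(parent, 0.0) + g * local

        def get_grad(self, var):
            return self._grads.get(var, 0.0)

    def mul(a, b): return Var(a.value * b.value, (a, b), (b.value, a.value))
    def add(a, b): return Var(a.value + b.value, (a, b), (1.0, 1.0))

    x, y = Var(2.0), Var(3.0)
    z = add(mul(x, y), Var(3.0))   # z = x*y + 3
    w = add(x, mul(Var(2.0), y))   # w = x + 2*y
    z.backward(); w.backward()
    print(z.get_grad(x), z.get_grad(y))  # 3.0 2.0
    print(w.get_grad(x), w.get_grad(y))  # 1.0 2.0

Because each output keeps its own gradient map, the two backward passes can't clobber each other, even if they run concurrently.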
However, I think PyTorch uses the shared-variable approach (?); at least, their docs say something along these lines:
"This function accumulates gradients in the leaves - you might need to zero .grad attributes or set them to None before calling it." - https://docs.pytorch.org/docs/stable/generated/torch.autogra...
The Rust burn crate does this better; it stores the backprop'd gradients in a separate container and returns them: https://github.com/tracel-ai/burn/blob/af381ee18566fc27f5c98...
I wanted to store the graph on the heap so I could send it to the GPU later on, but then I got lazy and abandoned it. But you always learn something. :)