PyTorch Internals: Tensors and Autograd
For our purposes, PyTorch = tensor + autograd
Wrappers on Wrappers on Wrappers
From low-level to high-level:
Tensor
Storage and Strides
Strides is a tuple that specifies, for each dimension of the tensor, how many elements you need to skip in the underlying storage to move to the next index along that dimension (e.g. the next row or the next column).
When a tensor is transposed (or otherwise re-viewed with a different shape), only its sizes and strides change; the underlying storage is untouched.
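A quick sketch of this (the concrete values assume a small 2×3 example tensor):

```python
import torch

t = torch.arange(6).reshape(2, 3)
print(t.stride())                      # (3, 1): skip 3 elements for the next row, 1 for the next column

tt = t.t()                             # transpose: sizes and strides are swapped
print(tt.stride())                     # (1, 3)
print(t.data_ptr() == tt.data_ptr())   # True: both tensors share the same underlying storage
```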
This leads to the notion of contiguity: a tensor is contiguous if and only if its elements are laid out in storage from the outermost dimension to the innermost, i.e. the ordinary row-major memory layout.
Why does contiguity matter? Some operations, such as `view`, only work on contiguous tensors.
`view` vs. `reshape`:
- When the tensor is contiguous, both return a new tensor backed by the same original storage
- When it is non-contiguous, `view` raises an error, while `reshape` copies the data into a new contiguous storage and returns a tensor over that copy
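A small sketch of the difference on a transposed (hence non-contiguous) tensor:

```python
import torch

t = torch.arange(6).reshape(2, 3)
tt = t.t()                     # transposed, so no longer contiguous
print(tt.is_contiguous())      # False

try:
    tt.view(-1)                # view requires contiguous storage
except RuntimeError as e:
    print("view failed:", e)

flat = tt.reshape(-1)          # reshape falls back to copying into a new contiguous storage
print(flat)                    # tensor([0, 3, 1, 4, 2, 5])
```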
Dispatching Tensor Operations
- Dispatched dynamically (at run time) according to device
- Then dispatched according to dtype
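A conceptual sketch of this two-level dispatch in plain Python (the table and kernel names are made up for illustration; PyTorch's real dispatcher lives in C++):

```python
# Hypothetical kernels, one per (device, dtype) combination
def add_cpu_float(a, b):
    return [x + y for x, y in zip(a, b)]

def add_cpu_int(a, b):
    return [int(x + y) for x, y in zip(a, b)]

DISPATCH_TABLE = {
    ("cpu", "float"): add_cpu_float,
    ("cpu", "int"): add_cpu_int,
    # a CUDA build would also register ("cuda", "float"), ("cuda", "int"), ...
}

def add(device, dtype, a, b):
    kernel = DISPATCH_TABLE[(device, dtype)]   # pick by device first, then by dtype
    return kernel(a, b)

print(add("cpu", "float", [1.0, 2.0], [3.0, 4.0]))  # [4.0, 6.0]
```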
Extending Tensors
Method 1: The Extension Point Trinity
Device (CPU, cuda, xla, hip, fpga…) × Layout (strided, sparse, mkldnn…) × dtype (float, double, int, long, bool…)
Write a respective kernel for each of these combinations
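One way to register such per-device kernels from Python is the `torch.library` API. The sketch below registers a CPU kernel for a made-up `myops::myadd` operator; the exact API surface varies across PyTorch versions, so treat this as an assumption rather than the canonical recipe:

```python
import torch

lib = torch.library.Library("myops", "DEF")          # create a new operator namespace
lib.define("myadd(Tensor a, Tensor b) -> Tensor")    # declare the operator's schema

def myadd_cpu(a, b):
    # the CPU kernel; a CUDA extension would register another impl under "CUDA"
    return a + b

lib.impl("myadd", myadd_cpu, "CPU")

x, y = torch.randn(3), torch.randn(3)
print(torch.ops.myops.myadd(x, y))
```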
Method 2: A Python Wrapper Class around Tensor
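A minimal sketch of such a wrapper using the `__torch_function__` protocol (the class name and the logging behaviour are illustrative, not from the original notes):

```python
import torch

class LoggingTensor(torch.Tensor):
    """Intercepts every torch operation applied to it and logs the call."""
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"intercepted {getattr(func, '__name__', func)}")
        return super().__torch_function__(func, types, args, kwargs)

x = torch.randn(3).as_subclass(LoggingTensor)
y = x * 2 + 1          # each operation is intercepted and logged before it runs
```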
Which to use? If the tensor needs to be passed through autograd, make it a PyTorch extension
Autograd
Different Methods for Calculating the Derivative
- Numerical: approximate the derivative as the limit of a difference-quotient slope
- Symbolic: similar to Wolfram; returns a symbolic expression for the derivative
- Automatic: uses the chain rule. At each step, the upstream derivative is passed in; the gradient-calculating function paired with the forward function carries the mathematical formula for that single step's derivative, plugs the numerical values in, and outputs an exact numerical downstream derivative
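For example, a custom `torch.autograd.Function` makes the single-step derivative explicit (the `Square` example here is mine, not from the original notes):

```python
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)       # backward needs x to evaluate d(x^2)/dx = 2x
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # grad_output is the upstream derivative; multiply by the local derivative 2x
        return grad_output * 2 * x

x = torch.tensor(3.0, requires_grad=True)
y = Square.apply(x)
y.backward()           # the upstream derivative dy/dy = 1 is fed in at the top
print(x.grad)          # tensor(6.)
```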
Variables’ Fields
- `data`: the underlying tensor
- `grad`: the numerical gradient of the variable, filled in after a backward pass
- `grad_fn`: the gradient-calculating function the variable should use, determined by the operation that produced it in the forward tree
- `is_leaf`: whether the variable is a leaf of the tree (created by the user rather than by an operation)
- `requires_grad`: whether a gradient needs to be computed for the variable
- `_version`: used to track whether the variable's data has changed after it was saved by `ctx`
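These fields can be inspected directly on a small example:

```python
import torch

x = torch.ones(2, requires_grad=True)    # a leaf created by the user
y = (x * 3).sum()                        # produced by operations, so it carries a grad_fn

print(x.is_leaf, x.requires_grad, x.grad_fn)   # True True None
print(y.is_leaf, y.grad_fn)                    # False <SumBackward0 ...>
print(x._version)                              # 0

y.backward()
print(x.grad)                                  # tensor([3., 3.]), filled in by the backward pass

with torch.no_grad():
    x.add_(1)                                  # an in-place change bumps the version counter
print(x._version)                              # 1
```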
The Backward Tree
A backward tree is constructed according to the forward tree.
- For each variable with a `grad_fn`, a gradient-calculating function node is added to the tree
- For each operation that needs its inputs to calculate the gradient (e.g. multiplication), the operation saves the input variables and attaches them to the corresponding gradient-calculating function node through a context variable `ctx`
- For each variable with `is_leaf == True` and `requires_grad == True`, `AccumulateGrad` is called, which takes the gradient from the previous gradient-calculating function node, accumulates it, and saves it to the variable's `grad` field
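The resulting structure can be peeked at through `grad_fn` and its `next_functions`:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
y = x * w                        # multiplication saves its inputs for the backward pass

print(y.grad_fn)                 # <MulBackward0 ...>, the gradient-calculating node for y
print(y.grad_fn.next_functions)  # ((<AccumulateGrad ...>, 0), (<AccumulateGrad ...>, 0))

y.backward()                     # AccumulateGrad writes the gradients into the leaves
print(x.grad, w.grad)            # tensor(3.) tensor(2.)
```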