Tinygrad Sidequest

Some random notes while exploring TinyGrad, a cool open source Tensor Compiler. Found a little bug too and patched it!

- [ ] Grok this https://github.com/tinygrad/tinygrad/pull/14186/files
    - [x] What are the inputs and outputs of this file? 
        Two Float32 nxn matrices. 
        MatMul of the two matrices
    - [x] How does this turn to binary? 
      - [x] How does renderer/amd/elf.py work?
        - [x] What does the renderer do?
          Convert linearized UOps to backend code
        - [x] What backend code is used for AMD?
          Binary
        - [x] So elf.py takes in Linearized UOps and outputs binary?
          Yes ≧◡≦
        - [x] How do they know what binary to generate? 
          They read the ISA spec (for a specific architecture that includes multiple GPUs)
        - [x] Where is the ISA spec?
            https://docs.amd.com/api/khub/documents/uQpkEvk3pv~kfAb2x~j4uw/content
        - [x] Can I run this? 
            Not on my mac.
        - [x] Could I run the CDNA4 mat mul? 
            Not without 90k for a contracted term lol, or an in for ssh from george
        - [ ] How can I run the RDNA one? 
            I can offer to pay someone for ssh access on the discord, or put together my own machine
- [x] What is [this](https://github.com/tinygrad/tinygrad/pull/15492) PR doing?
  - [x] What is the purpose of where_on_load?
    To bind to conditional load UOps
  - [x] Why is or_cast added to where_on_load?
    where_on_load receives the matched node as input. or_cast holds the cast node if one is matched.
  - [x] Why did they extract idx to a separate line? 
    Prob would have exceeded the ruff line limit lol
  - [x] What does idx mean? 
    idx of data (INDEX UOp) from memory buffer that LOAD reads from. (also contains validity mask)
  Tinygrad folds conditional BUFFER reads into the validity mask of INDEX Ops. This updates the patterns to match casted reads.
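
A toy sketch of the fold described above, to make the role of or_cast concrete. Everything here is an assumption for illustration: the IR is nested tuples rather than tinygrad's UOp class, the function only borrows the name where_on_load, and the real pattern (a UPat) handles more cases, such as an INDEX that already carries a mask.

```python
# Toy IR made of tuples (hypothetical, NOT tinygrad's UOp):
#   ("INDEX", buf, idx)           unmasked index
#   ("INDEX", buf, idx, valid)    index with a validity mask
#   ("LOAD", index), ("CAST", dtype, src), ("WHERE", cond, a, b)

def where_on_load(node):
    """Fold WHERE(cond, LOAD(...), 0) or WHERE(cond, CAST(LOAD(...)), 0) into a masked INDEX."""
    if not isinstance(node, tuple) or node[0] != "WHERE" or node[3] != 0:
        return node
    _, cond, taken, _ = node
    if not isinstance(taken, tuple):
        return node
    # or_cast: remember the CAST wrapper if there is one, then look through it
    or_cast = taken if taken[0] == "CAST" else None
    load = taken[2] if or_cast else taken
    if not isinstance(load, tuple) or load[0] != "LOAD":
        return node
    index = load[1]
    if index[0] != "INDEX" or len(index) != 3:
        return node  # this sketch does not merge with an existing mask
    _, buf, idx = index
    masked = ("LOAD", ("INDEX", buf, idx, cond))
    # put the cast back on the outside if the original read was casted
    return ("CAST", or_cast[1], masked) if or_cast else masked

plain = ("WHERE", "c", ("LOAD", ("INDEX", "buf", "i")), 0)
casted = ("WHERE", "c", ("CAST", "float", ("LOAD", ("INDEX", "buf", "i"))), 0)
print(where_on_load(plain))
print(where_on_load(casted))
```

Before the PR only the first shape would have matched; the second is the "casted read" case.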
- [x] What is [this](https://github.com/tinygrad/tinygrad/pull/15512) PR doing?
    - [x] What is the change in ops.py doing?
      - [x] What function is the change contained in?
        _min_max(self) -> tuple[PyConst, PyConst]
      - [x] What does _min_max(self) do?
        - [x] What is the input of the function?
            A UOp Node
        - [x] What is the output of the function
            A Tuple representing (min_val,max_val)
        - [x] What is a PyConst
            A type alias for int, bool, and float.
        - [x] Why must the value of output be of type PyConst?
            Because this is python code, not GPU code, so any bounds must be represented as a scalar const that python supports.
      - [x] Why is the _min_max(self) function useful?
        Because knowing the range of values a UOp is constrained to is useful when optimizing the IR 
        (e.g. if the val is above 0, you would ignore above 0 checks)
        If the min and max of s1 are both -1, then s1 _is_ -1.
        If s0's min and max are both ints, then s0 is an int.
        If the op is XOR, then `s0 ^ -1` is a bitwise NOT of s0 (XOR with -1 flips every bit).
        So the min and max can be tightened to the bitwise NOT of s0's bounds (NOT is decreasing, so the two swap),
        instead of defaulting to the min and max of the dtype, which was the behavior before the PR.
   - [x] What is the change in symbolic.py doing?
     - [x] How is this matching XOR?
       ^ _is_ XOR in py.
    Simplifying XOR(y, XOR(y, x)) to x.
  Adding in XOR-specific code to symbolic and _min_max.
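
Both facts can be checked with plain Python ints. This is a minimal sketch, not the code from the PR; `xor_neg1_bounds` is a made-up helper name.

```python
def xor_neg1_bounds(s0_min: int, s0_max: int) -> tuple[int, int]:
    """Bounds of s0 ^ -1 given integer bounds on s0 (hypothetical helper)."""
    # x ^ -1 == ~x == -x - 1, which is strictly decreasing, so the endpoints swap
    return ~s0_max, ~s0_min

lo, hi = xor_neg1_bounds(3, 10)
print(lo, hi)  # -11 -4

# every value in the input range lands inside the computed bounds
in_bounds = all(lo <= (x ^ -1) <= hi for x in range(3, 11))

# XOR is its own inverse, which is what the symbolic.py rule relies on
cancels = all(y ^ (y ^ x) == x for x in range(-8, 8) for y in range(-8, 8))
print(in_bounds, cancels)  # True True
```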
- [x] What is [this](https://github.com/tinygrad/tinygrad/pull/15481) PR doing?
  - [x] What does it mean to cast an Expand?
    - [x] What is EXPAND?
        - [x] What front end code generates the Expand Op?  
            Tensor([[1, 2, 3]]).expand(4,3)
        - [x] What will a parent of EXPAND see in the UOp DAG?
            UOps primarily rely on dtype and shape. The parent will see the expanded shape of the EXPAND with the dtype of the child of EXPAND.
        - [x] How is expand represented in the IR?  
            EXPAND(arg=(4, 3)) <- COPY(to METAL) <- RESHAPE(arg=(1, 3)) <- BUFFER(dtype=int, size=3, device=PYTHON)
        A UOp that moves data into a new arrangement by tiling memory (ultimately on the GPU). It takes in a source UOp and a shape to broadcast it to.
    It means to have a CAST UOp as the parent of an EXPAND UOp.
  - [x] Why was EXPAND being casted in _mixin (through broadcasted)?
    - [x] What does it mean for EXPAND to have a backward "sum reduction"?
        - [x] How should I visualize this? 
            Start with root. To left of root are operations farthest down in python file. Nodes at left of graph are at top of python file. (obviously this is only for func prog with 1 op per line)
        You SUM the gradients of expand's children (which are available b/c they are after expand in the graph) to account for the fact that expand duplicates its children. # note "child" here follows viz / dataflow convention 
    - Reduce loss of precision.
  Moving the cast to .gradient so that you only see the CAST Ops in the UOp DAG when you call .backward() (as opposed to just calling .expand). And it does it with a UPat!
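
A pure-Python sketch of the "backward is a sum reduction" idea, and of why the accumulation dtype matters. It uses nested lists instead of tensors, and the helper names are made up. float16 is emulated with `struct`'s `"e"` format; this only illustrates the rounding problem and is not how tinygrad implements the cast.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE float16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def expand_rows(row, n):
    """Forward: broadcast one row of shape (k,) to shape (n, k)."""
    return [list(row) for _ in range(n)]

def expand_backward(grad_out, accumulate=lambda acc, g: acc + g):
    """Backward: sum the incoming gradient over the broadcast axis."""
    grad_in = [0.0] * len(grad_out[0])
    for grad_row in grad_out:
        grad_in = [accumulate(acc, g) for acc, g in zip(grad_in, grad_row)]
    return grad_in

n = 5000
y = expand_rows([1.0, 2.0, 3.0], n)
grad_y = [[1.0] * 3 for _ in range(n)]  # gradient of sum(y) w.r.t. y

# accumulate in Python floats (float64)
wide = expand_backward(grad_y)
# round to float16 after every add, as a half-precision accumulator would
narrow = expand_backward(grad_y, lambda acc, g: to_half(acc + g))
print(wide, narrow)
```

The wide sum gives 5000.0 per element. The half-precision accumulator falls short because float16 cannot represent every integer that large, so late additions round away.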
- [x] I need to understand [this](https://github.com/tinygrad/tinygrad/pull/15463) pr
  This commit adds `weakint` to the promotion lattice, which is an object used for [type promotion](https://docs.jax.dev/en/latest/jep/9407-type-promotion.html).

The promotion lattice is a hashmap that defines a DAG of dtype nodes. A helper finds the LCA (the narrowest dtype that every input can promote to).

Why does this not use the LeetCode algo for LCA?

- The LeetCode algo traverses from root to leaf; this one goes from the leaves up to the LCA.
    - Is this less efficient, but that doesn't matter because n is small?
        - Yes
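
A minimal sketch of the leaves-up approach on a toy lattice. The edges below are invented for illustration and are not tinygrad's real table, and the helper names are assumptions. The idea: collect everything reachable upward from each input, intersect the sets, and keep the narrowest survivor.

```python
from functools import cache

# Toy promotion lattice: each dtype lists the dtypes it promotes to directly.
# These edges are made up for the example, not copied from tinygrad.
PROMO = {
    "bool": ["weakint"],
    "weakint": ["int8", "uint8"],
    "int8": ["int16"],
    "uint8": ["int16"],
    "int16": ["int32"],
    "int32": ["int64"],
    "int64": [],
}
# Lower rank = narrower dtype; used to pick the least element of a set.
PRIORITY = {name: rank for rank, name in enumerate(PROMO)}

@cache
def reachable(dtype: str) -> frozenset:
    """The dtype itself plus everything above it in the lattice."""
    above = {dtype}
    for parent in PROMO[dtype]:
        above |= reachable(parent)
    return frozenset(above)

def least_upper_dtype(*dtypes: str) -> str:
    """Walk up from every input, intersect, and keep the narrowest common dtype."""
    common = frozenset.intersection(*(reachable(d) for d in dtypes))
    return min(common, key=PRIORITY.__getitem__)

print(least_upper_dtype("int8", "uint8"))    # int16
print(least_upper_dtype("int8", "weakint"))  # int8: the weak int does not widen the tensor
print(least_upper_dtype("int8", "int32"))    # int32: a concrete int32 does
```

With n this small, recomputing set intersections is cheap, and `cache` means each dtype's upward closure is only built once.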

Why would the user create a weak int instead of defining their dtype up front?

- It is what python int defaults to.
    - Why? Why not default python int to something?
        - Annoying upcasting. `Tensor([1.0, 2.0]).half() + 3` could cast the Tensor's float16 to float32 if you defaulted 3 to an int32 instead of a weakint.

Are there any ML frameworks where all dtypes must be defined or they will error?

- There are flags in torch and jax, but otherwise no, would be too annoying.
- [x] I need to understand [this](https://github.com/tinygrad/tinygrad/pull/15356) pr
  Refactoring shared logic (`bitwise_not()` and `elementwise.py`) that touched Tinygrad's inverse/min/max functions. These functions are interesting because they are dtype dependent and also require an interesting property.

1. There is no MIN op in Tinygrad. MIN is just `inverse(MAX(inverse(X),inverse(Y)))`. Inverse must therefore be an [involution](https://en.wikipedia.org/wiki/Involution_(mathematics)).
2. bool, float, and int are stored differently in memory. Therefore inversion must be handled differently. (e.g. a bitwise not might be useful in inverting an int, but will scramble a float).
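
A small sketch of the two points above, using Python types as a stand-in for dtypes. The per-type choices here (logical not, bitwise not, negation) are my assumption for illustration; the source only says the types need different handling.

```python
def inverse(x):
    """Order-reversing involution, chosen per Python type (stand-in for dtype)."""
    if isinstance(x, bool):  # check bool first: bool is a subclass of int
        return not x
    if isinstance(x, int):
        return ~x            # bitwise NOT: ~x == -x - 1
    return -x                # floats: flip the sign

def minimum(x, y):
    """MIN built only from MAX and inverse; both operands must share a type."""
    return inverse(max(inverse(x), inverse(y)))

print(minimum(3, 7))         # 3
print(minimum(-2.5, 1.0))    # -2.5
print(minimum(True, False))  # False
```

The trick works because `inverse` flips the ordering (so the max of the inverses is the inverse of the min) and undoes itself when applied twice. For fixed-width ints, bitwise NOT has the nice property of mapping the dtype's range onto itself, which plain negation does not (the most negative value has no positive counterpart).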
- [x] I need to understand [this](https://github.com/tinygrad/tinygrad/pull/15367) pr
  In order to prevent a runtime error from being raised when the user called `.gradient(some_tensor)` after writing `another_tensor.copysign(some_tensor)`, there was a hack in `copysign()` that added `other.sign()*0` to link `other` to the rest of the graph so that `.backward()` could reach `other` from the root.

Mathematically, if a tensor is not connected to the root via the backward pass, then the derivative of the root (the loss) with respect to that tensor is 0, even if no gradient was materialized for it. So the `materialize_grads` parameter was removed and gradient just returns 0 if something doesn't exist.
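
To see the "disconnected means zero" rule in isolation, here is a tiny scalar reverse-mode sketch. The `Value` class and this `gradient` function are hypothetical and unrelated to tinygrad's implementation; the only point is the last line of `gradient`, which falls back to 0 for anything the backward walk never reached.

```python
class Value:
    """Tiny scalar autodiff node (illustrative only, not tinygrad's Tensor)."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data, self.parents, self.local_grads = data, parents, local_grads

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

def gradient(root, targets):
    """Return d(root)/d(target) for each target; unreachable targets get 0."""
    order, seen = [], set()

    def visit(v):
        if id(v) in seen:
            return
        seen.add(id(v))
        for p in v.parents:
            visit(p)
        order.append(v)  # post-order: inputs are appended before their consumers

    visit(root)
    grads = {id(root): 1.0}
    for v in reversed(order):  # consumers are processed before their inputs
        g = grads.get(id(v), 0.0)
        for p, local in zip(v.parents, v.local_grads):
            grads[id(p)] = grads.get(id(p), 0.0) + g * local
    # a target the walk never touched has no entry, so its gradient is 0
    return [grads.get(id(t), 0.0) for t in targets]

a, b, c = Value(2.0), Value(3.0), Value(5.0)
loss = a * b + a              # c is never used
print(gradient(loss, [a, b, c]))  # [4.0, 2.0, 0.0]
```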