Provide an example of a kernel that reached high Tensor Core utilization. How did you accomplish it? How would you get the roofline graph of a kernel (that is, work out whether the kernel is compute-bound or memory-bound)? Provide me an example where PTX programming is necessary (I felt like I really impressed my interviewer here by talking about the many PTX-exclusive instructions available on NVIDIA GPUs). Do you have experience coding GPU-specialised algorithms like FlashAttention? Why are vectorised loads better than non-vectorised loads? (They can use a single 128-bit load instruction instead of several narrower ones.) And so on. It all really focused on proving my knowledge of how GPUs work and how to program them.
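For anyone wondering what the roofline question boils down to, here is a minimal sketch of the arithmetic behind it, not what a profiler actually does internally. The peak numbers (`PEAK_FLOPS`, `PEAK_BW`) are made-up values for a hypothetical GPU, and the helper names (`roofline`, `gemm_counts`, `axpy_counts`) are my own. The FLOP and byte counts assume ideal data reuse, so treat the results as rough estimates.

```python
def roofline(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Classify a kernel with the basic roofline model.

    Returns (arithmetic intensity in FLOP/byte,
             attainable FLOP/s under the model,
             "compute-bound" or "memory-bound").
    """
    intensity = flops / bytes_moved
    # Ridge point: the intensity at which the memory roof meets the compute roof.
    ridge = peak_flops / peak_bandwidth
    attainable = min(peak_flops, intensity * peak_bandwidth)
    bound = "compute-bound" if intensity >= ridge else "memory-bound"
    return intensity, attainable, bound


# Hypothetical device peaks (assumptions, not a real GPU's spec sheet).
PEAK_FLOPS = 100e12  # 100 TFLOP/s
PEAK_BW = 1e12       # 1 TB/s


def gemm_counts(n, bytes_per_elem=2):
    """Square n x n matrix multiply: 2*n^3 FLOPs.

    Assumes A and B are each read once and C is written once.
    """
    return 2 * n**3, 3 * n * n * bytes_per_elem


def axpy_counts(n, bytes_per_elem=4):
    """y = a*x + y: 2 FLOPs per element; read x, read y, write y."""
    return 2 * n, 3 * n * bytes_per_elem


if __name__ == "__main__":
    for name, counts in [("gemm", gemm_counts(4096)), ("axpy", axpy_counts(1 << 20))]:
        intensity, attainable, bound = roofline(*counts, PEAK_FLOPS, PEAK_BW)
        print(f"{name}: {intensity:.2f} FLOP/byte, {attainable / 1e12:.2f} TFLOP/s attainable, {bound}")
```

With these assumed peaks, the large matrix multiply lands to the right of the ridge point (compute-bound) and the AXPY lands far to the left (memory-bound). In practice you would take the measured FLOP and byte counts from a profiler rather than from formulas like these.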