Data movement: Moving data between the GPU and CPU incurs overhead.
Compute Unified Device Architecture Assignment Answers
Question:
a) What are CUDA’s limitations on inter-thread (between-thread) communication?
Compute Unified Device Architecture Answer and Explanation
3. Limited Communication Across Blocks: CUDA threads in different blocks cannot directly communicate or synchronize with each other; `__syncthreads()` is a barrier only within a single block. Inter-block communication typically goes through global memory, which has higher latency than shared memory.
4. Global Memory Access: While global memory allows communication between threads in different blocks, it's significantly slower than shared memory due to its higher latency and lower bandwidth.
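
To illustrate points 3 and 4, here is a minimal sketch of one common way to combine results from different blocks. Each block first reduces its own slice in shared memory, using `__syncthreads()` as a block-local barrier, and then one thread per block publishes the partial sum to global memory with `atomicAdd`. The names `blockSumKernel`, `sumAcrossBlocks`, and `kThreadsPerBlock` are hypothetical and chosen only for this example, and the block size of 256 is an assumption.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreadsPerBlock = 256;  // assumed block size (must be a power of two here)

// Each block sums its slice in shared memory, then combines with other
// blocks through a single global-memory counter.
__global__ void blockSumKernel(const int* in, int n, int* total) {
    __shared__ int partial[kThreadsPerBlock];  // visible only inside this block

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    partial[tid] = (gid < n) ? in[gid] : 0;
    __syncthreads();  // barrier for the threads of this block only

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    // Inter-block step: blocks cannot see each other's shared memory,
    // so the partial sums meet in global memory via an atomic add.
    if (tid == 0) {
        atomicAdd(total, partial[0]);
    }
}

// Hypothetical host helper: returns false if any CUDA call fails.
bool sumAcrossBlocks(const std::vector<int>& host, int* result) {
    int n = static_cast<int>(host.size());
    if (n == 0) { *result = 0; return true; }

    int* dIn = nullptr;
    int* dTotal = nullptr;
    bool ok = cudaMalloc(&dIn, n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&dTotal, sizeof(int)) == cudaSuccess
           && cudaMemcpy(dIn, host.data(), n * sizeof(int),
                         cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemset(dTotal, 0, sizeof(int)) == cudaSuccess;

    if (ok) {
        int blocks = (n + kThreadsPerBlock - 1) / kThreadsPerBlock;
        blockSumKernel<<<blocks, kThreadsPerBlock>>>(dIn, n, dTotal);
        // The blocking cudaMemcpy waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(result, dTotal, sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dIn);
    cudaFree(dTotal);
    return ok;
}
```

Note that only one thread per block touches global memory for the combine step. Keeping most of the work in shared memory and limiting global-memory traffic to a single atomic per block is one way to reduce the cost described in point 4. The host-to-device and device-to-host copies are also where the CPU/GPU data movement overhead shows up.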