GPU programming employs a highly aggressive parallelism model. Whoever sets out to program GPUs has to exploit two levels of parallelism: the coarse grained and the fine grained model.
The coarse grained model is the one that provides scalability. It relies on blocks of threads that have no way to synchronize with one another, so they are able to execute independently in parallel, feeding all the available compute units of the hardware device. If the programmer has specified enough of them in a kernel invocation, they will keep the hardware highly utilized. It resembles programming the CPU at the thread level, although without any efficient option to synchronize multiple threads with each other.
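As a rough illustration, here is a minimal CUDA sketch (the kernel name, sizes and launch configuration are assumptions for illustration only): the grid is sized so that enough independent blocks are created to cover the whole problem and keep the compute units busy.

```cuda
#include <cuda_runtime.h>

// Hypothetical example kernel: each block processes an independent slice of the
// array, so blocks can be scheduled on any compute unit in any order.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Coarse grained parallelism: request enough blocks to cover the whole
    // problem; the hardware distributes them over all available compute units.
    int threadsPerBlock = 256;                          // a multiple of the 32-wide warp
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```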
Fine grained parallelism is like programming down to the SIMD level provided by the SSE or AVX instructions available on modern Intel & AMD x86 CPUs. However, programming is considerably more flexible: with the so called SIMT (Single Instruction Multiple Thread) model of the CUDA or OpenCL programming environments, one can write code without even being aware of the SIMD nature of the hardware. NVidia GPUs' SIMD width is 32 elements while AMD GPUs' width is 64 elements. In practice, the thread block size should be a multiple of this number, especially on NVidia GPUs, because memory access and pipeline latencies can only be hidden when a large number of threads reside on the compute unit. At this parallelism level, threads are able to synchronize and communicate by exchanging data through the fast shared memory.
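The sketch below is a hypothetical example of this block-level cooperation: the threads of one block stage data in shared memory, synchronize with __syncthreads(), and perform a simple tree reduction. The kernel name and sizes are assumptions made for illustration only.

```cuda
#include <cuda_runtime.h>

// Fine grained parallelism: threads within a single block cooperate through
// fast shared memory and synchronize with __syncthreads().
__global__ void blockSum(const float *in, float *blockResults, int n)
{
    extern __shared__ float tile[];          // one element per thread in the block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // all loads visible before reducing

    // Tree reduction inside the block; blockDim.x is assumed to be a power of
    // two and a multiple of the 32-wide warp.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockResults[blockIdx.x] = tile[0];
}
```

A launch such as blockSum<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_partial, n) would allocate one float of shared memory per thread of the block.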
In this sense, GPUs take parallelism to extreme levels. A modern GPU may require more than a thousand resident threads per compute unit to stay utilized, and it may consist of dozens of compute units.
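These figures can be checked on a concrete device through the CUDA runtime API; the short program below (an illustrative sketch querying device 0) prints the number of compute units and the maximum number of resident threads each one supports.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Number of compute units (streaming multiprocessors) and how many threads
    // each one can keep resident to hide latency.
    printf("Compute units (SMs):        %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:         %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max resident threads total: %d\n",
           prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    return 0;
}
```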
GPU programming is both fascinating and dirty!