Shuffle both chroma components together as a 16-bit unit, and
don't write the unchanged columns (like in x264_deblock_h_luma_neon
and in the aarch64 version of the function).
This causes a minor slowdown for x264_deblock_v_chroma_neon, but
it is negligible compared to the speedup.
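
The idea can be sketched in scalar C. This is only an illustration, not
the NEON code: it assumes interleaved U/V chroma, with pix pointing at the
first byte to the right of the edge, and the names load_transposed,
store_changed and ROWS are hypothetical. Each U/V pair is moved as one
16-bit element, and only the p0 and q0 columns are written back, on the
assumption that the filter leaves p1 and q1 unchanged.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROWS 8 /* hypothetical number of rows handled per call */

/* Gather the four columns p1, p0, q0, q1 across ROWS rows.
 * Each element is one interleaved U/V pair, moved as a 16-bit unit,
 * so both chroma components are shuffled together. */
static void load_transposed(const uint8_t *pix, ptrdiff_t stride,
                            uint16_t col[4][ROWS])
{
    for (int y = 0; y < ROWS; y++)
        for (int x = 0; x < 4; x++)
            memcpy(&col[x][y], pix + y * stride - 4 + 2 * x, 2);
}

/* Write back only the columns assumed to be modified (p0 and q0).
 * The outer columns p1 and q1 are never stored. */
static void store_changed(uint8_t *pix, ptrdiff_t stride,
                          uint16_t col[4][ROWS])
{
    for (int y = 0; y < ROWS; y++) {
        memcpy(pix + y * stride - 2, &col[1][y], 2); /* p0 */
        memcpy(pix + y * stride,     &col[2][y], 2); /* q0 */
    }
}
```

A filter step would sit between the two calls and operate on col[1] and
col[2], reading col[0] and col[3] as inputs only.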