Phenom CPU optimizations
Faster hpel_filter by using unaligned loads instead of emulated PALIGNR
Faster hpel_filter on 64-bit by using the 32-bit version (the cost of emulated PALIGNR is high enough that the savings from caching intermediate values is not worth it).
Add support for misaligned_mask on Phenom: ~2% faster hpel_filter, ~4% faster width16 multisad, 7% faster width20 get_ref.
Replace width12 mmx with width16 sse on Phenom and Nehalem: 32% faster width12 get_ref on Phenom.
Merge cpu-32.asm and cpu-64.asm
Thanks to Easy123 for contributing a Phenom box for a weekend so I could write these optimizations.