Marco Costalba [Mon, 29 Aug 2016 07:11:20 +0000 (09:11 +0200)]
Use per-thread counterMoveHistory
Drops a scalability bottleneck due to memory contention
on a single table shared across threads. The effect starts
to become noticeable with a high number of threads. Specifically,
we have a small regression with 7 threads, both at 60 and
180 seconds TC:
We have a regression because the counterMoveHistory table is
quite big and it takes time for a single thread to fill it.
Sharing the table yields a higher fill rate and better
quality of moves, and up to 7 threads the benefits of sharing
more than compensate for the loss in speed due to contention.
Interestingly, even with a 3x longer TC, so with more time
for the single thread to catch up, the improvement is quite
limited and below noise level. It seems we really need a much
longer TC to saturate the table.
When we move to a high thread count it's another story:
As expected the speed-up more than compensates for the filling
rate, and we expect that with tournament TC, where a single
thread is able to saturate the table, the difference will
be even stronger. For instance, the TCEC 9 super-final time
control will be 180 minutes + 15 seconds, and this scalability
improvement seems definitely the way to go.
So, summarizing:
GOOD:
Measured big improvement in high core scenario
Suitable for TCEC 9 superfinal (big hardware, very long TC)
Consistent and natural patch that extends to counterMoveHistory
what we already do for the remaining history tables, which are all per-thread
Non-functional change for the common case of a single core
Very simple (just 6 lines modified, no added ones)
BAD:
Small regression (within 2-3 ELO) with few threads and short TC
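A minimal sketch of the shape of the change (the names follow Stockfish conventions, but this is illustrative, not the verbatim patch):

    // Before: a single CounterMoveHistoryStats instance shared by all
    // threads, causing memory contention on writes.
    // After: each Thread owns its own copy.
    struct CounterMoveHistoryStats { /* large per-piece/per-square table */ };

    struct Thread {
        // ...the other history tables, already per-thread...
        CounterMoveHistoryStats counterMoveHistory; // now per-thread too
    };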
This is the most compact and neatest version
I was able to produce.
On normal builds I have a small slowdown:
normal builds base vs. simplification (gcc 4.8.1 Win7-64 i7-3770 @ 3.4GHz x86-64-modern)
Results for 20 tests for each version:
        Base     Test     Diff
Mean    1974744  1969333  5411
StDev   11825    10281    5874
p-value: 0.178
speedup: -0.003
On pgo-builds, however, I measure a nice 1.1% speedup
pgo-builds base vs. simplification
Results for 20 tests for each version:
        Base     Test     Diff
Mean    1974119  1995444  -21325
StDev   8703     5717     4623
p-value: 1
speedup: 0.011
Don't allow pinned pieces to attack the exchange-square as long as all
pinners (this also includes potential ones) are on their original
squares.
As soon as a pinner moves to the exchange-square or gets captured on it, we
fall back to standard SEE behaviour.
This correctly handles the majority of cases with absolute pins.
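A self-contained sketch of the rule (the Bitboard typedef and parameter names are illustrative, not the engine's actual SEE code):

    #include <cstdint>
    using Bitboard = uint64_t;

    // Pinned pieces are not counted as attackers of the exchange square
    // while every pinner still stands on its original square.
    Bitboard usable_attackers(Bitboard attackers, Bitboard pinned,
                              Bitboard pinners, Bitboard originalPinners)
    {
        if ((pinners & originalPinners) == originalPinners)
            return attackers & ~pinned;

        // A pinner moved to the exchange square or was captured on it:
        // fall back to standard SEE behaviour.
        return attackers;
    }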
In evaluate, we start by initializing pos.psq_score
and adding the material imbalance. After that, we check
whether a specialized eval exists, and if so we return
that value and discard whatever we have computed until now.
It sounds more logical to first probe the material entry and
return if we have a specialized eval, and only if that is
not the case initialize the eval with some values. There is
no measurable speed difference on my computer.
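A minimal sketch of the reordering (the two helpers are hypothetical stand-ins for the material probe and the psq/imbalance terms):

    #include <optional>

    using Value = int;

    std::optional<Value> specialized_eval() { return std::nullopt; } // stub
    Value psq_score_plus_imbalance()        { return 0; }            // stub

    Value evaluate() {
        // Probe the material entry first: if a specialized evaluation
        // exists, return it without computing anything else.
        if (auto v = specialized_eval())
            return *v;

        // Only now initialize the eval with psq score and imbalance.
        Value value = psq_score_plus_imbalance();
        // ...rest of the evaluation...
        return value;
    }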
Marco Costalba [Sat, 3 Sep 2016 06:19:17 +0000 (08:19 +0200)]
Fix syzygy with partial TB
If an incomplete set of 6-men tables is installed and a 6-piece
position for which no corresponding tablebase exists is on the board,
the engine is not using any syzygy at all.
Reported by Jouni Uski, fix by Peter Österlund,
confirmed as a bug by Ronald de Man.
If the opponent has a cramped position, opening a file often
helps him/her to exchange pieces, so it makes sense to reduce
the space bonus if there are open files.
Credits: Leonardo Ljubičić for the strategic idea, Alain Savard for the
implementation of the open files calculation, "CrunchyNYC" for the
compensation of the numerator.
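A sketch of the idea (the weighting and scaling here are illustrative assumptions, not the exact committed formula):

    // Each open file reduces the weight of the space bonus, since open
    // files help a cramped opponent exchange pieces.
    int space_weight(int pieceCount, int openFiles) {
        return pieceCount - 2 * openFiles;
    }

    int space_score(int safeSquares, int pieceCount, int openFiles) {
        int weight = space_weight(pieceCount, openFiles);
        return safeSquares * weight * weight / 16; // illustrative scaling
    }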
This greatly simplifies usage because it hides the
implementation-specific CheckInfo from the search.
This is based on the work done by Marco in pull request #716,
implementing on top of it the ideas in the discussion: caching
the calls to slider_blockers() in the CheckInfo structure,
and simplifying the slider_blockers() function by removing its
first parameter.
Compared to master, bench is identical but the number of calls
to slider_blockers() during bench goes down from 22461515 to 18853422,
hopefully making it a little bit faster overall.
archlinux, gcc-6
make profile-build ARCH=x86-64-bmi2
50 runs each
bench:
base = 2356320 +/- 981
test = 2403811 +/- 981
diff = 47490 +/- 1828
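An illustrative caching pattern in the same spirit (the structures here are stand-ins, not the actual CheckInfo layout):

    #include <cstdint>
    using Bitboard = uint64_t;

    Bitboard compute_slider_blockers() { return 0; } // stand-in computation

    struct CheckInfo {
        bool     blockersComputed = false;
        Bitboard blockers         = 0;
    };

    // Repeated calls reuse the cached result instead of recomputing it,
    // which is what reduces the call count during bench.
    Bitboard slider_blockers(CheckInfo& ci) {
        if (!ci.blockersComputed) {
            ci.blockers         = compute_slider_blockers();
            ci.blockersComputed = true;
        }
        return ci.blockers;
    }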
Marco Costalba [Fri, 19 Aug 2016 10:17:38 +0000 (12:17 +0200)]
Make engine ONE_PLY value independent
This non-functional patch is deep work to make SF independent
of the actual value of ONE_PLY (currently set to 1). I have verified SF is
now independent for ONE_PLY values 1, 2, 4, 8, 16, 32 and 256.
This patch gives consistency to search code and enables future work, opening
the door to safely tweaking the ONE_PLY value for any reason.
Verified for no speed regression at STC:
LLR: 2.95 (-2.94,2.94) [-3.00,1.00]
Total: 95643 W: 17728 L: 17737 D: 60178
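The core discipline, as a small sketch (reduced_depth() is a hypothetical example, not an engine function):

    using Depth = int;
    constexpr Depth ONE_PLY = 1; // could become 2, 4, ... without breaking search

    // Wrong: 'd - 2' silently assumes ONE_PLY == 1.
    // Right: express every depth constant in ONE_PLY units.
    Depth reduced_depth(Depth d) {
        return d - 2 * ONE_PLY;
    }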
Marco Costalba [Wed, 24 Aug 2016 14:52:05 +0000 (16:52 +0200)]
Reformat stats update
Rewritten so that the bonus/penalty we apply is explicit
in the search: hopefully this will lead
to further simplification/fixing of the current rather messy
stats update code.
Note: based on past experiments / patches, history pruning
is quite TC sensitive. I believe the reason for this TC dependency
is that CMH/FMH is a very large table that takes time to fill
up. In addition, having more time will increase the accuracy
of the stats' values.
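A sketch of the reshaped update, with the bonus and penalty explicit at the call site (the table and formula are illustrative):

    #include <vector>

    int history[64 * 64]; // stand-in for the real history tables

    int stat_bonus(int depth) { return depth * depth; } // illustrative formula

    void update_stats(int bestMove, const std::vector<int>& quietsSearched,
                      int depth)
    {
        int bonus = stat_bonus(depth);
        history[bestMove] += bonus;  // explicit bonus for the cutoff move
        for (int m : quietsSearched)
            if (m != bestMove)
                history[m] -= bonus; // explicit penalty for quiets tried before it
    }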
Bonus for each square that we attack in the flank where the opponent
king is. Squares that we attack twice and are not protected by an enemy
pawn count double.
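A self-contained sketch of the counting (the bitboard parameters are illustrative; a GCC-style popcount intrinsic is assumed):

    #include <cstdint>
    using Bitboard = uint64_t;

    int flank_attack_count(Bitboard attackedOnce, Bitboard attackedTwice,
                           Bitboard defendedByEnemyPawn, Bitboard kingFlank)
    {
        Bitboard once  = attackedOnce & kingFlank;
        // Squares attacked twice and not protected by an enemy pawn
        // are counted a second time.
        Bitboard twice = attackedTwice & kingFlank & ~defendedByEnemyPawn;
        return __builtin_popcountll(once) + __builtin_popcountll(twice);
    }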
Consider a check given by a rook or a minor piece to be a "safe check"
also in the case where it is supported by another piece
and given on a square defended only by the enemy queen.
VoyagerOne [Thu, 23 Jun 2016 14:08:43 +0000 (10:08 -0400)]
Comment out a redundant condition
Take advantage of the fact that VALUE_NONE = 32002 to remove
the condition.
It is commented out rather than removed because it is tricky
to rely on the hidden value of VALUE_NONE, and the code
could break if we change VALUE_NONE in the future.
ajithcj [Fri, 10 Jun 2016 18:10:40 +0000 (18:10 +0000)]
Don't insert pv back into tt
This code was added before the accurate-PV patch, when
we retrieved the PV directly from the TT.
It's not required for correct (and long) PVs any more and
it should be safe to remove.
Also, allowing helper threads to repeatedly overwrite
the TT doesn't seem to make sense (that was probably an unintended
side-effect of Lazy SMP). Before Lazy SMP, only the main thread used
to run the ID loop and insert the PV into the TT.
VoyagerOne [Mon, 6 Jun 2016 13:39:26 +0000 (09:39 -0400)]
Tweak check extension condition
There are two concepts in this patch:
a) Limit check extensions by using the move count.
The idea is to limit search explosion.
b) Always extend when the first move gives check.
The idea is to save expensive SEE calls, since the vast
majority of first moves will have SEE value >= 0, and the
first move may still be strong even if its SEE is negative.
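A sketch of the combined condition (the exact form and the move-count limit are assumptions based on the description above; in the engine the SEE test would be evaluated lazily, only when needed):

    bool extend_check(bool givesCheck, int moveCount, int moveCountLimit,
                      bool seeNonNegative)
    {
        // The first move is always extended when it gives check, skipping
        // the SEE call; later moves need a reasonable move count and a
        // non-losing SEE.
        return givesCheck
            && (moveCount == 1
                || (moveCount < moveCountLimit && seeNonNegative));
    }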
Alain SAVARD [Sat, 4 Jun 2016 13:57:17 +0000 (09:57 -0400)]
Small Queen simplification
Moving a few lines from evaluate_threats to evaluate_pieces allows us to:
a) remove the condition pos.count<QUEEN>(Them) == 1
b) use the precalculated s instead of pos.square<QUEEN>(Them)
c) not check the condition at all in queenless endings
Marco Costalba [Sun, 29 May 2016 08:10:39 +0000 (10:10 +0200)]
Fix syzygy DTZ bug
In this position: 3K4/8/3k4/8/4p3/4B3/5P2/8 w - - 0 5
Current DTZ probe returns 1 instead of 15
What happens is that the double push f4 is erroneously detected as a winning move.
After the push we have:
[D]3K4/8/3k4/8/4pP2/4B3/8/8 b - f3 0 5
And here the code misses the possible ep capture exf3.
The bug is in probe_dtz_no_ep(), where probe_ab() is used; probe_ab() is
blind to ep captures, so it returns v == 2 (win) for the position
3K4/8/3k4/8/4pP2/4B3/8/8 b - f3 0 5
Note that at the caller site the original position did not have any
possible ep capture, so probe_dtz() returns immediately after calling
probe_dtz_no_ep().
The fix is to call the ep-aware probe_wdl() instead of probe_ab()
I have verified that DTZ is correct now and also that there are no more
mismatches compared to the new 'syzygy' branch. Tested on a set of
more than 600 endgame positions, including some tricky ones.
For people interested in redoing the test or doing additional tests,
please pull branch tb_dbg from the https://github.com/mcostalba/Stockfish repo.
bench: 8450534 (bench unaffected because syzygy is not exercised during bench)
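The fix in diff form (the call signatures are assumptions reconstructed from the message above, not the verbatim patch):

    // in probe_dtz_no_ep():
    -    v = probe_ab(pos, -2, 2, &success);   // blind to ep captures
    +    v = probe_wdl(pos, &success);         // ep-aware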
Alain SAVARD [Fri, 6 May 2016 13:40:56 +0000 (09:40 -0400)]
Unsafe checks
Introducing a new multi-purpose penalty related to king safety, which
includes all kinds of potential checks (from unsafe or unavailable
squares currently occupied by some other piece).
This will indirectly detect and reward some pins, discovered checks, and
motifs such as square vacation, or a rook behind its pawn and aligned with
the king (for example Black Rg8, g7 against Kg1),
and penalize some pawn blockers (if they move, they allow a discovered
check by the pawn).
And since it also looks at protected squares, it detects some potential
defense overloading.
Finally, the rook contact checks had been removed some time ago. This
term will give a small bonus for them, as well as for bishop contact
checks.
Currently, helper threads will only search up to the
specified depth limit. Now let them search until the
main thread has finished the specified depth.
On the other hand, when selecting the best thread we don't want
to pick one with a search depth higher than the limit.
I could not find any documentation showing that prepending -mbmi to -mbmi2 is necessary or gives some benefit.
Instead, at
https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html#x86-Built-in-Functions
it says:
The following built-in functions are available when -mbmi is used. All of them generate the machine instruction that is part of the name.
unsigned int __builtin_ia32_bextr_u32(unsigned int, unsigned int);
unsigned long long __builtin_ia32_bextr_u64 (unsigned long long, unsigned long long);
The following built-in functions are available when -mbmi2 is used. All of them generate the machine instruction that is part of the name.
unsigned int _bzhi_u32 (unsigned int, unsigned int)
unsigned int _pdep_u32 (unsigned int, unsigned int)
unsigned int _pext_u32 (unsigned int, unsigned int)
unsigned long long _bzhi_u64 (unsigned long long, unsigned long long)
unsigned long long _pdep_u64 (unsigned long long, unsigned long long)
unsigned long long _pext_u64 (unsigned long long, unsigned long long)
and at
https://gcc.gnu.org/ml/gcc/2014-02/msg00204.html
( "... The real optimization comes from being able to use pext
(parallel bit extract), which can implement several bextr expressions in
parallel.")
Apart from that, we don't use all of -msse -msse2 -msse3 -msse4.2 etc. but just -msse3 (or -msse4.2) only.
As for the speedup being within noise level: this pull request is actually a reversal of mcostalba#198, wherein prepending -mbmi to -mbmi2 was claimed to be 0.3% faster, while here removing -mbmi gives a 0.4% speed gain.
Marco Costalba [Sun, 17 Apr 2016 19:31:19 +0000 (21:31 +0200)]
Fix incorrect draw detection
In this position we should have a draw by repetition:
position fen rnbqkbnr/2pppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1 moves g1f3 g8f6 f3g1
go infinite
But the latest patch broke it.
Actually we had two(!) very subtle bugs. The first is that Position::set()
clears the passed state, and in particular the 'previous' member, so
that on passing setupStates, the 'previous' pointer was reset.
The second bug is even more subtle: setupStates was based on std::vector
as its container, but when a vector grows, std::vector copies all its contents
to a new location, invalidating all references to its entries. Because
all StateInfo records are linked by the 'previous' pointer, this made pointers
go stale upon adding more elements to setupStates. So revert to using a
std::deque, which ensures references are preserved when pushing back new
elements.
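A self-contained illustration of the second bug (demo code, not the engine's):

    #include <deque>
    #include <vector>

    struct StateInfo { StateInfo* previous = nullptr; };

    int main() {
        std::vector<StateInfo> v(1);
        StateInfo next;
        next.previous = &v[0];
        v.push_back(next);         // growth may relocate all elements:
                                   // the stored 'previous' link now dangles

        std::deque<StateInfo> d(1);
        StateInfo safeNext;
        safeNext.previous = &d[0];
        d.push_back(safeNext);     // deque::push_back never invalidates
                                   // references to existing elements
        return 0;
    }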
Marco Costalba [Mon, 11 Apr 2016 14:45:36 +0000 (16:45 +0200)]
StateInfo is usually allocated on the stack by search()
And passed to do_move(); this ensures maximum efficiency and
speed and at the same time unlimited move numbers.
The drawback is that to handle Position init we need to
reserve a StateInfo inside Position itself, and use it at
init time and when copying from another Position.
After Lazy SMP we no longer need this gimmick and we can
get rid of this special case and always pass an external
StateInfo to the Position object.
Also rewritten and simplified the Position constructors.
Verified it does not regress with a 3-thread SMP test:
ELO: -0.00 +-12.7 (95%) LOS: 50.0%
Total: 1000 W: 173 L: 173 D: 654
Fix last search info carried over to mate position
When starting a search in a mate or stalemate position, Stockfish does not
even care to reinitialize and start the worker threads. However, after the search
all threads are checked for the best move.
This can lead to bestmove and info being carried over from the last
search.
Example session:
setoption name threads value 7
go movetime 4000
position startpos moves f2f3 e7e5 g2g4 d8h4
go movetime 4000
There was already a penalty for squares defended only by the king (undefended).
This patch records a penalty for completely undefended squares in the so-called extended king-ring
(so if we exclude squares defended by a Kg8, for example, we only look at h6, g6 and f6).
We also exclude squares occupied by opponent pieces in this computation,
based on the following results:
Was yellow at STC
LLR: -2.97 (-2.94,2.94) [0.00,5.00]
Total: 112499 W: 20649 L: 20293 D: 71557
On top of the usual conditions
a) some opponent pawn in front (but no lever)
b) some neighbours in front (but no neighbour behind or on the same rank)
c) below rank 5
to find out if a pawn is backward, we look at the squares in front of this pawn up to the same rank as the next neighbour.
In current master, a pawn is backward if any of those squares is controlled by an enemy pawn on an adjacent file.
In this version, a pawn is ALSO backward if any of those squares is occupied by an enemy pawn.
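A sketch of the extended test (the bitboard names are illustrative):

    #include <cstdint>
    using Bitboard = uint64_t;

    // 'frontSpan' holds the squares in front of the candidate pawn up to
    // the rank of its next neighbour (computed elsewhere).
    bool is_backward(Bitboard frontSpan, Bitboard enemyPawnAttacks,
                     Bitboard enemyPawns)
    {
        // master: backward if any span square is controlled by an enemy pawn;
        // this patch: ALSO backward if any span square is occupied by one.
        return (frontSpan & (enemyPawnAttacks | enemyPawns)) != 0;
    }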
Counter-intuitively, make build ARCH=x86-32 does NOT produce a 32-bit compile
when running a 64-bit OS. Nor would ARCH=x86-64 produce a 64-bit compile when
running a 32-bit OS (assuming it compiled without errors).
lucasart [Tue, 29 Mar 2016 12:37:42 +0000 (20:37 +0800)]
Guard against UB in lsb/msb
lsb(b) and msb(b) are undefined when b == 0. This can lead to subtle bugs, where
the resulting code behaves differently in different configurations:
- It can be the home-grown software LSB/MSB.
- It can be the compiler-generated software LSB/MSB (when using compiler
intrinsics without the right compiler flags to allow the compiler to use hardware
LSB/MSB). Which of course depends on the compiler.
- It can be the hardware LSB/MSB generated by the compiler.
- Not to mention that hardware LSB/MSB can return different values on different
hardware when b == 0.
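A sketch of the guard (assuming a GCC-style intrinsic): assert on zero input so every configuration fails loudly instead of silently diverging.

    #include <cassert>
    #include <cstdint>

    int lsb(uint64_t b) {
        assert(b);                  // lsb(0) is undefined in every backend
        return __builtin_ctzll(b);  // hardware/compiler intrinsic path
    }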