Marco Costalba [Mon, 29 Aug 2016 07:11:20 +0000 (09:11 +0200)]
Use per-thread counterMoveHistory
Drops a scalability bottleneck due to memory contention
on a single table shared across threads. The effect starts
to become noticeable with a high number of threads. Specifically,
we have a small regression with 7 threads, both at 60 and
180 seconds TC:
We have a regression because the counterMoveHistory table is
quite big and it takes time for a single thread to fill it.
Sharing the table yields a higher fill rate and better
quality of moves, and up to 7 threads the benefits of sharing
more than compensate for the loss in speed due to contention.
Interestingly, even with a 3x longer TC, so with more time
for the single thread to catch up, the improvement is quite
limited and below noise level. It seems we really need a much
longer TC to saturate the table.
When we move to a high thread count it's another story:
As expected the speed-up more than compensates for the filling
rate, and we expect that with tournament TC, where a single
thread is able to saturate the table, the difference will
be even stronger. For instance, the TCEC 9 super-final time
control will be 180 minutes + 15 seconds, and this scalability
improvement seems definitely the way to go.
So, summarizing:
GOOD:
Measured big improvement in high core scenario
Suitable for TCEC 9 superfinal (big hardware, very long TC)
Consistent and natural patch that extends to counterMoveHistory
what we already do for the remaining history tables, which are all per-thread
Non-functional change for the common case of a single core
Very simple (just 6 lines modified, no added ones)
BAD:
Small regression (within 2-3 ELO) with few threads and short TC
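A minimal sketch of the shape of the change (the names follow Stockfish conventions, but this is illustrative, not the verbatim patch):

    // Before: a single CounterMoveHistoryStats instance shared by all
    // threads, causing memory contention on writes.
    // After: each Thread owns its own copy.
    struct CounterMoveHistoryStats { /* large per-piece/per-square table */ };

    struct Thread {
        // ...the other history tables, already per-thread...
        CounterMoveHistoryStats counterMoveHistory; // now per-thread too
    };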
This is the most compact and neatest version
I was able to produce.
On normal builds I have a small slowdown:
normal builds base vs. simplification (gcc 4.8.1 Win7-64 i7-3770 @ 3.4GHz x86-64-modern)
Results for 20 tests for each version:
        Base     Test     Diff
Mean    1974744  1969333  5411
StDev   11825    10281    5874
p-value: 0.178
speedup: -0.003
On pgo-builds, however, I measure a nice 1.1% speedup
pgo-builds base vs. simplification
Results for 20 tests for each version:
        Base     Test     Diff
Mean    1974119  1995444  -21325
StDev   8703     5717     4623
p-value: 1
speedup: 0.011
Don't allow pinned pieces to attack the exchange-square as long as all
pinners (this also includes potential ones) are on their original
squares.
As soon as a pinner moves to the exchange-square or gets captured on it, we
fall back to standard SEE behaviour.
This correctly handles the majority of cases with absolute pins.
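A self-contained sketch of the rule (the Bitboard typedef and parameter names are illustrative, not the engine's actual SEE code):

    #include <cstdint>
    using Bitboard = uint64_t;

    // Pinned pieces are not counted as attackers of the exchange square
    // while every pinner still stands on its original square.
    Bitboard usable_attackers(Bitboard attackers, Bitboard pinned,
                              Bitboard pinners, Bitboard originalPinners)
    {
        if ((pinners & originalPinners) == originalPinners)
            return attackers & ~pinned;

        // A pinner moved to the exchange square or was captured on it:
        // fall back to standard SEE behaviour.
        return attackers;
    }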
In evaluate, we start by initializing pos.psq_score
and adding the material imbalance. After that, we check
whether a specialized eval exists, and if so we return
that value and discard whatever we have computed until now.
It sounds more logical to first probe the material entry and
return if we have a specialized eval, and only if that is
not the case initialize the eval with some values. There is
no measurable speed difference on my computer.
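A minimal sketch of the reordering (the two helpers are hypothetical stand-ins for the material probe and the psq/imbalance terms):

    #include <optional>

    using Value = int;

    std::optional<Value> specialized_eval() { return std::nullopt; } // stub
    Value psq_score_plus_imbalance()        { return 0; }            // stub

    Value evaluate() {
        // Probe the material entry first: if a specialized evaluation
        // exists, return it without computing anything else.
        if (auto v = specialized_eval())
            return *v;

        // Only now initialize the eval with psq score and imbalance.
        Value value = psq_score_plus_imbalance();
        // ...rest of the evaluation...
        return value;
    }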
Marco Costalba [Sat, 3 Sep 2016 06:19:17 +0000 (08:19 +0200)]
Fix syzygy with partial TB
If an incomplete set of 6-men tables is installed and a 6-piece
position for which no corresponding tablebase exists is on the board,
the engine is not using any syzygy at all.
Reported by Jouni Uski, fix by Peter Österlund,
confirmed as a bug by Ronald de Man.
If the opponent has a cramped position, opening a file often
helps him/her to exchange pieces, so it makes sense to reduce
the space bonus if there are open files.
Credits: Leonardo Ljubičić for the strategic idea, Alain Savard for the
implementation of the open files calculation, "CrunchyNYC" for the
compensation of the numerator.
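A sketch of the idea (the weighting and scaling here are illustrative assumptions, not the exact committed formula):

    // Each open file reduces the weight of the space bonus, since open
    // files help a cramped opponent exchange pieces.
    int space_weight(int pieceCount, int openFiles) {
        return pieceCount - 2 * openFiles;
    }

    int space_score(int safeSquares, int pieceCount, int openFiles) {
        int weight = space_weight(pieceCount, openFiles);
        return safeSquares * weight * weight / 16; // illustrative scaling
    }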
This greatly simplifies usage because it hides the
implementation-specific CheckInfo from the search.
This is based on the work done by Marco in pull request #716,
implementing on top of it the ideas in the discussion: caching
the calls to slider_blockers() in the CheckInfo structure,
and simplifying the slider_blockers() function by removing its
first parameter.
Compared to master, bench is identical but the number of calls
to slider_blockers() during bench goes down from 22461515 to 18853422,
hopefully making it a little bit faster overall.
archlinux, gcc-6
make profile-build ARCH=x86-64-bmi2
50 runs each
bench:
base = 2356320 +/- 981
test = 2403811 +/- 981
diff = 47490 +/- 1828
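An illustrative caching pattern in the same spirit (the structures here are stand-ins, not the actual CheckInfo layout):

    #include <cstdint>
    using Bitboard = uint64_t;

    Bitboard compute_slider_blockers() { return 0; } // stand-in computation

    struct CheckInfo {
        bool     blockersComputed = false;
        Bitboard blockers         = 0;
    };

    // Repeated calls reuse the cached result instead of recomputing it,
    // which is what reduces the call count during bench.
    Bitboard slider_blockers(CheckInfo& ci) {
        if (!ci.blockersComputed) {
            ci.blockers         = compute_slider_blockers();
            ci.blockersComputed = true;
        }
        return ci.blockers;
    }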
Marco Costalba [Fri, 19 Aug 2016 10:17:38 +0000 (12:17 +0200)]
Make engine ONE_PLY value independent
This non-functional patch is deep work to make SF independent
of the actual value of ONE_PLY (currently set to 1). I have verified SF is
now independent for ONE_PLY values 1, 2, 4, 8, 16, 32 and 256.
This patch gives consistency to search code and enables future work, opening
the door to safely tweaking the ONE_PLY value for any reason.
Verified for no speed regression at STC:
LLR: 2.95 (-2.94,2.94) [-3.00,1.00]
Total: 95643 W: 17728 L: 17737 D: 60178
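The core discipline, as a small sketch (reduced_depth() is a hypothetical example, not an engine function):

    using Depth = int;
    constexpr Depth ONE_PLY = 1; // could become 2, 4, ... without breaking search

    // Wrong: 'd - 2' silently assumes ONE_PLY == 1.
    // Right: express every depth constant in ONE_PLY units.
    Depth reduced_depth(Depth d) {
        return d - 2 * ONE_PLY;
    }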
Marco Costalba [Wed, 24 Aug 2016 14:52:05 +0000 (16:52 +0200)]
Reformat stats update
Rewritten so that the bonus/penalty we apply is explicit
in the search: hopefully this will lead
to further simplification/fixing of the current rather messy
stats update code.
Note: based on past experiments / patches, history pruning
is quite TC sensitive. I believe the reason for this TC dependency
is that CMH/FMH is a very large table that takes time to fill
up. In addition, having more time will increase the accuracy
of the stats' values.
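A sketch of the reshaped update, with the bonus and penalty explicit at the call site (the table and formula are illustrative):

    #include <vector>

    int history[64 * 64]; // stand-in for the real history tables

    int stat_bonus(int depth) { return depth * depth; } // illustrative formula

    void update_stats(int bestMove, const std::vector<int>& quietsSearched,
                      int depth)
    {
        int bonus = stat_bonus(depth);
        history[bestMove] += bonus;  // explicit bonus for the cutoff move
        for (int m : quietsSearched)
            if (m != bestMove)
                history[m] -= bonus; // explicit penalty for quiets tried before it
    }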
Bonus for each square that we attack in the flank where the opponent
king is. Squares that we attack twice and are not protected by an enemy
pawn count double.
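A self-contained sketch of the counting (the bitboard parameters are illustrative; a GCC-style popcount intrinsic is assumed):

    #include <cstdint>
    using Bitboard = uint64_t;

    int flank_attack_count(Bitboard attackedOnce, Bitboard attackedTwice,
                           Bitboard defendedByEnemyPawn, Bitboard kingFlank)
    {
        Bitboard once  = attackedOnce & kingFlank;
        // Squares attacked twice and not protected by an enemy pawn
        // are counted a second time.
        Bitboard twice = attackedTwice & kingFlank & ~defendedByEnemyPawn;
        return __builtin_popcountll(once) + __builtin_popcountll(twice);
    }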
Consider a check given by a rook or a minor piece to be a "safe check"
also in the case where it is supported by another piece
and given on a square defended only by the enemy queen.
VoyagerOne [Thu, 23 Jun 2016 14:08:43 +0000 (10:08 -0400)]
Comment out a redundant condition
Take advantage of the fact that VALUE_NONE = 32002 to remove
the condition.
It is commented out rather than removed because it is tricky
to rely on the hidden value of VALUE_NONE, and the code
could break if we change VALUE_NONE in the future.
ajithcj [Fri, 10 Jun 2016 18:10:40 +0000 (18:10 +0000)]
Don't insert pv back into tt
This code was added before the accurate-PV patch, when
we retrieved the PV directly from the TT.
It's not required for correct (and long) PVs any more and
it should be safe to remove.
Also, allowing helper threads to repeatedly overwrite
the TT doesn't seem to make sense (that was probably an unintended
side-effect of Lazy SMP). Before Lazy SMP, only the main thread used
to run the ID loop and insert the PV into the TT.
VoyagerOne [Mon, 6 Jun 2016 13:39:26 +0000 (09:39 -0400)]
Tweak check extension condition
There are two concepts in this patch:
a) Limit check extensions by using the move count.
The idea is to limit search explosion.
b) Always extend when the first move gives check.
The idea is to save expensive SEE calls, since the vast
majority of first moves will have SEE value >= 0, and the
first move may still be strong even if its SEE is negative.
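A sketch of the combined condition (the exact form and the move-count limit are assumptions based on the description above; in the engine the SEE test would be evaluated lazily, only when needed):

    bool extend_check(bool givesCheck, int moveCount, int moveCountLimit,
                      bool seeNonNegative)
    {
        // The first move is always extended when it gives check, skipping
        // the SEE call; later moves need a reasonable move count and a
        // non-losing SEE.
        return givesCheck
            && (moveCount == 1
                || (moveCount < moveCountLimit && seeNonNegative));
    }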
Alain SAVARD [Sat, 4 Jun 2016 13:57:17 +0000 (09:57 -0400)]
Small Queen simplification
Moving a few lines from evaluate_threats to evaluate_pieces allows us to:
a) remove the condition pos.count<QUEEN>(Them) == 1
b) use the precalculated s instead of pos.square<QUEEN>(Them)
c) not check the condition at all in queenless endings
Marco Costalba [Sun, 29 May 2016 08:10:39 +0000 (10:10 +0200)]
Fix syzygy DTZ bug
In this position: 3K4/8/3k4/8/4p3/4B3/5P2/8 w - - 0 5
Current DTZ probe returns 1 instead of 15
What happens is that the double push f4 is erroneously detected as a winning move.
After the push we have:
[D]3K4/8/3k4/8/4pP2/4B3/8/8 b - f3 0 5
And here the code misses the possible ep capture exf3.
The bug is in probe_dtz_no_ep(), where probe_ab() is used; probe_ab() is
blind to ep captures, so it returns v == 2 (win) for the position
3K4/8/3k4/8/4pP2/4B3/8/8 b - f3 0 5
Note that at the caller site the original position did not have any
possible ep capture, so probe_dtz() returns immediately after calling
probe_dtz_no_ep().
The fix is to call the ep-aware probe_wdl() instead of probe_ab()
I have verified that DTZ is correct now and also that there are no more
mismatches compared to the new 'syzygy' branch. Tested on a set of
more than 600 endgame positions, including some tricky ones.
For people interested in redoing the test or doing additional tests,
please pull branch tb_dbg from the https://github.com/mcostalba/Stockfish repo.
bench: 8450534 (bench unaffected because syzygy is not exercised during bench)
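The fix in diff form (the call signatures are assumptions reconstructed from the message above, not the verbatim patch):

    // in probe_dtz_no_ep():
    -    v = probe_ab(pos, -2, 2, &success);   // blind to ep captures
    +    v = probe_wdl(pos, &success);         // ep-aware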
Alain SAVARD [Fri, 6 May 2016 13:40:56 +0000 (09:40 -0400)]
Unsafe checks
Introducing a new multi-purpose penalty related to king safety, which
includes all kinds of potential checks (from unsafe or unavailable
squares currently occupied by some other piece).
This will indirectly detect and reward some pins, discovered checks, and
motifs such as square vacation, or a rook behind its pawn and aligned with
the king (for example Black Rg8, g7 against Kg1),
and penalize some pawn blockers (if they move, they allow a discovered
check by the pawn).
And since it also looks at protected squares, it detects some potential
defense overloading.
Finally, the rook contact checks had been removed some time ago. This
term will give a small bonus for them, as well as for bishop contact
checks.
Currently, helper threads will only search up to the
specified depth limit. Now let them search until the
main thread has finished the specified depth.
On the other hand, when selecting the best thread we don't want
to pick one with a search depth higher than the limit.
I could not find any documentation showing that prepending -mbmi to -mbmi2 is necessary or gives some benefit.
Instead, at
https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html#x86-Built-in-Functions
it says:
The following built-in functions are available when -mbmi is used. All of them generate the machine instruction that is part of the name.
unsigned int __builtin_ia32_bextr_u32(unsigned int, unsigned int);
unsigned long long __builtin_ia32_bextr_u64 (unsigned long long, unsigned long long);
The following built-in functions are available when -mbmi2 is used. All of them generate the machine instruction that is part of the name.
unsigned int _bzhi_u32 (unsigned int, unsigned int)
unsigned int _pdep_u32 (unsigned int, unsigned int)
unsigned int _pext_u32 (unsigned int, unsigned int)
unsigned long long _bzhi_u64 (unsigned long long, unsigned long long)
unsigned long long _pdep_u64 (unsigned long long, unsigned long long)
unsigned long long _pext_u64 (unsigned long long, unsigned long long)
and at
https://gcc.gnu.org/ml/gcc/2014-02/msg00204.html
( "... The real optimization comes from being able to use pext
(parallel bit extract), which can implement several bextr expressions in
parallel.")
Apart from that, we don't use all of -msse -msse2 -msse3 -msse4.2 etc. but just -msse3 (or -msse4.2) only.
As for the speedup being within noise level: this pull request is actually a reversal of mcostalba#198, wherein prepending -mbmi to -mbmi2 was claimed to be 0.3% faster, while here removing -mbmi gives a 0.4% speed gain.
Marco Costalba [Sun, 17 Apr 2016 19:31:19 +0000 (21:31 +0200)]
Fix incorrect draw detection
In this position we should have a draw by repetition:
position fen rnbqkbnr/2pppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1 moves g1f3 g8f6 f3g1
go infinite
But the latest patch broke it.
Actually we had two(!) very subtle bugs. The first is that Position::set()
clears the passed state, and in particular the 'previous' member, so
that on passing setupStates, the 'previous' pointer was reset.
The second bug is even more subtle: setupStates was based on std::vector
as its container, but when a vector grows, std::vector copies all its contents
to a new location, invalidating all references to its entries. Because
all StateInfo records are linked by the 'previous' pointer, this made pointers
go stale upon adding more elements to setupStates. So revert to using a
std::deque, which ensures references are preserved when pushing back new
elements.
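A self-contained illustration of the second bug (demo code, not the engine's):

    #include <deque>
    #include <vector>

    struct StateInfo { StateInfo* previous = nullptr; };

    int main() {
        std::vector<StateInfo> v(1);
        StateInfo next;
        next.previous = &v[0];
        v.push_back(next);         // growth may relocate all elements:
                                   // the stored 'previous' link now dangles

        std::deque<StateInfo> d(1);
        StateInfo safeNext;
        safeNext.previous = &d[0];
        d.push_back(safeNext);     // deque::push_back never invalidates
                                   // references to existing elements
        return 0;
    }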
Marco Costalba [Mon, 11 Apr 2016 14:45:36 +0000 (16:45 +0200)]
StateInfo is usually allocated on the stack by search()
And passed to do_move(); this ensures maximum efficiency and
speed and at the same time unlimited move numbers.
The drawback is that to handle Position init we need to
reserve a StateInfo inside Position itself, and use it at
init time and when copying from another Position.
After Lazy SMP we no longer need this gimmick and we can
get rid of this special case and always pass an external
StateInfo to the Position object.
Also rewritten and simplified the Position constructors.
Verified it does not regress with a 3-thread SMP test:
ELO: -0.00 +-12.7 (95%) LOS: 50.0%
Total: 1000 W: 173 L: 173 D: 654
Fix last search info carried over to mate position
When starting a search in a mate or stalemate position, Stockfish does not
even care to reinitialize and start the worker threads. However, after the search
all threads are checked for the best move.
This can lead to bestmove and info being carried over from the last
search.
Example session:
setoption name threads value 7
go movetime 4000
position startpos moves f2f3 e7e5 g2g4 d8h4
go movetime 4000
There was already a penalty for squares defended only by the king (undefended).
This patch records a penalty for completely undefended squares in the so-called extended king-ring
(so if we exclude squares defended by a Kg8, for example, we only look at h6, g6 and f6).
We also exclude squares occupied by opponent pieces in this computation,
based on the following results:
Was yellow at STC
LLR: -2.97 (-2.94,2.94) [0.00,5.00]
Total: 112499 W: 20649 L: 20293 D: 71557
On top of the usual conditions
a) some opponent pawn in front (but no lever)
b) some neighbours in front (but no neighbour behind or on the same rank)
c) below rank 5
to find out if a pawn is backward, we look at the squares in front of this pawn up to the same rank as the next neighbour.
In current master, a pawn is backward if any of those squares is controlled by an enemy pawn on an adjacent file.
In this version, a pawn is ALSO backward if any of those squares is occupied by an enemy pawn.
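A sketch of the extended test (the bitboard names are illustrative):

    #include <cstdint>
    using Bitboard = uint64_t;

    // 'frontSpan' holds the squares in front of the candidate pawn up to
    // the rank of its next neighbour (computed elsewhere).
    bool is_backward(Bitboard frontSpan, Bitboard enemyPawnAttacks,
                     Bitboard enemyPawns)
    {
        // master: backward if any span square is controlled by an enemy pawn;
        // this patch: ALSO backward if any span square is occupied by one.
        return (frontSpan & (enemyPawnAttacks | enemyPawns)) != 0;
    }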
Counter-intuitively, make build ARCH=x86-32 does NOT produce a 32-bit compile
when running a 64-bit OS. Nor would ARCH=x86-64 produce a 64-bit compile when
running a 32-bit OS (assuming it compiled without errors).
lucasart [Tue, 29 Mar 2016 12:37:42 +0000 (20:37 +0800)]
Guard against UB in lsb/msb
lsb(b) and msb(b) are undefined when b == 0. This can lead to subtle bugs, where
the resulting code behaves differently in different configurations:
- It can be the home-grown software LSB/MSB.
- It can be the compiler-generated software LSB/MSB (when using compiler
intrinsics without the right compiler flags to allow the compiler to use hardware
LSB/MSB). Which of course depends on the compiler.
- It can be the hardware LSB/MSB generated by the compiler.
- Not to mention that hardware LSB/MSB can return different values on different
hardware when b == 0.
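A sketch of the guard (assuming a GCC-style intrinsic): assert on zero input so every configuration fails loudly instead of silently diverging.

    #include <cassert>
    #include <cstdint>

    int lsb(uint64_t b) {
        assert(b);                  // lsb(0) is undefined in every backend
        return __builtin_ctzll(b);  // hardware/compiler intrinsic path
    }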