I could not find anything documented that is necessary that prepending -mbmi to -mbmi2 gives some benefit.
Instead at
https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html#x86-Built-in-Functions
The following built-in functions are available when -mbmi is used. All of them generate the machine instruction that is part of the name.
unsigned int __builtin_ia32_bextr_u32(unsigned int, unsigned int);
unsigned long long __builtin_ia32_bextr_u64 (unsigned long long, unsigned long long);
The following built-in functions are available when -mbmi2 is used. All of them generate the machine instruction that is part of the name.
unsigned int _bzhi_u32 (unsigned int, unsigned int)
unsigned int _pdep_u32 (unsigned int, unsigned int)
unsigned int _pext_u32 (unsigned int, unsigned int)
unsigned long long _bzhi_u64 (unsigned long long, unsigned long long)
unsigned long long _pdep_u64 (unsigned long long, unsigned long long)
unsigned long long _pext_u64 (unsigned long long, unsigned long long)
and at
https://gcc.gnu.org/ml/gcc/2014-02/msg00204.html
( "... The real optimization comes from being able to use pext
(parallel bit extract), which can implement several bextr expressions in
parallel.")
Apart from that we don't use all -msse -msse2 -msse3 -msse4.2 etc. but just -msse3 (or -msse4.2) only.
As regards to the speedup within noise level - this pull request is actually reversal of mcostalba#198 wherein prepending -mbmi to -mbmi2 was claimed to be 0.3% faster and here (removing -mbmi) gives 0.4% speed gain.
Marco Costalba [Sun, 17 Apr 2016 19:31:19 +0000 (21:31 +0200)]
Fix incorrect draw detection
In this position we should have draw for repetition:
position fen rnbqkbnr/2pppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1 moves g1f3 g8f6 f3g1
go infinite
But latest patch broke it.
Actually we had two(!) very subtle bugs, the first is that Position::set()
clears the passed state and in particular 'previous' member, so
that on passing setupStates, 'previous' pointer was reset.
Second bug is even more subtle: SetupStates was based on std::vector
as container, but when vector grows, std::vector copies all its contents
to a new location invalidating all references to its entries. Because
all StateInfo records are linked by 'previous' pointer, this made pointers
go stale upon adding more element to setupStates. So revert to use a
std::deque that ensures references are preserved when pushing back new
elements.
Marco Costalba [Mon, 11 Apr 2016 14:45:36 +0000 (16:45 +0200)]
StateInfo is usually allocated on the stack by search()
And passed in do_move(), this ensures maximum efficiency and
speed and at the same time unlimited move numbers.
The draw back is that to handle Position init we need to
reserve a StateInfo inside Position itself and use at
init time and when copying from another Position.
After lazy SMP we don't need anymore this gimmick and we can
get rid of this special case and always pass an external
StateInfo to Position object.
Also rewritten and simplified Position constructors.
Verified it does not regress with a 3 threads SMP test:
ELO: -0.00 +-12.7 (95%) LOS: 50.0%
Total: 1000 W: 173 L: 173 D: 654
Fix last search info carried over to mate position
When starting search in a mate or stalemate position, Stockfish does not
even care to reinitialize and start worker threads. However after search
all threads are checked for the best move.
This can lead to bestmove and info beeing carried over from the last
search.
Example session:
setoption name threads value 7
go movetime 4000
position startpos moves f2f3 e7e5 g2g4 d8h4
go movetime 4000
There was already a penalty for squares only defended by King (undefended)
This test records a penalty for completely undefended squares in the so called extended king-ring
(so if we exclude squares defended by a Kg8 for example, we only look at h6 g6 and f6)
We also exclude squares occupied by opponent pieces in this computation,
based on the following results
Was yellow at STC
LLR: -2.97 (-2.94,2.94) [0.00,5.00]
Total: 112499 W: 20649 L: 20293 D: 71557
On top of the usual conditions
a) some opponent in front (but no lever)
b) some neighbours (in front) (but no neighbour behind or same rank)
c) < rank_5
to find out if a pawn is backward we look at the squares in front of this pawn to reach the same rank as the next neighbour.
In current master, a pawn is backward if any of those squares is controlled by an enemy pawn on an adjacent file
In this version, a pawn is ALSO backward if any of those squares is occupied by an enemy pawn.
Counter intuitively, make build ARCH=x86-32 does NOT produce a 32-bit compile
when running a 64-bit OS. Nor would ARCH=x86-64 produce a 64-bit compile when
running a 32-bit OS (assuming it compiled w/o errors).
lucasart [Tue, 29 Mar 2016 12:37:42 +0000 (20:37 +0800)]
Guard against UB in lsb/msb
lsb(b) and msb(b) are undefined when b == 0. This can lead to subtle bugs, where
the resulting code behaves differently on different configurations:
- It can be the home grown software LSB/MSB
- It can be the compiler generated software LSB/MSB (when using compiler
intrinsics without the right compiler flags to allow compiler to use hardware
LSB/MSB). Which of course depends on the compiler.
- It can be hardware LSB/MSB generated by the compiler.
- Not to mention that hardware LSB/MSB can return different value on different
hardware when b == 0.
VoyagerOne [Sun, 14 Feb 2016 19:45:10 +0000 (19:45 +0000)]
Fix futility pruning bug
PredictedDepth can be negative, causing the futility_margin to be negative.
It will be very difficult to tweak moveCount pruning and reduction formula, as they are tuned to prevent this behavior.
lucasart [Mon, 18 Jan 2016 22:20:16 +0000 (22:20 +0000)]
Retire RootNode template
There is no reason to compile 3 different copies of search(). PV nodes are on
the cold path, and PvNode is a template parameter, so there is no cost in
computing:
const bool RootNode = PvNode && (ss-1)->ply == 0;
And this simplifies code a tiny bit as well.
Speed impact is negligible on my machine (i7-3770k, linux 4.2, gcc 5.2):
Marco Costalba [Wed, 23 Dec 2015 09:07:54 +0000 (10:07 +0100)]
Move some globals into main thread scope
Make it explicit that those variables are not globals, but
are used only by main thread. I think it is a sensible
clarification because easy move is already tricky enough
and current patch makes the involved actors explicit.
Alain SAVARD [Tue, 22 Dec 2015 10:00:04 +0000 (10:00 +0000)]
New mobility bonus
Tuned the global mobility factor for each piece, as well as some +- delta,
The master mobility factor was {266,334} and tuning gave
{267, 362} +S(-2,-2) for the Knight
{249, 328} +S( 0,-2) for the Bishop
{298, 353} +S(1,1) for the Rook
{265, 358} +S(2,-1) for the Queen
mstembera [Mon, 14 Dec 2015 18:34:43 +0000 (18:34 +0000)]
Fix easy move bug in SMP mode
Fix a bug where we could stop the search after only 10% of time used due to a matching easy move but later switch to a different move that was never pre-screened as easy due to SMP thread select.
Leonid Pechenik [Wed, 9 Dec 2015 07:07:34 +0000 (02:07 -0500)]
Simplify time management and fix 'ponder on' bug
Simplify time management code by removing hard stops for unchanging first root moves.
Search is now stopped earlier at the end iteration if it did not have fail-lows at root.
This simplification also fixes pondering bug. Ponder flag was true by default
and cutechess-cli doesn't change it to false even though no pondering is possible.
Fix the issue by setting the default value of 'Ponder' flag to false.
mstembera [Mon, 16 Nov 2015 23:39:11 +0000 (15:39 -0800)]
Clean up RootMove less operator
This is used by std::stable_sort() to sort moves from highest score to lowest score.
1) The comment is incorrect since highest to lowest means descending.
2) It's more natural to implement a less operator using another less operator rather than a greater operator.
lucasart [Sat, 21 Nov 2015 07:23:53 +0000 (23:23 -0800)]
Fix TT comment and static_assert()
Comment is based on a misunderstanding of what unaligned memory access is. Here
is an article that explains it very clearly:
https://www.kernel.org/doc/Documentation/unaligned-memory-access.txt
No matter how we define TTEntry or TTCluster, there will never be any unaligned
memory access. This is because the complier knows the alignment rules, and does
the necessary adjustments to make sure unaligned memory access does not occur.
The issue being adressed here has nothing to do with unaligned memory access. It
is about cache performance. In order to achieve best cache performance:
- we prefetch the cacheline as soon as possible.
- we ensure that TT clusters do not spread across two cachelines. If they did,
we would need to prefetch 2 cachelines, which could hurt cache performance.
Therefore the true conditions to achieve this are:
1/ start adress of TT is cache line aligned. void TranspositionTable::resize()
enforces this.
2/ TT cluster size should *divide* the cache line size. Currently, we pack 2
clusters per cache lines. It used to be 1 before "TT sardines". Does not matter
what the ratio is, all we want is to fit an integer number of clusters per cache
line.
Marco Costalba [Sat, 21 Nov 2015 06:48:50 +0000 (07:48 +0100)]
Rewrite how threads are spawned
Instead of creating a running std::thread and
returning, wait in Thread c'tor that the native
thread of execution goes to sleep in idle_loop().
In this way we can simplify how search is started,
because when main thread is idle we are sure also
all other threads will be idle, in any case, even
at thread creation and startup.
After lazy smp went in, we can simpify and rewrite
a lot of logic that is now no more needed. This is
hopefully the final big cleanup.
Tested for no regression at 5+0.1 with 3 threads:
LLR: 2.95 (-2.94,2.94) [-5.00,0.00]
Total: 17411 W: 3248 L: 3198 D: 10965
Marco Costalba [Mon, 9 Nov 2015 08:50:02 +0000 (09:50 +0100)]
Fix broken UCI 'wait for stop'
When we reach the maximum depth, we can finish the
search without a raise of Signals.stop. However, if
we are pondering or in an infinite search, the UCI
protocol states that we shouldn't print the best move
before the GUI sends a "stop" or "ponderhit" command.
It was broken by lazy smp. Fix it by moving the stopping
of the threads after waiting for GUI.
Marco Costalba [Tue, 3 Nov 2015 10:15:14 +0000 (11:15 +0100)]
Get rid of timer thread
Unfortunately std::condition_variable::wait_for()
is not accurate in general case and the timer thread
can wake up also after tens or even hundreds of
millisecs after time has elapsded. CPU load, process
priorities, number of concurrent threads, even from
other processes, will have effect upon it.
Even official documentation says: "This function may
block for longer than timeout_duration due to scheduling
or resource contention delays."
So retire timer and use a polling scheme based on a
local thread counter that counts search() calls and
a small trick to keep polling frequency constant,
independently from the number of threads.
Tested for no regression at very fast TC 2+0.05 th 7:
LLR: 2.96 (-2.94,2.94) [-3.00,1.00]
Total: 32969 W: 6720 L: 6620 D: 19629
Raising a pull request early as 24 Thread tests are very expensive and
this is clearly a positive gain at high thread counts and high time
controls. The change is a small parameter tweak with no additional
logic.
lucasart [Sat, 24 Oct 2015 21:50:51 +0000 (22:50 +0100)]
Use atomics instead of volatile
Rely on well defined behaviour for message passing, instead of volatile. Three
versions have been tested, to make sure this wouldn't cause a slowdown on any
platform.
v1: Sequentially consistent atomics
No mesurable regression, despite the extra memory barriers on x86. Even with 15
threads and extreme time pressure, both acting as a magnifying glass: