Correct and extend PV lines with decisive TB score
Currently (after #5407), SF has the property that any PV line with a decisive
TB score contains the corresponding TB position, with a score that correctly
identifies the depth at which TB are entered. The PV line that follows might
not preserve the game outcome, but can easily be verified and extended based on
TB information. This patch provides this functionality, simply extending the PV
lines on output, this doesn't affect search.
Indeed, if DTZ tables are available, search based PV lines that correspond to
decisive TB scores are verified to preserve game outcome, truncating the line
as needed. Subsequently, such PV lines are extended with a game outcome
preserving line until mate, as a possible continuation. These lines are not
optimal mating lines, but are similar to what a user could produce on a website
like https://syzygy-tables.info/ clicking always the top ranked move, i.e.
minimizing or maximizing DTZ (with a simple tie-breaker for moves that have
identical DTZ), and are thus an just an illustration of how to game can be won.
A similar approach is already in established in
https://github.com/joergoster/Stockfish/tree/matefish2
This also contributes to addressing #5175 where SF can give an incorrect TB
win/loss for positions in TB with a movecounter that doesn't reflect optimal
play. While the full solution requires either TB generated differently, or a
search when ranking rootmoves, current SF will eventually find a draw in these
cases, in practice quite quickly, e.g.
`1kq5/q2r4/5K2/8/8/8/8/7Q w - - 96 1`
`8/8/6k1/3B4/3K4/4N3/8/8 w - - 54 106`
Gives the same results as master on an extended set of test positions from
https://github.com/mcostalba/Stockfish/commit/9173d29c414ddb8f4bec74e4db3ccbe664c66bf9
with the exception of the above mentioned fen where this commit improves.
With https://github.com/vondele/matetrack using 6men TB, all generated PVs verify:
```
Using ../Stockfish/src/stockfish.syzygyExtend on matetrack.epd with --nodes 1000000 --syzygyPath /chess/syzygy/3-4-5-6/WDL:/chess/syzygy/3-4-5-6/DTZ
Engine ID: Stockfish dev-20240704-ff227954
Total FENs: 6555
Found mates: 3299
Best mates: 2582
Found TB wins: 568
```
As repeated DTZ probing could be slow a procedure (100ms+ on HDD, a few ms on
SSD), the extension is only done as long as the time taken is less than half
the `Move Overhead` parameter. For tournaments where these lines might be of
interest to the user, a suitable `Move Overhead` might be needed (e.g. TCEC has
1000ms already).
To avoid output that depends on timing, output currmove and similar only from depth > 30
onward. Current choice of 3s makes the output of the same search depending on
the system load, and doesn't always start at move 1. Depth 30 is nowadays
reached in a few seconds on most systems.
Follow up from #5404 ... current location leads to troubles with Aquarium GUI
Fixes #5430
Now prints the information on threads and available processors at the beginning
of search, where info about the networks is already printed (and is known to
work)
Shahin M. Shahin [Mon, 24 Jun 2024 22:57:35 +0000 (01:57 +0300)]
Limit has_game_cycle() to only upcoming repetition
use the original algorithm according to the paper
http://web.archive.org/web/20201107002606/https://marcelk.net/2013-04-06/paper/upcoming-rep-v2.pdf,
which detects accurately if a position has an upcoming repetition. The 'no
progress' part of has_game_cycle has been removed, the function has been
renamed to upcoming_repetition to reflect this.
As a result of this fix, to the best of our knowledge, all PVs for completed
iterations that yield a mate or decisive table base score now end in mate or
contain a TB position, respectively.
This patch introduces history updates to probcut. Standard depth - 3 bonus and
maluses are given to the capture that caused fail high and previously searched
captures, respectively. Similar to #5243, a negative history fill is applied to
compensate for an increase in capture history average, thus improving the
scaling of this patch.
Shahin M. Shahin [Sun, 30 Jun 2024 13:32:20 +0000 (16:32 +0300)]
Increase reduction based on correct expectation
If the current node is not a cutNode then it means that the child is one in LMR
and the cutoff count is expected, so more reduction when the cutoffs are
expected
Michael Chaly [Sun, 30 Jun 2024 09:47:04 +0000 (12:47 +0300)]
Tweak multicut
This patch is an original patch by author of Altair
(https://github.com/Alex2262/AltairChessEngine) chess engine.
It allows to produce more aggressive multicut compared to master by changing
condition it needs to fulfil and also returns bigger value. Also has applied
matetrack fix on top.
FauziAkram [Sun, 16 Jun 2024 21:03:15 +0000 (00:03 +0300)]
Internal iterative reductions: decrease depth more
For PV nodes without a ttMove, we decrease depth.
But in this patch, additionally, if the current position is found in the TT, and the stored depth in the TT is greater than or equal to
the current search depth, we decrease the search depth even further.
Shawn Xu [Sun, 16 Jun 2024 23:14:22 +0000 (16:14 -0700)]
Simplify Bonus Formula In History Adjustment
Inspired by a discord message [1] from Vizvezdenec, this patch simplifies the
bonus adjustment bonus = bonus > 0 ? 2 * bonus : bonus / 2 to a constant
addition, maintaining bonus average at around 0 in regular bench. As cj5716
pointed in discord [2], the constant bonus can also be considered as factoring
tempo when calculating bonus, yielding a better value of the move.
xoto10 [Fri, 14 Jun 2024 16:27:09 +0000 (17:27 +0100)]
Consider wider range of moves near leaves.
try to avoid missing good moves for opponent or engine, by updating bestMove
also when value == bestValue (i.e. value == alpha) under certain conditions.
In particular require this is at higher depth in the tree, leaving the logic
near the root unchanged, and only apply randomly. Avoid doing this near mate
scores, leaving mate PVs intact.
Fix upperbound/lowerbound output in multithreaded case
In case a stop is received during multithreaded searches, the PV of the best
thread might be printed without the correct upperbound/lowerbound indicators.
This was due to the pvIdx variable being incremented after receiving the stop.
verifies that all mate PVs printed for finished iterations (i.e. no lower or upper bounds),
are complete, i.e. of the expected length and ending in mate, and do not contain drawing
or illegal moves.
based on a set of 2000 positions and the code in https://github.com/vondele/matetrack
currently extensions can cause depth to exceed MAX_PLY.
This triggers the assert near line 542 in search when running a binary compiled with `debug=yes` on a testcase like:
```
position fen 7K/P1p1p1p1/2P1P1Pk/6pP/3p2P1/1P6/3P4/8 w - - 0 1
go nodes 1000000
```
Disservin [Tue, 4 Jun 2024 20:29:27 +0000 (22:29 +0200)]
Move options into the engine
Move the engine options into the engine class, also avoid duplicated
initializations after startup. UCIEngine needs to register an add_listener to
listen to all option changes and print these. Also avoid a double
initialization of the TT, which was the case with the old state.
Dubslow [Mon, 10 Jun 2024 23:03:36 +0000 (18:03 -0500)]
Simplify TT interface and avoid changing TT info
This commit builds on the work and ideas of #5345, #5348, and #5364.
Place as much as possible of the TT implementation in tt.cpp, rather than in the
header. Some commentary is added to better document the public interface.
Fix the search read-TT races, or at least contain them to within TT methods only.
Tomasz Sobczyk [Thu, 6 Jun 2024 10:47:24 +0000 (12:47 +0200)]
NumaPolicy fixes and robustness improvements
1. Fix GetProcessGroupAffinity still not getting properly aligned memory
sometimes.
2. Fix a very theoretically possible heap corruption if
GetActiveProcessorGroupCount changes between calls.
3. Fully determine affinity on Windows 11 and Windows Server 2022. It
should only ever be indeterminate in case of an error.
4. Separate isDeterminate for old and new API, as they are &'d together
we still can end up with a subset of processors even if one API is
indeterminate.
5. likely_used_old_api() that is based on actual affinity that's been
detected
6. IMPORTANT: Gather affinities at startup, so that we only later use
the affinites set at startup. Not only does this prevent us from our
own calls interfering with detection but it also means subsequent
setoption NumaPolicy calls should behave as expected.
7. Fix ERROR_INSUFFICIENT_BUFFER from GetThreadSelectedCpuSetMasks being
treated like an error.
FauziAkram [Thu, 6 Jun 2024 10:07:45 +0000 (13:07 +0300)]
Tweak pruning formula
Tweak pruning formula, including a constant. I started from an old
yellow patch, if I'm not mistaken by Viz (Unfortunately I lost the link)
where he tried something similar.
I worked on it, trying different variations, until I came up with a good
configuration to pass.
rn5f107s2 [Wed, 5 Jun 2024 16:56:25 +0000 (18:56 +0200)]
Simplify razor changes
Remove razoring changes from
https://github.com/official-stockfish/Stockfish/pull/5360
The mentioned patch introduced the usage of futility_margin into
razoring alongside a tune to futility_margin. It seems the elo gained in
this patch comes from the tune of futility_margin and not the
introduction of futility_margin to razoring, so simplify it away here.
Disservin [Wed, 5 Jun 2024 16:31:11 +0000 (18:31 +0200)]
Update clang-format to version 18
clang-format-18 is available in ubuntu noble(24.04), if you are on
a version lower than that you can use the update script from llvm.
https://apt.llvm.org/
Windows users should be able to download and use clang-format from
their release builds https://github.com/llvm/llvm-project/releases
or get the latest from msys2
https://packages.msys2.org/package/mingw-w64-x86_64-clang.
macOS users can resort to "brew install clang-format".
Viren6 [Wed, 5 Jun 2024 02:24:39 +0000 (03:24 +0100)]
Use futility margin in razoring margin
Uses futilityMargin * depth to set the razoring margin. This retains the
quadratic depth scaling to preserve mate finding capabilities. This patch is
nice because it increases the elo sensitivity of the futility margin
heuristics.
Tomasz Sobczyk [Tue, 4 Jun 2024 10:48:13 +0000 (12:48 +0200)]
Add NumaPolicy "hardware" option that bypasses current processor affinity.
Can be used in case a GUI (e.g. ChessBase 17 see #5307) sets affinity to a
single processor group, but the user would like to use the full capabilities of
the hardware. Improves affinity handling on Windows in case of multiple
available APIs and existing affinities.
Recently when I overhauled these comments, Disservin asked why these
were so much lower: they're a relic from when we had a third QS stage at
-5. Now we don't, so fix these to the obvious place.
I was fairly sure it was nonfunctional but ran the nonreg to be double
sure.
Disservin [Fri, 31 May 2024 08:53:10 +0000 (10:53 +0200)]
Add helpers for managing aligned memory
Previously, we had two type aliases, LargePagePtr and AlignedPtr, which
required manually initializing the aligned memory for the pointer.
The new helpers:
- make_unique_aligned
- make_unique_large_page
are now available for allocating aligned memory (with large pages). They
behave similarly to std::make_unique, ensuring objects allocated with
these functions follow RAII.
The old approach had issues with initializing non-trivial types or
arrays of objects. The evaluation function of the network is now a
unique pointer to an array instead of an array of unique pointers.
Memory related functions have been moved into memory.h
Michael Chaly [Sat, 1 Jun 2024 17:44:06 +0000 (20:44 +0300)]
Adjust return bonus from tt cutoffs at fail highs
This is reintroduction of the recently simplified logic - if positive tt cutoff
occurs return not a tt value but smth between it and beta. Difference is that
instead of static linear combination there we use basically the same formula as
we do in the main search - with the only difference being using tt depth
instead of depth, which makes a lot of sense.
Created by training L1-128 from scratch with:
- skipping based on simple eval in the trainer, for compatibility with
regular binpacks without requiring pre-filtering all binpacks
- minimum simple eval of 950, lower than 1000 previously
- usage of some hse-v1 binpacks with minimum simple eval 1000
- addition of hse-v6 binpacks with minimum simple eval 500
- permuting the FT with 10k positions from fishpack32.binpack
- torch.compile to speed up smallnet training
Training is significantly slower when using non-pre-filtered binpacks due to
the increased skipping required.
rn5f107s2 [Thu, 30 May 2024 19:18:42 +0000 (21:18 +0200)]
MCP more after a bad singular search
The idea is, that if we have the information that the singular search failed low and therefore produced an upperbound score, we can use the score from singularsearch as approximate upperbound as to what bestValue our non ttMoves will produce. If this value is well below alpha, we assume that all non-ttMoves will score below alpha and therfore can skip more moves.
This patch also sets up variables for future patches wanting to use teh singular search result outside of singular extensions, in singularBound and singularValue, meaning further patches using this search result to affect various pruning techniques can be tried.
FauziAkram [Fri, 31 May 2024 01:01:02 +0000 (04:01 +0300)]
Tweak first picked move (ttMove) reduction rule
Tweak first picked move (ttMove) reduction rule:
Instead of always resetting the reduction to 0, we now only do so if the current reduction is less than 2.
If the current reduction is 2 or more, we decrease it by 2 instead.
Tomasz Sobczyk [Thu, 30 May 2024 10:56:44 +0000 (12:56 +0200)]
Fix process' processor affinity determination on Windows.
Specialize and privatize NumaConfig::get_process_affinity.
Only enable NUMA capability for 64-bit Windows.
Following #5307 and some more testing it was determined that the way affinity
was being determined on Windows was incorrect, based on incorrect assumptions
about GetNumaProcessorNodeEx.
This patch fixes the issue by attempting to retrieve the actual process'
processor affinity using Windows API. However one issue persists that is not
addressable due to limitations of Windows, and will have to be considered a
limitation. If affinities were set using SetThreadAffinityMask instead of
SetThreadSelectedCpuSetMasks and GetProcessGroupAffinity returns more than 1
group it is NOT POSSIBLE to determine the affinity programmatically on Windows.
In such case the implementation assumes no affinites are set and will consider
all processors available for execution.
Michael Chaly [Thu, 30 May 2024 16:27:12 +0000 (19:27 +0300)]
Allow tt cutoffs for shallower depths in certain conditions
Current master allows tt cutoffs only when depth
from tt is strictly greater than current node
depth. This patch also allows them when it's equal
and if tt value is lower or equal to beta.
This PR updates the internal WDL model, using data from 2.5M games played by SF-dev (3c62ad7).
Note that the normalizing constant has increased from 329 to 368.
Changes to the fitting procedure:
* the value for --materialMin was increased from 10 to 17: including data with less material leads to less accuracy for larger material count values
* the data was filtered to only include single thread LTC games at 60+0.6
* the data was filtered to only include games from master against patches that are (approximatively) within 5 nElo of master
For more information and plots of the model see PR#5309
Created by further tuning the spsa-tuned main net `nn-c721dfca8cd3.nnue`
with the same methods described in https://github.com/official-stockfish/Stockfish/pull/5254
This net was reached at 61k / 120k spsa games at 70+0.7 th 7:
https://tests.stockfishchess.org/tests/view/665639d0a86388d5e27dd259
xoto10 [Tue, 28 May 2024 18:40:40 +0000 (19:40 +0100)]
Add compensation factor to adjust extra time according to time control
As stockfish nets and search evolve, the existing time control appears
to give too little time at STC, roughly correct at LTC, and too little
at VLTC+.
This change adds an adjustment to the optExtra calculation. This
adjustment is easy to retune and refine, so it should be easier to keep
up-to-date than the more complex calculations used for optConstant and
optScale.
This simplification patch merges the pawn count terms in the eval
formula with the material term, updating the offset constant for
the nnue part of the formula from 34000 to 34300 because the average
pawn count in middlegame positions evaluated during search is around 8.
Tomasz Sobczyk [Fri, 17 May 2024 10:10:31 +0000 (12:10 +0200)]
Improve performance on NUMA systems
Allow for NUMA memory replication for NNUE weights. Bind threads to ensure execution on a specific NUMA node.
This patch introduces NUMA memory replication, currently only utilized for the NNUE weights. Along with it comes all machinery required to identify NUMA nodes and bind threads to specific processors/nodes. It also comes with small changes to Thread and ThreadPool to allow easier execution of custom functions on the designated thread. Old thread binding (WinProcGroup) machinery is removed because it's incompatible with this patch. Small changes to unrelated parts of the code were made to ensure correctness, like some classes being made unmovable, raw pointers replaced with unique_ptr. etc.
Windows 7 and Windows 10 is partially supported. Windows 11 is fully supported. Linux is fully supported, with explicit exclusion of Android. No additional dependencies.
-----------------
A new UCI option `NumaPolicy` is introduced. It can take the following values:
```
system - gathers NUMA node information from the system (lscpu or windows api), for each threads binds it to a single NUMA node
none - assumes there is 1 NUMA node, never binds threads
auto - this is the default value, depends on the number of set threads and NUMA nodes, will only enable binding on multinode systems and when the number of threads reaches a threshold (dependent on node size and count)
[[custom]] -
// ':'-separated numa nodes
// ','-separated cpu indices
// supports "first-last" range syntax for cpu indices,
for example '0-15,32-47:16-31,48-63'
```
Setting `NumaPolicy` forces recreation of the threads in the ThreadPool, which in turn forces the recreation of the TT.
The threads are distributed among NUMA nodes in a round-robin fashion based on fill percentage (i.e. it will strive to fill all NUMA nodes evenly). Threads are bound to NUMA nodes, not specific processors, because that's our only requirement and the OS can schedule them better.
Special care is made that maximum memory usage on systems that do not require memory replication stays as previously, that is, unnecessary copies are avoided.
On linux the process' processor affinity is respected. This means that if you for example use taskset to restrict Stockfish to a single NUMA node then the `system` and `auto` settings will only see a single NUMA node (more precisely, the processors included in the current affinity mask) and act accordingly.
-----------------
We can't ensure that a memory allocation takes place on a given NUMA node without using libnuma on linux, or using appropriate custom allocators on windows (https://learn.microsoft.com/en-us/windows/win32/memory/allocating-memory-from-a-numa-node), so to avoid complications the current implementation relies on first-touch policy. Due to this we also rely on the memory allocator to give us a new chunk of untouched memory from the system. This appears to work reliably on linux, but results may vary.
MacOS is not supported, because AFAIK it's not affected, and implementation would be problematic anyway.
Windows is supported since Windows 7 (https://learn.microsoft.com/en-us/windows/win32/api/processtopologyapi/nf-processtopologyapi-setthreadgroupaffinity). Until Windows 11/Server 2022 NUMA nodes are split such that they cannot span processor groups. This is because before Windows 11/Server 2022 it's not possible to set thread affinity spanning processor groups. The splitting is done manually in some cases (required after Windows 10 Build 20348). Since Windows 11/Server 2022 we can set affinites spanning processor group so this splitting is not done, so the behaviour is pretty much like on linux.
Linux is supported, **without** libnuma requirement. `lscpu` is expected.
Linmiao Xu [Fri, 24 May 2024 14:58:13 +0000 (10:58 -0400)]
Lower smallnet threshold with tuned eval params
The smallnet threshold is now below the training data range
of the current smallnet (simple eval diff > 1k, nn-baff1edelf90.nnue)
when no pawns are on the board.
Params found with spsa at 93k / 120k games at 60+06:
https://tests.stockfishchess.org/tests/view/664fa166a86388d5e27d7d6b
Tuned on top of: https://github.com/official-stockfish/Stockfish/pull/5287