xoto10 [Tue, 28 May 2024 18:40:40 +0000 (19:40 +0100)]
Add compensation factor to adjust extra time according to time control
As stockfish nets and search evolve, the existing time control appears
to give too little time at STC, roughly correct at LTC, and too little
at VLTC+.
This change adds an adjustment to the optExtra calculation. This
adjustment is easy to retune and refine, so it should be easier to keep
up-to-date than the more complex calculations used for optConstant and
optScale.
This simplification patch merges the pawn count terms in the eval
formula with the material term, updating the offset constant for
the nnue part of the formula from 34000 to 34300 because the average
pawn count in middlegame positions evaluated during search is around 8.
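The arithmetic behind the new constant, sketched with an illustrative per-pawn coefficient (the committed coefficient may differ): with about 8 pawns on the board on average, a per-pawn term of roughly 37 folds into the offset, since 34000 + 8 * 37 ≈ 34300.

```
// Illustrative only: folding the average pawn-count contribution into the
// nnue offset constant, so the separate pawn term can be dropped.
// Before: 34000 + material + pawnCoeff * pawnCount  (pawnCount averages ~8)
// After:  34300 + material
int nnue_offset(int material) {
    return 34300 + material;
}
```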
Tomasz Sobczyk [Fri, 17 May 2024 10:10:31 +0000 (12:10 +0200)]
Improve performance on NUMA systems
Allow for NUMA memory replication for NNUE weights. Bind threads to ensure execution on a specific NUMA node.
This patch introduces NUMA memory replication, currently utilized only for the NNUE weights. Along with it comes all the machinery required to identify NUMA nodes and bind threads to specific processors/nodes. It also includes small changes to Thread and ThreadPool to allow easier execution of custom functions on a designated thread. The old thread binding (WinProcGroup) machinery is removed because it is incompatible with this patch. Small changes to unrelated parts of the code were made to ensure correctness, such as making some classes unmovable and replacing raw pointers with unique_ptr.
Windows 7 and Windows 10 are partially supported. Windows 11 is fully supported. Linux is fully supported, with explicit exclusion of Android. No additional dependencies.
-----------------
A new UCI option `NumaPolicy` is introduced. It can take the following values:
```
system - gathers NUMA node information from the system (lscpu or the Windows API); each thread is bound to a single NUMA node
none - assumes there is 1 NUMA node, never binds threads
auto - the default value; depending on the number of threads and NUMA nodes, binding is enabled only on multi-node systems and only when the number of threads reaches a threshold (dependent on node size and count)
[[custom]] -
// ':'-separated numa nodes
// ','-separated cpu indices
// supports "first-last" range syntax for cpu indices,
// for example '0-15,32-47:16-31,48-63'
```
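Reading the custom syntax concretely, the example string above declares two NUMA nodes: node 0 covering CPUs 0-15 and 32-47, and node 1 covering CPUs 16-31 and 48-63. It is set like any other UCI option:

```
setoption name NumaPolicy value 0-15,32-47:16-31,48-63
```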
Setting `NumaPolicy` forces recreation of the threads in the ThreadPool, which in turn forces the recreation of the TT.
The threads are distributed among NUMA nodes in a round-robin fashion based on fill percentage (i.e. it will strive to fill all NUMA nodes evenly). Threads are bound to NUMA nodes, not specific processors, because that's our only requirement and the OS can schedule them better.
Special care is taken that maximum memory usage on systems that do not require memory replication stays as before; unnecessary copies are avoided.
On Linux the process's processor affinity is respected. This means that if you, for example, use taskset to restrict Stockfish to a single NUMA node, then the `system` and `auto` settings will only see a single NUMA node (more precisely, the processors included in the current affinity mask) and act accordingly.
-----------------
We can't ensure that a memory allocation takes place on a given NUMA node without using libnuma on Linux, or appropriate custom allocators on Windows (https://learn.microsoft.com/en-us/windows/win32/memory/allocating-memory-from-a-numa-node), so to avoid complications the current implementation relies on the first-touch policy. Because of this we also rely on the memory allocator to give us a fresh chunk of untouched memory from the system. This appears to work reliably on Linux, but results may vary.
macOS is not supported because, as far as we know, it is not affected, and an implementation would be problematic anyway.
Windows is supported since Windows 7 (https://learn.microsoft.com/en-us/windows/win32/api/processtopologyapi/nf-processtopologyapi-setthreadgroupaffinity). Before Windows 11/Server 2022, NUMA nodes are split such that they cannot span processor groups, because on those versions it is not possible to set thread affinity spanning processor groups. The splitting is done manually in some cases (required after Windows 10 Build 20348). Since Windows 11/Server 2022 affinities can span processor groups, so this splitting is not done and the behaviour is pretty much like on Linux.
Linux is supported, **without** libnuma requirement. `lscpu` is expected.
Linmiao Xu [Fri, 24 May 2024 14:58:13 +0000 (10:58 -0400)]
Lower smallnet threshold with tuned eval params
The smallnet threshold is now below the training data range
of the current smallnet (simple eval diff > 1k, nn-baff1edelf90.nnue)
when no pawns are on the board.
Params found with spsa at 93k / 120k games at 60+0.6:
https://tests.stockfishchess.org/tests/view/664fa166a86388d5e27d7d6b
Tuned on top of: https://github.com/official-stockfish/Stockfish/pull/5287
Muzhen Gaming [Thu, 23 May 2024 00:28:46 +0000 (08:28 +0800)]
VVLTC search tune
Parameters were tuned in 2 stages:
1. 127k games at VVLTC:
https://tests.stockfishchess.org/tests/view/6649f8dfb8fa20e74c39f52a.
2. 106k games at VVLTC:
https://tests.stockfishchess.org/tests/view/664bfb77830eb9f886615a9d.
Muzhen Gaming [Wed, 22 May 2024 01:09:04 +0000 (09:09 +0800)]
Revert "Reduce When TTValue is Above Alpha"
The patch regressed significantly at longer time controls. In
particular, the `depth--` behavior was predicted to scale badly based on
data from other variations of the patch.
cj5716 [Sun, 19 May 2024 05:15:42 +0000 (13:15 +0800)]
Optimise pairwise multiplication
This speedup was first inspired by a comment by @AndyGrant on my recent PR: "If mullo_epi16 would preserve the signedness, then this could be used to remove 50% of the max operations during the halfkp-pairwise mat-mul relu deal."
That got me thinking, because although mullo_epi16 did not preserve the
signedness, mulhi_epi16 did, and so we could shift left and then use
mulhi_epi16, instead of shifting right after the mullo.
However, due to some issues with shifting into the sign bit, the FT
weights and biases had to be multiplied by 2 for the optimisation to
work.
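A rough AVX2 sketch of the trick (the shift amounts are illustrative; the real kernel lives in the NNUE feature transformer): `_mm256_mulhi_epi16` returns the signed high 16 bits of the 32-bit product, so pre-shifting one operand left by S computes `(a * b) >> (16 - S)` with the sign preserved, removing the post-multiply shift and the extra max operations.

```
#include <immintrin.h>

// Baseline: low half of the product, then a logical right shift.
__m256i pairwise_base(__m256i a, __m256i b) {
    return _mm256_srli_epi16(_mm256_mullo_epi16(a, b), 7);  // shift illustrative
}

// Optimised: (a << 9) * b >> 16 == (a * b) >> 7, sign preserved by mulhi.
// Per the commit, issues with shifting into the sign bit are what forced
// the FT weights and biases to be multiplied by 2.
__m256i pairwise_mulhi(__m256i a, __m256i b) {
    return _mm256_mulhi_epi16(_mm256_slli_epi16(a, 9), b);
}
```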
Speedup on "Arch=x86-64-bmi2 COMP=clang", courtesy of @Torom
Result of 50 runs
base (...es/stockfish) = 962946 +/- 1202
test (...ise-max-less) = 979696 +/- 1084
diff = +16750 +/- 1794
speedup = +0.0174
P(speedup > 0) = 1.0000
CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on
Also a speedup on "COMP=gcc", courtesy of Torom once again
Result of 50 runs
base (...tockfish_gcc) = 966033 +/- 1574
test (...max-less_gcc) = 983319 +/- 1513
diff = +17286 +/- 2515
speedup = +0.0179
P(speedup > 0) = 1.0000
CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on
Viren6 [Sun, 19 May 2024 01:58:01 +0000 (02:58 +0100)]
Addition of new scaling comments
This patch is intended to prevent patches like 9b90cd8 and the subsequent reversion e3c9ed7 from happening again. The scaling behaviour of the reduction adjustments in the non-linear scaling section has been proven to >8 sigma:
The else-if condition is moved to the non-scaling section based on: https://tests.stockfishchess.org/tests/view/664567a193ce6da3e93b3232 (it has no proven scaling).
General comment improvements and removal of a redundant margin condition
have also been included.
Also "fix" movepicker to allow depths between CHECKS and NO_CHECKS,
which makes them easier to tweak (not that they get tweaked hardly ever)
(This was more beneficial when there was a third stage to DEPTH_QS, but
it's still an improvement now)
Second spsa with 30k / 120k games at 60+0.6:
https://tests.stockfishchess.org/tests/view/664be227830eb9f886615a36
Values found at 10k games at 60+0.6 also passed STC and LTC:
https://tests.stockfishchess.org/tests/view/664bf4bd830eb9f886615a72
https://tests.stockfishchess.org/tests/view/664c0905830eb9f886615abf
FauziAkram [Mon, 20 May 2024 23:19:54 +0000 (02:19 +0300)]
Refine Evaluation Scaling with Piece-Specific Weights
Refine evaluation scaling with piece-specific weights instead of the simplified npm method. I took the initial idea from Viren6, who worked on it in September of last year. I then worked on it and tuned it, and it has now passed both tests.
Michael Chaly [Mon, 20 May 2024 21:40:55 +0000 (00:40 +0300)]
Rescale pawn history updates
This patch is somewhat of a continuation of recent pawn history gainers.
It halves the size of pawn history updates after search. Since on average these updates make pawn history more negative, the offset is changed to a lower value so that the average value stays approximately the same.
Michael Chaly [Mon, 20 May 2024 00:22:40 +0000 (03:22 +0300)]
Update correction history in case of successful null move pruning
Since null move pruning searches the same position, it makes some sense to also update the correction history there in case of a fail high. The update value is 4 times smaller than a normal update.
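A toy sketch of the mechanic (the gravity formula and bounds follow common history-update conventions, not necessarily the committed code):

```
#include <algorithm>
#include <cstdlib>

// Toy correction-history update with the usual gravity term.
void update_correction(int& entry, int bonus) {
    entry += bonus - entry * std::abs(bonus) / 1024;  // limits illustrative
    entry = std::clamp(entry, -1024, 1024);
}

// After a successful null move (fail high), apply the same update
// at a quarter of the normal weight:
//   update_correction(corrEntry, bonus / 4);
```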
Linmiao Xu [Sun, 19 May 2024 18:01:49 +0000 (14:01 -0400)]
Re-eval only if smallnet output flips from simple eval
Recent attempts to change the smallnet nnue re-eval
threshold did not show much elo difference:
https://tests.stockfishchess.org/tests/view/664a29bb25a9058c4d21d53c
https://tests.stockfishchess.org/tests/view/664a299925a9058c4d21d53a
Michael Chaly [Sun, 19 May 2024 15:48:43 +0000 (18:48 +0300)]
Do more aggressive pawn history updates
A tweak of the recent patch that updated pawn history for a move that caused a fail low and set its default value to -900. This patch makes it more aggressive: twice bigger updates and a default value of -1100.
Tweak continuation history bonus depending on ply.
This patch is based on the following tuning, https://tests.stockfishchess.org/tests/view/6648b2eb308cceea45533abe, using only the tuned factors for the continuation history.
The unusual result of (combined) +12.0 +- 3.7 in the 2 VVLTC simplification SPRTs that were run was caused by the base having only 64MB of hash instead of 512MB (asymmetric hash).
Vizvezdenec was the one to notice this.
FauziAkram [Fri, 17 May 2024 22:22:41 +0000 (01:22 +0300)]
Early Exit in Bitboards::sliding_attack()
The original code checks for occupancy in the loop condition. By moving this check inside the loop body and adding an early exit, we avoid unnecessary iterations once a blocking piece is encountered.
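A self-contained sketch of the restructured loop (coordinate handling simplified relative to the real bitboard.cpp); rook directions would be passed as {{1,0},{-1,0},{0,1},{0,-1}}:

```
#include <cstdint>

using Bitboard = uint64_t;

// Simplified sliding_attack(): walk each ray, add the square, and stop as
// soon as a blocker is found instead of re-testing in the loop condition.
Bitboard sliding_attack(const int (*dirs)[2], int nDirs, int sq, Bitboard occupied) {
    Bitboard attacks = 0;
    for (int i = 0; i < nDirs; ++i) {
        int f = sq % 8 + dirs[i][0], r = sq / 8 + dirs[i][1];
        while (0 <= f && f < 8 && 0 <= r && r < 8) {
            int s = 8 * r + f;
            attacks |= 1ULL << s;
            if (occupied & (1ULL << s))
                break;  // early exit: nothing is visible past a blocker
            f += dirs[i][0];
            r += dirs[i][1];
        }
    }
    return attacks;
}
```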
Created by first retraining the spsa-tuned main net `nn-ae6a388e4a1a.nnue` with:
- using v6-dd data without bestmove captures removed
- addition of T80 mar2024 data
- increasing loss by 20% when Q is too high
- torch.compile changes for marginal training speed gains
And then SPSA tuning weights of epoch 899 following methods described in:
https://github.com/official-stockfish/Stockfish/pull/5149
This net was reached at 92k out of 120k steps in this 70+0.7 th7 SPSA tuning run:
https://tests.stockfishchess.org/tests/view/66413b7df9f4e8fc783c9bbb
Thanks to @Viren6 for suggesting usage of:
- c value 4 for the weights
- c value 128 for the biases
Scripts for automating applying fishtest spsa params to exporting tuned .nnue are in:
https://github.com/linrock/nnue-tools/tree/master/spsa
Reduce more when improving and ttvalue is lower than alpha
More reduction if the position is improving but the value from the TT does not exceed alpha and the tt move is excluded.
This idea is based on the following LMR condition tuning, https://tests.stockfishchess.org/tests/view/66423a1bf9f4e8fc783cba37, using only three of the four largest terms, P[3], P[18] and P[12].
Michael Chaly [Tue, 14 May 2024 17:10:01 +0000 (20:10 +0300)]
Add extra bonus to pawn history for a move that caused a fail low
Basically the same idea as for continuation/main history, but with some tweaks.
1) It uses a *2 multiplier for the bonus instead of a full/half bonus - for whatever reason this seems to work better;
2) Attempts with this type of big bonus scaled somewhat poorly (or were unlucky at longer time controls), but after measuring that the average value of pawn history in LMR increased substantially after adding these bonuses (for multiplier 1.5 it increased by roughly 400 out of the 8192 cap), attempts were made to make the default pawn history negative to compensate - and the version with multiplier 2 and initial fill value -900 passed.
xoto10 [Mon, 13 May 2024 06:19:18 +0000 (07:19 +0100)]
Use 5% less time on first move
Stockfish appears to take too much time on the first move of a game and then not enough on moves 2, 3, 4... This is probably because most of the factors that increase time tend to apply on the first move.
Attempts to give more time to the subsequent moves have not worked so
far, but this change to simply reduce first move time by 5% worked.
Linmiao Xu [Thu, 9 May 2024 18:03:35 +0000 (14:03 -0400)]
Re-evaluate some small net positions for more accurate evals
Use main net evals when small net evals hint that higher eval
accuracy may be worth the slower eval speeds. With Finny caches,
re-evals with the main net are less expensive than before.
Original idea by mstembera, whom I've added as co-author of this PR.
Shawn Xu [Wed, 8 May 2024 21:26:01 +0000 (14:26 -0700)]
Simplify Away Negative Extension
This patch simplifies away the negative extension applied when the value returned by the transposition table is assumed to fail low relative to the value of the reduced search.
Michael Chaly [Mon, 6 May 2024 17:18:12 +0000 (20:18 +0300)]
Simplify away conthist 3 from statscore
Following a previous Elo gainer that made conthist 3 less important in pruning, this patch simplifies away this history from the statScore calculation.
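A hedged sketch of the shape of the change (the offset is illustrative):

```
// statScore blend after the simplification: main history plus
// conthist 0 and 1; the conthist 3 term is gone.
int stat_score(int mainHist, int ch0, int ch1 /*, int ch3: removed */) {
    return 2 * mainHist + ch0 + ch1 - 4000;  // offset illustrative
}
```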
MinetaS [Tue, 7 May 2024 18:26:09 +0000 (03:26 +0900)]
Fix nodestime
1. The current time management system utilizes limits.inc and limits.time, which can represent either milliseconds or a node count, depending on whether the nodestime option is active. There have been several modifications that brought Elo gains for typical uses (i.e. real-time matches); however, some of these changes overlooked this distinction. This patch adjusts constants and multiplications/divisions to more accurately simulate real TC conditions when nodestime is used.
2. The advance_nodes_time function has a bug that can extend the time limit when availableNodes reaches exactly zero. This patch fixes the bug by initializing the variable to -1 and making sure it does not go below zero.
3. An elapsed_time function is newly introduced to print the PV in the UCI output based on real time. This makes the PV output more consistent with the behavior seen in typical use cases.
rn5f107s2 [Wed, 8 May 2024 20:08:56 +0000 (22:08 +0200)]
IIR on cutnodes if there is a ttMove but the ttBound is upper
If there is an upper bound stored in the transposition table but we still have a ttMove, the upper bound indicates that the last time the ttMove was tried, it failed low. This fail low suggests the ttMove may not be good, so this patch introduces a depth reduction of one for cutnodes with such ttMoves.
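A hedged sketch of the condition (names follow Stockfish conventions; gating details may differ):

```
enum Bound { BOUND_NONE, BOUND_UPPER, BOUND_LOWER, BOUND_EXACT };

// Reduce cutnodes whose ttMove previously failed low, i.e. the
// stored TT bound is an upper bound.
int iir_adjust(int depth, bool cutNode, bool hasTtMove, Bound ttBound) {
    if (cutNode && hasTtMove && ttBound == BOUND_UPPER)
        return depth - 1;
    return depth;
}
```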
Michael Chaly [Wed, 8 May 2024 18:59:03 +0000 (21:59 +0300)]
Refactor quiet moves pruning in qsearch
Make the formula more in line with what we use in search. The current formula is more or less the one we used in search years ago, but it has since been remade; this patch remakes the qsearch formula to be almost exactly the same as the one in search: a sum of conthist 0, conthist 1 and pawn structure history.
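A minimal sketch of the remade condition (the threshold is illustrative):

```
// Prune a quiet move in qsearch when the combined history signal is
// sufficiently negative, mirroring the formula used in search.
bool prune_quiet_in_qsearch(int ch0, int ch1, int pawnHist) {
    return ch0 + ch1 + pawnHist <= -3000;  // threshold illustrative
}
```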
FauziAkram [Tue, 7 May 2024 12:03:58 +0000 (15:03 +0300)]
Tweak reduction formula based on depth
The idea came to me while checking for trends in the megafauzi tunes, since the values of the divisor for this specific formula were as follows:
stc: 15990
mtc: 16117
ltc: 14805
vltc: 12719
new vltc passed by Muzhen: 12076
This shows a clear trend related to time control: the higher it is, the lower the optimal value for the divisor seems to be.
So I tried a simple formula, using educated guesses based on some calculations. Tests show it works pretty well, and it can still be tuned further at VLTC in the future to scale even better.
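The committed formula is not reproduced here, but its shape can be sketched with purely illustrative constants: a divisor that shrinks as depth grows (depth being a proxy for time control), following the tuned trend above.

```
#include <cmath>

// Purely illustrative: the divisor decreases with depth, matching the
// trend from stc (~16k) down to vltc (~12k).
int reduction_divisor(int depth) {
    return int(16500.0 - 1200.0 * std::log(double(depth)));
}
```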
Muzhen Gaming [Sat, 4 May 2024 23:36:48 +0000 (07:36 +0800)]
VVLTC search tune
This patch is the result of two tuning stages:
1. ~32k games at 60+0.6 th8:
https://tests.stockfishchess.org/tests/view/662d9dea6115ff6764c7f817
2. ~193k games at 80+0.8 th6, based on PR #5211:
https://tests.stockfishchess.org/tests/view/663587e273559a8aa857ca00.
Based on extensive VVLTC tuning and testing both before and after #5211, it is observed that the introduction of the new extensions positively affected the search tune results.
cj5716 [Wed, 1 May 2024 10:31:38 +0000 (18:31 +0800)]
Some history fixes and tidy-up
This adds the functions `update_refutations` and `update_quiet_histories` to better distinguish the two. `update_quiet_stats` now just calls both of these functions.
The functional side of this patch is two-fold:
1. Stop refutations being updated when we carry out multicut
2. Update pawn history every time we update other quiet histories
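A minimal sketch of the resulting structure (types and signatures simplified; the real functions update killers, countermoves and several history tables):

```
struct SearchState { int killer, counterMove, mainHist, contHist, pawnHist; };

// Refutation updates are now separate from quiet history updates, and
// pawn history is updated whenever the other quiet histories are.
void update_refutations(SearchState& s, int move) {
    s.killer = s.counterMove = move;   // no longer called on multicut
}
void update_quiet_histories(SearchState& s, int bonus) {
    s.mainHist += bonus;
    s.contHist += bonus;
    s.pawnHist += bonus;               // now updated every time
}
void update_quiet_stats(SearchState& s, int move, int bonus) {
    update_refutations(s, move);
    update_quiet_histories(s, bonus);
}
```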
Viren6 [Sat, 4 May 2024 16:29:23 +0000 (17:29 +0100)]
Introduce Quadruple Extensions
This patch introduces quadruple extensions, with the new condition of not ttPv. It also generalises all margins, so that extensions can still occur if conditions are only partially fulfilled, but with a stricter margin.
Michael Chaly [Sat, 4 May 2024 07:33:26 +0000 (10:33 +0300)]
Add extra bonuses to some moves that forced a fail low
The previous patch on this idea gave bonuses to these moves if the best value of the search was far below the current static evaluation.
This patch does a similar thing but adds an extra bonus when the best value of the search is far below the static evaluation before the previous move.
1) Fixes a bug introduced in https://github.com/official-stockfish/Stockfish/pull/5194. Only one psqtOnly flag was used for the two perspectives, which caused wrong entries to be cleared and marked.
2) The Finny caches should be cleared like histories, and not at the start of every search.
Saves a (currently) 800 KB allocation and deallocation when running `eval`; not particularly significant, and zero impact on play, but not necessary either.
Use capture history to better judge which sacrifices to explore
This idea has been bouncing around for a while. @Vizvezdenec tried it a couple of years ago in Stockfish without results, but its recent arrival in Ethereal inspired him, and thence me, to try it afresh in Stockfish.
(Also factor out the now-common code with futpruning for captures.)
More reduction at cut nodes which are not a former PV node
But the tt move and first killer are excluded.
This idea is based on the following LMR condition tuning, https://tests.stockfishchess.org/tests/view/66228bed3fe04ce4cefc0c71, using only the two largest terms, P[0] and P[1].
For each thread persist an accumulator cache for the network, where each
cache contains multiple entries for each of the possible king squares.
When the accumulator needs to be refreshed, the cached entry is used to more
efficiently update the accumulator, instead of rebuilding it from scratch.
This idea was first described by Luecx (the author of Koivisto) and is commonly referred to as "Finny Tables".
When the accumulator needs to be refreshed, instead of filling it with
biases and adding every piece from scratch, we...
1. Take the `AccumulatorRefreshEntry` associated with the new king bucket
2. Calculate the features to activate and deactivate (from differences
between bitboards in the entry and bitboards of the actual position)
3. Apply the updates on the refresh entry
4. Copy the content of the refresh entry accumulator to the accumulator
we were refreshing
5. Copy the bitboards from the position to the refresh entry, to match
the newly updated accumulator
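A compact toy model of those five steps (GCC/Clang builtins; the array sizes and the weight function stand in for the real network data):

```
#include <algorithm>
#include <cstdint>

using Bitboard = uint64_t;

constexpr int PIECE_NB = 12, SIZE = 8;  // toy sizes

// Toy AccumulatorRefreshEntry: real entries are kept per king bucket.
struct RefreshEntry {
    int      acc[SIZE]        = {};  // cached accumulator
    Bitboard pieces[PIECE_NB] = {};  // bitboards the accumulator was built from
};

int weight(int p, int sq) { return (p + 1) * (sq % 8 + 1); }  // stand-in weights

void refresh(RefreshEntry& e, const Bitboard (&pos)[PIECE_NB], int (&out)[SIZE]) {
    for (int p = 0; p < PIECE_NB; ++p) {
        Bitboard added   = pos[p] & ~e.pieces[p];  // step 2: features to activate
        Bitboard removed = e.pieces[p] & ~pos[p];  //         and to deactivate
        for (; added; added &= added - 1) {        // step 3: apply the updates
            int sq = __builtin_ctzll(added);
            for (int& v : e.acc) v += weight(p, sq);
        }
        for (; removed; removed &= removed - 1) {
            int sq = __builtin_ctzll(removed);
            for (int& v : e.acc) v -= weight(p, sq);
        }
        e.pieces[p] = pos[p];                      // step 5: sync the bitboards
    }
    std::copy(std::begin(e.acc), std::end(e.acc), std::begin(out));  // step 4
}
```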
Parameter tune, also adding another tunable parameter (npmDiv) that can vary for different nets (bignet, smallnet, psqtOnly smallnet). P.S.: The changed values are only the parameters where there is agreement among the different time controls; in other words, the tunings tell us that changing these specific values in this specific direction is good at all time controls, so there shouldn't be a high risk of regressing at longer time controls.
Previously it was possible to get the node counter even after running a bench with perft, e.g. `./stockfish bench 1 1 5 current perft`, caused by a small regression from the UCI refactoring.
We change the definition of "age" from "age of this position" to "age of this TT entry".
In this way, even when we are at the same position, when we save into the TT we always prefer the new entry over the old one.
It makes more sense not to (potentially) change the developer's ASLR entropy setting to make the test run through. This should be an active choice, even if the test might then fail locally for them.
The recent refactoring has shown some limitations of our testing, hence we add a couple more tests, including:
* expected mate score
* expected mated score
* expected in TB win score
* expected in TB loss score
* expected info line output
* expected info line output (wdl)