Alayan Feh (Alayan-stk-2)
Alexander Kure
Alexander Pagel (Lolligerhans)
+Alfredo Menezes (lonfom169)
Ali AlZhrani (Cooffe)
Andrew Grant (AndyGrant)
Andrey Neporada (nepal)
Dariusz Orzechowski (dorzechowski)
David Zar
Daylen Yang (daylen)
+Deshawn Mohan-Smith (GoldenRare)
DiscanX
Dominik Schlösser (domschl)
double-beep
Jerry Donald Watson (jerrydonaldwatson)
jjoshua2
Jonathan Calovski (Mysseno)
-Jonathan Dumale (SFisGOD)
+Jonathan Buladas Dumale (SFisGOD)
Joost VandeVondele (vondele)
Jörg Oster (joergoster)
Joseph Ellis (jhellis3)
marotear
Matthew Lai (matthewlai)
Matthew Sullivan (Matt14916)
+Maxim Molchanov (Maxim)
Michael An (man)
Michael Byrne (MichaelB7)
Michael Chaly (Vizvezdenec)
The Stockfish engine features two evaluation functions for chess: the classical
evaluation based on handcrafted terms, and the NNUE evaluation based on efficiently
-updateable neural networks. The classical evaluation runs efficiently on almost all
+updatable neural networks. The classical evaluation runs efficiently on almost all
CPU architectures, while the NNUE evaluation benefits from the vector
intrinsics available on most CPUs (sse2, avx2, neon, or similar).
* #### SyzygyProbeDepth
Minimum remaining search depth for which a position is probed. Set this option
- to a higher value to probe less agressively if you experience too much slowdown
+ to a higher value to probe less aggressively if you experience too much slowdown
(in terms of nps) due to TB probing.
* #### Syzygy50MoveRule
of various chess concepts, handcrafted by experts, tested and tuned using fishtest.
The NNUE evaluation computes this value with a neural network based on basic
inputs (e.g. piece positions only). The network is optimized and trained
-on the evalutions of millions of positions at moderate search depth.
+on the evaluations of millions of positions at moderate search depth.
The NNUE evaluation was first introduced in shogi, and ported to Stockfish afterward.
It can be evaluated efficiently on CPUs, and exploits the fact that only parts
If the engine is searching a position that is not in the tablebases (e.g.
a position with 8 pieces), it will access the tablebases during the search.
-If the engine reports a very large score (typically 153.xx), this means
-that it has found a winning line into a tablebase position.
+If the engine reports a very large score (typically 153.xx), this means
+it has found a winning line into a tablebase position.
If the engine is given a position to search that is in the tablebases, it
will use the tablebases at the beginning of the search to preselect all
taking into account the 50-move rule.
It will then perform a search only on those moves. **The engine will not move
immediately**, unless there is only a single good move. **The engine likely
-will not report a mate score even if the position is known to be won.**
+will not report a mate score, even if the position is known to be won.**
It is therefore clear that this behaviour is not identical to what one might
be used to with Nalimov tablebases. There are technical reasons for this
Stockfish supports large pages on Linux and Windows. Large pages make
the hash access more efficient, improving the engine speed, especially
-on large hash sizes. Typical increases are 5..10% in terms of nps, but
-speed increases up to 30% have been measured. The support is
+on large hash sizes. Typical increases are 5..10% in terms of nodes per
+second, but speed increases up to 30% have been measured. The support is
automatic. Stockfish attempts to use large pages when available and
will fall back to regular memory allocation when this is not the case.
Large page support on Linux is obtained by the Linux kernel
transparent huge pages functionality. Typically, transparent huge pages
-are already enabled and no configuration is needed.
+are already enabled, and no configuration is needed.
### Support on Windows
The use of large pages requires the "Lock Pages in Memory" privilege. See
[Enable the Lock Pages in Memory Option (Windows)](https://docs.microsoft.com/en-us/sql/database-engine/configure-windows/enable-the-lock-pages-in-memory-option-windows)
-on how to enable this privilege. Logout/login may be needed
-afterwards. Due to memory fragmentation, it may not always be
-possible to allocate large pages even when enabled. A reboot
-might alleviate this problem. To determine whether large pages
-are in use, see the engine log.
+on how to enable this privilege, then run [RAMMap](https://docs.microsoft.com/en-us/sysinternals/downloads/rammap)
+to double-check that large pages are used. We suggest that you reboot
+your computer after you have enabled large pages, because long Windows
+sessions suffer from memory fragmentation, which may prevent Stockfish
+from getting large pages: a fresh session is better in this regard.
## Compiling Stockfish yourself from the sources
```
cd src
make help
- make build ARCH=x86-64-modern
make net
+ make build ARCH=x86-64-modern
```
-When not using the Makefile to compile (for instance with Microsoft MSVC) you
+When not using the Makefile to compile (for instance, with Microsoft MSVC) you
need to manually set/unset some switches in the compiler command line; see
file *types.h* for a quick reference.
## Terms of use
Stockfish is free, and distributed under the **GNU General Public License version 3**
-(GPL v3). Essentially, this means that you are free to do almost exactly
+(GPL v3). Essentially, this means you are free to do almost exactly
what you want with the program, including distributing it among your
-friends, making it available for download from your web site, selling
+friends, making it available for download from your website, selling
it (either by itself or as part of some bigger software package), or
using it as the starting point for a software project of your own.
endif
endif
-ifeq ($(comp),icc)
- profile_make = icc-profile-make
- profile_use = icc-profile-use
-else
-ifeq ($(comp),clang)
- profile_make = clang-profile-make
- profile_use = clang-profile-use
-else
- profile_make = gcc-profile-make
- profile_use = gcc-profile-use
-endif
-endif
-
ifeq ($(KERNEL),Darwin)
CXXFLAGS += -arch $(arch) -mmacosx-version-min=10.14
LDFLAGS += -arch $(arch) -mmacosx-version-min=10.14
# Currently we don't know how to make PGO builds with the NDK.
ifeq ($(COMP),ndk)
CXXFLAGS += -stdlib=libc++ -fPIE
+ comp=clang
ifeq ($(arch),armv7)
- comp=armv7a-linux-androideabi16-clang
CXX=armv7a-linux-androideabi16-clang++
CXXFLAGS += -mthumb -march=armv7-a -mfloat-abi=softfp -mfpu=neon
STRIP=arm-linux-androideabi-strip
endif
ifeq ($(arch),armv8)
- comp=aarch64-linux-android21-clang
CXX=aarch64-linux-android21-clang++
STRIP=aarch64-linux-android-strip
endif
LDFLAGS += -static-libstdc++ -pie -lm -latomic
endif
+ifeq ($(comp),icc)
+ profile_make = icc-profile-make
+ profile_use = icc-profile-use
+else ifeq ($(comp),clang)
+ profile_make = clang-profile-make
+ profile_use = clang-profile-use
+else
+ profile_make = gcc-profile-make
+ profile_use = gcc-profile-use
+endif
+
### Travis CI script uses COMPILER to overwrite CXX
ifdef COMPILER
COMPCXX=$(COMPILER)
### needs access to the optimization flags.
ifeq ($(optimize),yes)
ifeq ($(debug), no)
- ifeq ($(COMP),ndk)
- CXXFLAGS += -flto=thin
- LDFLAGS += $(CXXFLAGS)
- else ifeq ($(comp),clang)
+ ifeq ($(comp),clang)
CXXFLAGS += -flto=thin
ifneq ($(findstring MINGW,$(KERNEL)),)
CXXFLAGS += -fuse-ld=lld
config-sanity icc-profile-use icc-profile-make gcc-profile-use gcc-profile-make \
clang-profile-use clang-profile-make
-build: config-sanity net
+build: net config-sanity
$(MAKE) ARCH=$(ARCH) COMP=$(COMP) all
profile-build: net config-sanity objclean profileclean
all: $(EXE) client .depend
-config-sanity:
+config-sanity: net
@echo ""
@echo "Config:"
@echo "debug: '$(debug)'"
assert(verify_material(pos, strongSide, RookValueMg, 2));
assert(verify_material(pos, weakSide, RookValueMg, 1));
- Square strongPawn1 = pos.squares<PAWN>(strongSide)[0];
- Square strongPawn2 = pos.squares<PAWN>(strongSide)[1];
+ Square strongPawn1 = lsb(pos.pieces(strongSide, PAWN));
+ Square strongPawn2 = msb(pos.pieces(strongSide, PAWN));
Square weakKing = pos.square<KING>(weakSide);
// Does the stronger side have a passed pawn?
return SCALE_FACTOR_NONE;
Square weakKing = pos.square<KING>(weakSide);
- Square strongPawn1 = pos.squares<PAWN>(strongSide)[0];
- Square strongPawn2 = pos.squares<PAWN>(strongSide)[1];
+ Square strongPawn1 = lsb(pos.pieces(strongSide, PAWN));
+ Square strongPawn2 = msb(pos.pieces(strongSide, PAWN));
Square blockSq1, blockSq2;
if (relative_rank(strongSide, strongPawn1) > relative_rank(strongSide, strongPawn2))
bool useNNUE;
string eval_file_loaded = "None";
- /// init_NNUE() tries to load a nnue network at startup time, or when the engine
+ /// NNUE::init() tries to load a nnue network at startup time, or when the engine
/// receives a UCI command "setoption name EvalFile value nn-[a-z0-9]{12}.nnue"
/// The name of the nnue network is always retrieved from the EvalFile option.
/// We search the given network in three locations: internally (the default
/// in the engine directory. Distro packagers may define the DEFAULT_NNUE_DIRECTORY
/// variable to have the engine search in a special directory in their distro.
- void init_NNUE() {
+ void NNUE::init() {
useNNUE = Options["Use NNUE"];
if (!useNNUE)
}
}
- /// verify_NNUE() verifies that the last net used was loaded successfully
- void verify_NNUE() {
+ /// NNUE::verify() verifies that the last net used was loaded successfully
+ void NNUE::verify() {
string eval_file = string(Options["EvalFile"]);
namespace {
// Threshold for lazy and space evaluation
- constexpr Value LazyThreshold1 = Value(1400);
- constexpr Value LazyThreshold2 = Value(1300);
- constexpr Value SpaceThreshold = Value(12222);
- constexpr Value NNUEThreshold1 = Value(550);
- constexpr Value NNUEThreshold2 = Value(150);
+ constexpr Value LazyThreshold1 = Value(1565);
+ constexpr Value LazyThreshold2 = Value(1102);
+ constexpr Value SpaceThreshold = Value(11551);
+ constexpr Value NNUEThreshold1 = Value(682);
+ constexpr Value NNUEThreshold2 = Value(176);
// KingAttackWeights[PieceType] contains king attack weights by piece type
constexpr int KingAttackWeights[PIECE_TYPE_NB] = { 0, 0, 81, 52, 44, 10 };
// SafeCheck[PieceType][single/multiple] contains safe check bonus by piece type,
// higher if multiple safe checks are possible for that piece type.
constexpr int SafeCheck[][2] = {
- {}, {}, {792, 1283}, {645, 967}, {1084, 1897}, {772, 1119}
+ {}, {}, {803, 1292}, {639, 974}, {1087, 1878}, {759, 1132}
};
#define S(mg, eg) make_score(mg, eg)
// MobilityBonus[PieceType-2][attacked] contains bonuses for middle and end game,
// indexed by piece type and number of attacked squares in the mobility area.
constexpr Score MobilityBonus[][32] = {
- { S(-62,-81), S(-53,-56), S(-12,-31), S( -4,-16), S( 3, 5), S( 13, 11), // Knight
- S( 22, 17), S( 28, 20), S( 33, 25) },
- { S(-48,-59), S(-20,-23), S( 16, -3), S( 26, 13), S( 38, 24), S( 51, 42), // Bishop
- S( 55, 54), S( 63, 57), S( 63, 65), S( 68, 73), S( 81, 78), S( 81, 86),
- S( 91, 88), S( 98, 97) },
- { S(-60,-78), S(-20,-17), S( 2, 23), S( 3, 39), S( 3, 70), S( 11, 99), // Rook
- S( 22,103), S( 31,121), S( 40,134), S( 40,139), S( 41,158), S( 48,164),
- S( 57,168), S( 57,169), S( 62,172) },
- { S(-30,-48), S(-12,-30), S( -8, -7), S( -9, 19), S( 20, 40), S( 23, 55), // Queen
- S( 23, 59), S( 35, 75), S( 38, 78), S( 53, 96), S( 64, 96), S( 65,100),
- S( 65,121), S( 66,127), S( 67,131), S( 67,133), S( 72,136), S( 72,141),
- S( 77,147), S( 79,150), S( 93,151), S(108,168), S(108,168), S(108,171),
- S(110,182), S(114,182), S(114,192), S(116,219) }
+ { S(-62,-79), S(-53,-57), S(-12,-31), S( -3,-17), S( 3, 7), S( 12, 13), // Knight
+ S( 21, 16), S( 28, 21), S( 37, 26) },
+ { S(-47,-59), S(-20,-25), S( 14, -8), S( 29, 12), S( 39, 21), S( 53, 40), // Bishop
+ S( 53, 56), S( 60, 58), S( 62, 65), S( 69, 72), S( 78, 78), S( 83, 87),
+ S( 91, 88), S( 96, 98) },
+ { S(-60,-82), S(-24,-15), S( 0, 17), S( 3, 43), S( 4, 72), S( 14,100), // Rook
+ S( 20,102), S( 30,122), S( 41,133), S( 41,139), S( 41,153), S( 45,160),
+ S( 57,165), S( 58,170), S( 67,175) },
+ { S(-29,-49), S(-16,-29), S( -8, -8), S( -8, 17), S( 18, 39), S( 25, 54), // Queen
+ S( 23, 59), S( 37, 73), S( 41, 76), S( 54, 95), S( 65, 95), S( 68,101),
+ S( 69,124), S( 70,128), S( 70,132), S( 70,133), S( 71,136), S( 72,140),
+ S( 74,147), S( 76,149), S( 90,153), S(104,169), S(105,171), S(106,171),
+ S(112,178), S(114,185), S(114,187), S(119,221) }
+ };
+
+ // BishopPawns[distance from edge] contains a file-dependent penalty for pawns on
+ // squares of the same color as our bishop.
+ constexpr Score BishopPawns[int(FILE_NB) / 2] = {
+ S(3, 8), S(3, 9), S(1, 8), S(3, 7)
};
// KingProtector[knight/bishop] contains penalty for each distance unit to own king
S(0, 0), S(9, 28), S(15, 31), S(17, 39), S(64, 70), S(171, 177), S(277, 260)
};
- // RookOnFile[semiopen/open] contains bonuses for each rook when there is
- // no (friendly) pawn on the rook file.
- constexpr Score RookOnFile[] = { S(19, 7), S(48, 27) };
+ constexpr Score RookOnClosedFile = S(10, 5);
+ constexpr Score RookOnOpenFile[] = { S(19, 7), S(48, 27) };
// ThreatByMinor/ByRook[attacked PieceType] contains bonuses according to
// which piece type attacks which one. Attacks on lesser pieces which are
// Assorted bonuses and penalties
constexpr Score BadOutpost = S( -7, 36);
constexpr Score BishopOnKingRing = S( 24, 0);
- constexpr Score BishopPawns = S( 3, 7);
constexpr Score BishopXRayPawns = S( 4, 5);
constexpr Score CorneredBishop = S( 50, 50);
constexpr Score FlankAttacks = S( 8, 0);
constexpr Score ReachableOutpost = S( 31, 22);
constexpr Score RestrictedPiece = S( 7, 7);
constexpr Score RookOnKingRing = S( 16, 0);
- constexpr Score RookOnQueenFile = S( 6, 11);
constexpr Score SliderOnQueen = S( 60, 18);
constexpr Score ThreatByKing = S( 24, 89);
constexpr Score ThreatByPawnPush = S( 48, 39);
constexpr Direction Down = -pawn_push(Us);
constexpr Bitboard OutpostRanks = (Us == WHITE ? Rank4BB | Rank5BB | Rank6BB
: Rank5BB | Rank4BB | Rank3BB);
- const Square* pl = pos.squares<Pt>(Us);
-
+ Bitboard b1 = pos.pieces(Us, Pt);
Bitboard b, bb;
Score score = SCORE_ZERO;
attackedBy[Us][Pt] = 0;
- for (Square s = *pl; s != SQ_NONE; s = *++pl)
- {
+ while (b1) {
+ Square s = pop_lsb(&b1);
+
// Find attacked squares, including x-ray attacks for bishops and rooks
b = Pt == BISHOP ? attacks_bb<BISHOP>(s, pos.pieces() ^ pos.pieces(QUEEN))
: Pt == ROOK ? attacks_bb< ROOK>(s, pos.pieces() ^ pos.pieces(QUEEN) ^ pos.pieces(Us, ROOK))
// when the bishop is outside the pawn chain.
Bitboard blocked = pos.pieces(Us, PAWN) & shift<Down>(pos.pieces());
- score -= BishopPawns * pos.pawns_on_same_color_squares(Us, s)
+ score -= BishopPawns[edge_distance(file_of(s))] * pos.pawns_on_same_color_squares(Us, s)
* (!(attackedBy[Us][PAWN] & s) + popcount(blocked & CenterFiles));
// Penalty for all enemy pawns x-rayed
if (Pt == ROOK)
{
- // Bonus for rook on the same file as a queen
- if (file_bb(s) & pos.pieces(QUEEN))
- score += RookOnQueenFile;
-
- // Bonus for rook on an open or semi-open file
+ // Bonuses for rook on a (semi-)open or closed file
if (pos.is_on_semiopen_file(Us, s))
- score += RookOnFile[pos.is_on_semiopen_file(Them, s)];
-
- // Penalty when trapped by the king, even more if the king cannot castle
- else if (mob <= 3)
{
- File kf = file_of(pos.square<KING>(Us));
- if ((kf < FILE_E) == (file_of(s) < kf))
- score -= TrappedRook * (1 + !pos.castling_rights(Us));
+ score += RookOnOpenFile[pos.is_on_semiopen_file(Them, s)];
+ }
+ else
+ {
+ // If our pawn on this file is blocked, increase penalty
+ if ( pos.pieces(Us, PAWN)
+ & shift<Down>(pos.pieces())
+ & file_bb(s))
+ {
+ score -= RookOnClosedFile;
+ }
+
+ // Penalty when trapped by the king, even more if the king cannot castle
+ if (mob <= 3)
+ {
+ File kf = file_of(pos.square<KING>(Us));
+ if ((kf < FILE_E) == (file_of(s) < kf))
+ score -= TrappedRook * (1 + !pos.castling_rights(Us));
+ }
}
}
int kingFlankAttack = popcount(b1) + popcount(b2);
int kingFlankDefense = popcount(b3);
- kingDanger += kingAttackersCount[Them] * kingAttackersWeight[Them]
- + 185 * popcount(kingRing[Us] & weak)
- + 148 * popcount(unsafeChecks)
- + 98 * popcount(pos.blockers_for_king(Us))
- + 69 * kingAttacksCount[Them]
- + 3 * kingFlankAttack * kingFlankAttack / 8
- + mg_value(mobility[Them] - mobility[Us])
- - 873 * !pos.count<QUEEN>(Them)
- - 100 * bool(attackedBy[Us][KNIGHT] & attackedBy[Us][KING])
- - 6 * mg_value(score) / 8
- - 4 * kingFlankDefense
- + 37;
+ kingDanger += kingAttackersCount[Them] * kingAttackersWeight[Them] // (~10 Elo)
+ + 185 * popcount(kingRing[Us] & weak) // (~15 Elo)
+ + 148 * popcount(unsafeChecks) // (~4 Elo)
+ + 98 * popcount(pos.blockers_for_king(Us)) // (~2 Elo)
+ + 69 * kingAttacksCount[Them] // (~0.5 Elo)
+ + 3 * kingFlankAttack * kingFlankAttack / 8 // (~0.5 Elo)
+ + mg_value(mobility[Them] - mobility[Us]) // (~0.5 Elo)
+ - 873 * !pos.count<QUEEN>(Them) // (~24 Elo)
+ - 100 * bool(attackedBy[Us][KNIGHT] & attackedBy[Us][KING]) // (~5 Elo)
+ - 6 * mg_value(score) / 8 // (~8 Elo)
+ - 4 * kingFlankDefense // (~5 Elo)
+ + 37; // (~0.5 Elo)
// Transform the kingDanger units into a Score, and subtract it from the evaluation
if (kingDanger > 100)
sf = 37 + 3 * (pos.count<QUEEN>(WHITE) == 1 ? pos.count<BISHOP>(BLACK) + pos.count<KNIGHT>(BLACK)
: pos.count<BISHOP>(WHITE) + pos.count<KNIGHT>(WHITE));
else
- sf = std::min(sf, 36 + 7 * pos.count<PAWN>(strongSide));
+ sf = std::min(sf, 36 + 7 * pos.count<PAWN>(strongSide)) - 4 * !pawnsOnBothFlanks;
+
+ sf -= 4 * !pawnsOnBothFlanks;
}
// Interpolate between the middlegame and (scaled by 'sf') endgame score
Value Eval::evaluate(const Position& pos) {
- // Use classical eval if there is a large imbalance
- // If there is a moderate imbalance, use classical eval with probability (1/8),
- // as derived from the node counter.
- bool useClassical = abs(eg_value(pos.psq_score())) * 16 > NNUEThreshold1 * (16 + pos.rule50_count());
- bool classical = !Eval::useNNUE
- || useClassical
- || (abs(eg_value(pos.psq_score())) > PawnValueMg / 4 && !(pos.this_thread()->nodes & 0xB));
- Value v = classical ? Evaluation<NO_TRACE>(pos).value()
- : NNUE::evaluate(pos) * 5 / 4 + Tempo;
-
- if ( useClassical
- && Eval::useNNUE
- && abs(v) * 16 < NNUEThreshold2 * (16 + pos.rule50_count()))
- v = NNUE::evaluate(pos) * 5 / 4 + Tempo;
+ Value v;
+
+ if (!Eval::useNNUE)
+ v = Evaluation<NO_TRACE>(pos).value();
+ else
+ {
+ // Scale and shift NNUE for compatibility with search and classical evaluation
+ auto adjusted_NNUE = [&](){
+ int mat = pos.non_pawn_material() + PawnValueMg * pos.count<PAWN>();
+ return NNUE::evaluate(pos) * (679 + mat / 32) / 1024 + Tempo;
+ };
+
+ // If the PSQ imbalance is large, use classical eval; if it is moderate, use classical eval only with small probability
+ Value psq = Value(abs(eg_value(pos.psq_score())));
+ int r50 = 16 + pos.rule50_count();
+ bool largePsq = psq * 16 > (NNUEThreshold1 + pos.non_pawn_material() / 64) * r50;
+ bool classical = largePsq || (psq > PawnValueMg / 4 && !(pos.this_thread()->nodes & 0xB));
+
+ bool strongClassical = pos.non_pawn_material() < 2 * RookValueMg && pos.count<PAWN>() < 2;
+
+ v = classical || strongClassical ? Evaluation<NO_TRACE>(pos).value() : adjusted_NNUE();
+
+ // If the classical eval is small and imbalance large, use NNUE nevertheless.
+ // For the case of opposite colored bishops, switch to NNUE eval with
+ // small probability if the classical eval is less than the threshold.
+ if ( largePsq && !strongClassical
+ && ( abs(v) * 16 < NNUEThreshold2 * r50
+ || ( pos.opposite_bishops()
+ && abs(v) * 16 < (NNUEThreshold1 + pos.non_pawn_material() / 64) * r50
+ && !(pos.this_thread()->nodes & 0xB))))
+ v = adjusted_NNUE();
+ }
// Damp down the evaluation linearly when shuffling
v = v * (100 - pos.rule50_count()) / 100;
extern bool useNNUE;
extern std::string eval_file_loaded;
- void init_NNUE();
- void verify_NNUE();
// The default net name MUST follow the format nn-[SHA256 first 12 digits].nnue
// for the build process (profile-build and fishtest) to work. Do not change the
// name of the macro, as it is used in the Makefile.
- #define EvalFileDefaultName "nn-308d71810dff.nnue"
+ #define EvalFileDefaultName "nn-62ef826d1a6d.nnue"
namespace NNUE {
Value evaluate(const Position& pos);
- Value compute_eval(const Position& pos);
- void update_eval(const Position& pos);
- bool load_eval(std::string streamName, std::istream& stream);
+ bool load_eval(std::string name, std::istream& stream);
+ void init();
+ void verify();
} // namespace NNUE
Endgames::init();
Threads.set(size_t(Options["Threads"]));
Search::clear(); // After threads are up
- Eval::init_NNUE();
+ Eval::NNUE::init();
UCI::loop(argc, argv);
#endif
}
-/// aligned_ttmem_alloc() will return suitably aligned memory, if possible using large pages.
-/// The returned pointer is the aligned one, while the mem argument is the one that needs
-/// to be passed to free. With c++17 some of this functionality could be simplified.
+/// aligned_large_pages_alloc() will return suitably aligned memory, if possible using large pages.
-#if defined(__linux__) && !defined(__ANDROID__)
+#if defined(_WIN32)
-void* aligned_ttmem_alloc(size_t allocSize, void*& mem) {
-
- constexpr size_t alignment = 2 * 1024 * 1024; // assumed 2MB page sizes
- size_t size = ((allocSize + alignment - 1) / alignment) * alignment; // multiple of alignment
- if (posix_memalign(&mem, alignment, size))
- mem = nullptr;
-#if defined(MADV_HUGEPAGE)
- madvise(mem, allocSize, MADV_HUGEPAGE);
-#endif
- return mem;
-}
-
-#elif defined(_WIN64)
-
-static void* aligned_ttmem_alloc_large_pages(size_t allocSize) {
+static void* aligned_large_pages_alloc_win(size_t allocSize) {
HANDLE hProcessToken { };
LUID luid { };
return mem;
}
-void* aligned_ttmem_alloc(size_t allocSize, void*& mem) {
-
- static bool firstCall = true;
+void* aligned_large_pages_alloc(size_t allocSize) {
// Try to allocate large pages
- mem = aligned_ttmem_alloc_large_pages(allocSize);
-
- // Suppress info strings on the first call. The first call occurs before 'uci'
- // is received and in that case this output confuses some GUIs.
- if (!firstCall)
- {
- if (mem)
- sync_cout << "info string Hash table allocation: Windows large pages used." << sync_endl;
- else
- sync_cout << "info string Hash table allocation: Windows large pages not used." << sync_endl;
- }
- firstCall = false;
+ void* mem = aligned_large_pages_alloc_win(allocSize);
// Fall back to regular, page aligned, allocation if necessary
if (!mem)
#else
-void* aligned_ttmem_alloc(size_t allocSize, void*& mem) {
+void* aligned_large_pages_alloc(size_t allocSize) {
- constexpr size_t alignment = 64; // assumed cache line size
- size_t size = allocSize + alignment - 1; // allocate some extra space
- mem = malloc(size);
- void* ret = reinterpret_cast<void*>((uintptr_t(mem) + alignment - 1) & ~uintptr_t(alignment - 1));
- return ret;
+#if defined(__linux__)
+ constexpr size_t alignment = 2 * 1024 * 1024; // assumed 2MB page size
+#else
+ constexpr size_t alignment = 4096; // assumed small page size
+#endif
+
+ // round up to multiples of alignment
+ size_t size = ((allocSize + alignment - 1) / alignment) * alignment;
+ void *mem = std_aligned_alloc(alignment, size);
+#if defined(MADV_HUGEPAGE)
+ madvise(mem, size, MADV_HUGEPAGE);
+#endif
+ return mem;
}
#endif
-/// aligned_ttmem_free() will free the previously allocated ttmem
+/// aligned_large_pages_free() will free the previously allocated memory
-#if defined(_WIN64)
+#if defined(_WIN32)
-void aligned_ttmem_free(void* mem) {
+void aligned_large_pages_free(void* mem) {
if (mem && !VirtualFree(mem, 0, MEM_RELEASE))
{
#else
-void aligned_ttmem_free(void *mem) {
- free(mem);
+void aligned_large_pages_free(void *mem) {
+ std_aligned_free(mem);
}
#endif
string argv0; // path+name of the executable binary, as given by argv[0]
string binaryDirectory; // path of the executable directory
string workingDirectory; // path of the working directory
-string pathSeparator; // Separator for our current OS
void init(int argc, char* argv[]) {
(void)argc;
- string separator;
+ string pathSeparator;
// extract the path+name of the executable binary
argv0 = argv[0];
#include <ostream>
#include <string>
#include <vector>
+#include <cstdint>
#include "types.h"
void start_logger(const std::string& fname);
void* std_aligned_alloc(size_t alignment, size_t size);
void std_aligned_free(void* ptr);
-void* aligned_ttmem_alloc(size_t size, void*& mem);
-void aligned_ttmem_free(void* mem); // nop if mem == nullptr
+void* aligned_large_pages_alloc(size_t size); // memory aligned by page size, min alignment: 4096 bytes
+void aligned_large_pages_free(void* mem); // nop if mem == nullptr
void dbg_hit_on(bool b);
void dbg_hit_on(bool c, bool b);
#define sync_cout std::cout << IO_LOCK
#define sync_endl std::endl << IO_UNLOCK
+// `ptr` must point to an array of size at least
+// `sizeof(T) * N + alignment` bytes, where `N` is the
+// number of elements in the array.
+template <uintptr_t Alignment, typename T>
+T* align_ptr_up(T* ptr)
+{
+ static_assert(alignof(T) < Alignment);
+
+ const uintptr_t ptrint = reinterpret_cast<uintptr_t>(reinterpret_cast<char*>(ptr));
+ return reinterpret_cast<T*>(reinterpret_cast<char*>((ptrint + (Alignment - 1)) / Alignment * Alignment));
+}
/// xorshift64star Pseudo-Random Number Generator
/// This class is based on original code written and dedicated
static_assert(Pt != KING && Pt != PAWN, "Unsupported piece type in generate_moves()");
- const Square* pl = pos.squares<Pt>(Us);
+ Bitboard bb = pos.pieces(Us, Pt);
+
+ while (bb) {
+ Square from = pop_lsb(&bb);
- for (Square from = *pl; from != SQ_NONE; from = *++pl)
- {
if (Checks)
{
if ( (Pt == BISHOP || Pt == ROOK || Pt == QUEEN)
assert(d <= 0);
stage = (pos.checkers() ? EVASION_TT : QSEARCH_TT) +
- !(ttm && (depth > DEPTH_QS_RECAPTURES || to_sq(ttm) == recaptureSquare)
- && pos.pseudo_legal(ttm));
+ !( ttm
+ && (pos.checkers() || depth > DEPTH_QS_RECAPTURES || to_sq(ttm) == recaptureSquare)
+ && pos.pseudo_legal(ttm));
}
/// MovePicker constructor for ProbCut: we generate captures with SEE greater
#include "../position.h"
#include "../misc.h"
#include "../uci.h"
+#include "../types.h"
#include "evaluate_nnue.h"
namespace Eval::NNUE {
- uint32_t kpp_board_index[PIECE_NB][COLOR_NB] = {
- // convention: W - us, B - them
- // viewed from other side, W and B are reversed
- { PS_NONE, PS_NONE },
- { PS_W_PAWN, PS_B_PAWN },
- { PS_W_KNIGHT, PS_B_KNIGHT },
- { PS_W_BISHOP, PS_B_BISHOP },
- { PS_W_ROOK, PS_B_ROOK },
- { PS_W_QUEEN, PS_B_QUEEN },
- { PS_W_KING, PS_B_KING },
- { PS_NONE, PS_NONE },
- { PS_NONE, PS_NONE },
- { PS_B_PAWN, PS_W_PAWN },
- { PS_B_KNIGHT, PS_W_KNIGHT },
- { PS_B_BISHOP, PS_W_BISHOP },
- { PS_B_ROOK, PS_W_ROOK },
- { PS_B_QUEEN, PS_W_QUEEN },
- { PS_B_KING, PS_W_KING },
- { PS_NONE, PS_NONE }
- };
-
// Input feature converter
- AlignedPtr<FeatureTransformer> feature_transformer;
+ LargePagePtr<FeatureTransformer> feature_transformer;
// Evaluation function
AlignedPtr<Network> network;
std::memset(pointer.get(), 0, sizeof(T));
}
+ template <typename T>
+ void Initialize(LargePagePtr<T>& pointer) {
+
+ static_assert(alignof(T) <= 4096, "aligned_large_pages_alloc() may fail for such a big alignment requirement of T");
+ pointer.reset(reinterpret_cast<T*>(aligned_large_pages_alloc(sizeof(T))));
+ std::memset(pointer.get(), 0, sizeof(T));
+ }
+
// Read evaluation function parameters
template <typename T>
- bool ReadParameters(std::istream& stream, const AlignedPtr<T>& pointer) {
+ bool ReadParameters(std::istream& stream, T& reference) {
std::uint32_t header;
header = read_little_endian<std::uint32_t>(stream);
if (!stream || header != T::GetHashValue()) return false;
- return pointer->ReadParameters(stream);
+ return reference.ReadParameters(stream);
}
} // namespace Detail
std::string architecture;
if (!ReadHeader(stream, &hash_value, &architecture)) return false;
if (hash_value != kHashValue) return false;
- if (!Detail::ReadParameters(stream, feature_transformer)) return false;
- if (!Detail::ReadParameters(stream, network)) return false;
+ if (!Detail::ReadParameters(stream, *feature_transformer)) return false;
+ if (!Detail::ReadParameters(stream, *network)) return false;
return stream && stream.peek() == std::ios::traits_type::eof();
}
// Evaluation function. Perform differential calculation.
Value evaluate(const Position& pos) {
- alignas(kCacheLineSize) TransformedFeatureType
- transformed_features[FeatureTransformer::kBufferSize];
+ // We manually align the arrays on the stack because with gcc < 9.3
+ // overaligning stack variables with alignas() doesn't work correctly.
+
+ constexpr uint64_t alignment = kCacheLineSize;
+
+#if defined(ALIGNAS_ON_STACK_VARIABLES_BROKEN)
+ TransformedFeatureType transformed_features_unaligned[
+ FeatureTransformer::kBufferSize + alignment / sizeof(TransformedFeatureType)];
+ char buffer_unaligned[Network::kBufferSize + alignment];
+
+ auto* transformed_features = align_ptr_up<alignment>(&transformed_features_unaligned[0]);
+ auto* buffer = align_ptr_up<alignment>(&buffer_unaligned[0]);
+#else
+ alignas(alignment)
+ TransformedFeatureType transformed_features[FeatureTransformer::kBufferSize];
+ alignas(alignment) char buffer[Network::kBufferSize];
+#endif
+
+ ASSERT_ALIGNED(transformed_features, alignment);
+ ASSERT_ALIGNED(buffer, alignment);
+
feature_transformer->Transform(pos, transformed_features);
- alignas(kCacheLineSize) char buffer[Network::kBufferSize];
const auto output = network->Propagate(transformed_features, buffer);
return static_cast<Value>(output[0] / FV_SCALE);
}
// Load eval, from a file stream or a memory stream
- bool load_eval(std::string streamName, std::istream& stream) {
+ bool load_eval(std::string name, std::istream& stream) {
Initialize();
- fileName = streamName;
+ fileName = name;
return ReadParameters(stream);
}
}
};
+ template <typename T>
+ struct LargePageDeleter {
+ void operator()(T* ptr) const {
+ ptr->~T();
+ aligned_large_pages_free(ptr);
+ }
+ };
+
template <typename T>
using AlignedPtr = std::unique_ptr<T, AlignedDeleter<T>>;
+ template <typename T>
+ using LargePagePtr = std::unique_ptr<T, LargePageDeleter<T>>;
+
} // namespace Eval::NNUE
#endif // #ifndef NNUE_EVALUATE_NNUE_H_INCLUDED
template <typename Derived>
class FeatureSetBase {
- public:
- // Get a list of indices for active features
- template <typename IndexListType>
- static void AppendActiveIndices(
- const Position& pos, TriggerEvent trigger, IndexListType active[2]) {
-
- for (Color perspective : { WHITE, BLACK }) {
- Derived::CollectActiveIndices(
- pos, trigger, perspective, &active[perspective]);
- }
- }
-
- // Get a list of indices for recently changed features
- template <typename PositionType, typename IndexListType>
- static void AppendChangedIndices(
- const PositionType& pos, TriggerEvent trigger,
- IndexListType removed[2], IndexListType added[2], bool reset[2]) {
-
- const auto& dp = pos.state()->dirtyPiece;
- if (dp.dirty_num == 0) return;
-
- for (Color perspective : { WHITE, BLACK }) {
- reset[perspective] = false;
- switch (trigger) {
- case TriggerEvent::kFriendKingMoved:
- reset[perspective] = dp.piece[0] == make_piece(perspective, KING);
- break;
- default:
- assert(false);
- break;
- }
- if (reset[perspective]) {
- Derived::CollectActiveIndices(
- pos, trigger, perspective, &added[perspective]);
- } else {
- Derived::CollectChangedIndices(
- pos, trigger, perspective,
- &removed[perspective], &added[perspective]);
- }
- }
- }
};
// Class template that represents the feature set
CompileTimeList<TriggerEvent, FeatureType::kRefreshTrigger>;
static constexpr auto kRefreshTriggers = SortedTriggerSet::kValues;
- private:
- // Get a list of indices for active features
- static void CollectActiveIndices(
- const Position& pos, const TriggerEvent trigger, const Color perspective,
- IndexList* const active) {
- if (FeatureType::kRefreshTrigger == trigger) {
- FeatureType::AppendActiveIndices(pos, perspective, active);
- }
- }
-
- // Get a list of indices for recently changed features
- static void CollectChangedIndices(
- const Position& pos, const TriggerEvent trigger, const Color perspective,
- IndexList* const removed, IndexList* const added) {
-
- if (FeatureType::kRefreshTrigger == trigger) {
- FeatureType::AppendChangedIndices(pos, perspective, removed, added);
- }
- }
-
- // Make the base class and the class template that recursively uses itself a friend
- friend class FeatureSetBase<FeatureSet>;
- template <typename... FeatureTypes>
- friend class FeatureSet;
};
} // namespace Eval::NNUE::Features
return Square(int(s) ^ (bool(perspective) * 63));
}
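As a quick sketch of the `orient()` trick above: XOR-ing a 0..63 square index with 63 rotates the board 180 degrees, which is how the black perspective is derived. This assumes Stockfish's square numbering with A1 = 0 and H8 = 63; the helper name below is illustrative, not from the source.

```cpp
#include <cassert>

// Minimal sketch of orient(): for the black perspective the square index
// is XOR-ed with 63, mapping A1(0) <-> H8(63), B1(1) <-> G8(62), etc.
// White's perspective leaves the index unchanged.
int orient_ref(bool black, int sq) {
    return black ? sq ^ 63 : sq;
}
```

Note that the mapping is an involution: applying it twice returns the original square, so the same function serves both directions.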
- // Find the index of the feature quantity from the king position and PieceSquare
- template <Side AssociatedKing>
- inline IndexType HalfKP<AssociatedKing>::MakeIndex(
- Color perspective, Square s, Piece pc, Square ksq) {
-
- return IndexType(orient(perspective, s) + kpp_board_index[pc][perspective] + PS_END * ksq);
+ // Index of a feature for a given king position and another piece on some square
+ inline IndexType make_index(Color perspective, Square s, Piece pc, Square ksq) {
+ return IndexType(orient(perspective, s) + kpp_board_index[perspective][pc] + PS_END * ksq);
}
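A scalar sketch of the composition performed by `make_index()` above: the oriented piece square, a per-piece/per-perspective offset, and the oriented king square each occupy their own stride. The `TOY_PS_END` constant below is an illustrative stand-in, not necessarily Stockfish's real `PS_END` value.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch of the HalfKP index layout: for each king square
// there is one block of TOY_PS_END piece-square features, and within a
// block each (piece, perspective) pair has its own 64-square bucket.
constexpr uint32_t TOY_PS_END = 641;  // assumed size of one king block

uint32_t make_index_ref(uint32_t oriented_sq, uint32_t piece_offset,
                        uint32_t oriented_ksq) {
    return oriented_sq + piece_offset + TOY_PS_END * oriented_ksq;
}
```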
// Get a list of indices for active features
Bitboard bb = pos.pieces() & ~pos.pieces(KING);
while (bb) {
Square s = pop_lsb(&bb);
- active->push_back(MakeIndex(perspective, s, pos.piece_on(s), ksq));
+ active->push_back(make_index(perspective, s, pos.piece_on(s), ksq));
}
}
// Get a list of indices for recently changed features
template <Side AssociatedKing>
void HalfKP<AssociatedKing>::AppendChangedIndices(
- const Position& pos, Color perspective,
+ const Position& pos, const DirtyPiece& dp, Color perspective,
IndexList* removed, IndexList* added) {
Square ksq = orient(perspective, pos.square<KING>(perspective));
- const auto& dp = pos.state()->dirtyPiece;
for (int i = 0; i < dp.dirty_num; ++i) {
Piece pc = dp.piece[i];
if (type_of(pc) == KING) continue;
if (dp.from[i] != SQ_NONE)
- removed->push_back(MakeIndex(perspective, dp.from[i], pc, ksq));
+ removed->push_back(make_index(perspective, dp.from[i], pc, ksq));
if (dp.to[i] != SQ_NONE)
- added->push_back(MakeIndex(perspective, dp.to[i], pc, ksq));
+ added->push_back(make_index(perspective, dp.to[i], pc, ksq));
}
}
IndexList* active);
// Get a list of indices for recently changed features
- static void AppendChangedIndices(const Position& pos, Color perspective,
+ static void AppendChangedIndices(const Position& pos, const DirtyPiece& dp, Color perspective,
IndexList* removed, IndexList* added);
-
- private:
- // Index of a feature for a given king position and another piece on some square
- static IndexType MakeIndex(Color perspective, Square s, Piece pc, Square sq_k);
};
} // namespace Eval::NNUE::Features
const TransformedFeatureType* transformed_features, char* buffer) const {
const auto input = previous_layer_.Propagate(
transformed_features, buffer + kSelfBufferSize);
+
+#if defined (USE_AVX512)
+
+ [[maybe_unused]] const __m512i kOnes512 = _mm512_set1_epi16(1);
+
+ [[maybe_unused]] auto m512_hadd = [](__m512i sum, int bias) -> int {
+ return _mm512_reduce_add_epi32(sum) + bias;
+ };
+
+ // This function takes
+ // sum0 = [xmm0a, xmm0b, xmm0c, xmm0d]
+ // sum1 = [xmm1a, xmm1b, xmm1c, xmm1d]
+ // sum2 = [xmm2a, xmm2b, xmm2c, xmm2d]
+ // sum3 = [xmm3a, xmm3b, xmm3c, xmm3d]
+ // and returns
+ // ret = [
+ // reduce_add_epi32(xmm0a), reduce_add_epi32(xmm1a), reduce_add_epi32(xmm2a), reduce_add_epi32(xmm3a),
+ // reduce_add_epi32(xmm0b), reduce_add_epi32(xmm1b), reduce_add_epi32(xmm2b), reduce_add_epi32(xmm3b),
+ // reduce_add_epi32(xmm0c), reduce_add_epi32(xmm1c), reduce_add_epi32(xmm2c), reduce_add_epi32(xmm3c),
+ // reduce_add_epi32(xmm0d), reduce_add_epi32(xmm1d), reduce_add_epi32(xmm2d), reduce_add_epi32(xmm3d)
+ // ]
+ [[maybe_unused]] auto m512_hadd128x16_interleave = [](
+ __m512i sum0, __m512i sum1, __m512i sum2, __m512i sum3) -> __m512i {
+
+ __m512i sum01a = _mm512_unpacklo_epi32(sum0, sum1);
+ __m512i sum01b = _mm512_unpackhi_epi32(sum0, sum1);
+
+ __m512i sum23a = _mm512_unpacklo_epi32(sum2, sum3);
+ __m512i sum23b = _mm512_unpackhi_epi32(sum2, sum3);
+
+ __m512i sum01 = _mm512_add_epi32(sum01a, sum01b);
+ __m512i sum23 = _mm512_add_epi32(sum23a, sum23b);
+
+ __m512i sum0123a = _mm512_unpacklo_epi64(sum01, sum23);
+ __m512i sum0123b = _mm512_unpackhi_epi64(sum01, sum23);
+
+ return _mm512_add_epi32(sum0123a, sum0123b);
+ };
+
+ [[maybe_unused]] auto m512_haddx4 = [m512_hadd128x16_interleave](
+ __m512i sum0, __m512i sum1, __m512i sum2, __m512i sum3, __m128i bias) -> __m128i {
+
+ __m512i sum = m512_hadd128x16_interleave(sum0, sum1, sum2, sum3);
+
+ __m256i sum256lo = _mm512_castsi512_si256(sum);
+ __m256i sum256hi = _mm512_extracti64x4_epi64(sum, 1);
+
+ sum256lo = _mm256_add_epi32(sum256lo, sum256hi);
+
+ __m128i sum128lo = _mm256_castsi256_si128(sum256lo);
+ __m128i sum128hi = _mm256_extracti128_si256(sum256lo, 1);
+
+ return _mm_add_epi32(_mm_add_epi32(sum128lo, sum128hi), bias);
+ };
+
+ [[maybe_unused]] auto m512_haddx8 = [m512_hadd128x16_interleave](
+ __m512i sum0, __m512i sum1, __m512i sum2, __m512i sum3,
+ __m512i sum4, __m512i sum5, __m512i sum6, __m512i sum7, __m256i bias) -> __m256i {
+
+ __m512i suma = m512_hadd128x16_interleave(sum0, sum1, sum2, sum3);
+ __m512i sumb = m512_hadd128x16_interleave(sum4, sum5, sum6, sum7);
+
+ __m512i indices0 = _mm512_setr_epi64(0, 1, 8, 9, 4, 5, 12, 13);
+ __m512i indices1 = _mm512_setr_epi64(2, 3, 10, 11, 6, 7, 14, 15);
+ __m512i x = _mm512_add_epi32(
+ _mm512_permutex2var_epi64(suma, indices0, sumb),
+ _mm512_permutex2var_epi64(suma, indices1, sumb));
+
+ __m256i sum256lo = _mm512_castsi512_si256(x);
+ __m256i sum256hi = _mm512_extracti64x4_epi64(x, 1);
+
+ return _mm256_add_epi32(_mm256_add_epi32(sum256lo, sum256hi), bias);
+ };
+
+    [[maybe_unused]] auto m512_hadd256x8 = [m512_hadd128x16_interleave](

+ __m512i sum0, __m512i sum1, __m512i sum2, __m512i sum3, __m256i bias) -> __m256i {
+
+ __m512i sum = m512_hadd128x16_interleave(sum0, sum1, sum2, sum3);
+
+ __m512i indices = _mm512_setr_epi32(
+ 0, 4, 8, 12, 2, 6, 10, 14,
+ 1, 5, 9, 13, 3, 7, 11, 15);
+ sum = _mm512_permutexvar_epi32(indices, sum);
+
+ __m256i sum256lo = _mm512_castsi512_si256(sum);
+ __m256i sum256hi = _mm512_extracti64x4_epi64(sum, 1);
+
+ return _mm256_add_epi32(_mm256_hadd_epi32(sum256lo, sum256hi), bias);
+ };
+
+ [[maybe_unused]] auto m512_hadd256x16 = [m512_hadd128x16_interleave](
+ __m512i sum0, __m512i sum1, __m512i sum2, __m512i sum3,
+ __m512i sum4, __m512i sum5, __m512i sum6, __m512i sum7, __m512i bias) -> __m512i {
+
+ __m512i suma = m512_hadd128x16_interleave(sum0, sum1, sum2, sum3);
+ __m512i sumb = m512_hadd128x16_interleave(sum4, sum5, sum6, sum7);
+
+ __m512i indices0 = _mm512_setr_epi64(0, 1, 8, 9, 4, 5, 12, 13);
+ __m512i indices1 = _mm512_setr_epi64(2, 3, 10, 11, 6, 7, 14, 15);
+ __m512i x = _mm512_add_epi32(
+ _mm512_permutex2var_epi64(suma, indices0, sumb),
+ _mm512_permutex2var_epi64(suma, indices1, sumb));
+
+ __m512i indices = _mm512_setr_epi32(0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15);
+ return _mm512_add_epi32(_mm512_permutexvar_epi32(indices, x), bias);
+ };
+
+#if defined (USE_VNNI)
+ [[maybe_unused]] auto m512_add_dpbusd_epi32 = [=](__m512i& acc, __m512i a, __m512i b) {
+ acc = _mm512_dpbusd_epi32(acc, a, b);
+#else
+ [[maybe_unused]] auto m512_dpbusd_epi32 = [=](__m512i a, __m512i b) -> __m512i {
+ __m512i product0 = _mm512_maddubs_epi16(a, b);
+ return _mm512_madd_epi16(product0, kOnes512);
+#endif
+ };
+
+#endif
+#if defined (USE_AVX2)
+
+ [[maybe_unused]] const __m256i kOnes256 = _mm256_set1_epi16(1);
+
+ [[maybe_unused]] auto m256_hadd = [](__m256i sum, int bias) -> int {
+ __m128i sum128 = _mm_add_epi32(_mm256_castsi256_si128(sum), _mm256_extracti128_si256(sum, 1));
+ sum128 = _mm_add_epi32(sum128, _mm_shuffle_epi32(sum128, _MM_PERM_BADC));
+ sum128 = _mm_add_epi32(sum128, _mm_shuffle_epi32(sum128, _MM_PERM_CDAB));
+ return _mm_cvtsi128_si32(sum128) + bias;
+ };
+
+ [[maybe_unused]] auto m256_haddx4 = [](__m256i sum0, __m256i sum1, __m256i sum2, __m256i sum3, __m128i bias) -> __m128i {
+ sum0 = _mm256_hadd_epi32(sum0, sum1);
+ sum2 = _mm256_hadd_epi32(sum2, sum3);
+
+ sum0 = _mm256_hadd_epi32(sum0, sum2);
+
+ __m128i sum128lo = _mm256_castsi256_si128(sum0);
+ __m128i sum128hi = _mm256_extracti128_si256(sum0, 1);
+
+ return _mm_add_epi32(_mm_add_epi32(sum128lo, sum128hi), bias);
+ };
+#if defined (USE_VNNI)
+ [[maybe_unused]] auto m256_add_dpbusd_epi32 = [=](__m256i& acc, __m256i a, __m256i b) {
+ acc = _mm256_dpbusd_epi32(acc, a, b);
+#else
+ [[maybe_unused]] auto m256_dpbusd_epi32 = [=](__m256i a, __m256i b) -> __m256i {
+ __m256i product0 = _mm256_maddubs_epi16(a, b);
+ return _mm256_madd_epi16(product0, kOnes256);
+#endif
+ };
+
+#endif
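The `m128_hadd`, `m256_hadd` and `m512_hadd` helpers above all implement the same scalar operation: sum every 32-bit lane of a vector and add a bias. A plain-C++ reference of that reduction (the function name is an assumption for illustration) could be:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference for the m128/m256/m512 hadd helpers: horizontally sum
// all n 32-bit lanes of a vector and add the bias.
int32_t hadd_ref(const int32_t* lanes, std::size_t n, int32_t bias) {
    int32_t sum = bias;
    for (std::size_t i = 0; i < n; ++i)
        sum += lanes[i];
    return sum;
}
```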
+
+#if defined (USE_SSSE3)
+
+ [[maybe_unused]] const __m128i kOnes128 = _mm_set1_epi16(1);
+
+ [[maybe_unused]] auto m128_hadd = [](__m128i sum, int bias) -> int {
+ sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, 0x4E)); //_MM_PERM_BADC
+ sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, 0xB1)); //_MM_PERM_CDAB
+ return _mm_cvtsi128_si32(sum) + bias;
+ };
+
+ [[maybe_unused]] auto m128_haddx4 = [](__m128i sum0, __m128i sum1, __m128i sum2, __m128i sum3, __m128i bias) -> __m128i {
+ sum0 = _mm_hadd_epi32(sum0, sum1);
+ sum2 = _mm_hadd_epi32(sum2, sum3);
+
+ sum0 = _mm_hadd_epi32(sum0, sum2);
+
+ return _mm_add_epi32(sum0, bias);
+ };
+
+ [[maybe_unused]] auto m128_dpbusd_epi32 = [=](__m128i a, __m128i b) -> __m128i {
+ __m128i product0 = _mm_maddubs_epi16(a, b);
+ return _mm_madd_epi16(product0, kOnes128);
+ };
+
+#endif
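The maddubs/madd emulation of `dpbusd` in the helpers above computes, per 32-bit lane, a dot product of four unsigned 8-bit inputs with four signed 8-bit weights. A scalar model of one lane follows; note the real `vpmaddubsw` saturates its 16-bit intermediate pair sums, a detail this sketch omits.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one 32-bit lane of the dpbusd emulation: a dot product
// of four unsigned 8-bit inputs with four signed 8-bit weights. The real
// vpmaddubsw saturates its 16-bit intermediate sums; this sketch does not.
int32_t dpbusd_lane_ref(const uint8_t a[4], const int8_t b[4]) {
    int32_t sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += int32_t(a[i]) * int32_t(b[i]);
    return sum;
}
```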
+
+#if defined (USE_AVX512)
+
+ constexpr IndexType kNumChunks512 = kPaddedInputDimensions / (kSimdWidth * 2);
+ constexpr IndexType kNumChunks256 = kPaddedInputDimensions / kSimdWidth;
+
const auto output = reinterpret_cast<OutputType*>(buffer);
- #if defined(USE_AVX512)
- constexpr IndexType kNumChunks = kPaddedInputDimensions / (kSimdWidth * 2);
- const auto input_vector = reinterpret_cast<const __m512i*>(input);
- #if !defined(USE_VNNI)
- const __m512i kOnes = _mm512_set1_epi16(1);
- #endif
+    // Since saturating a zmm register requires 64 bytes, we cannot
+    // use AVX512 for the smaller affine transforms. Instead we fall
+    // back to an AVX2 implementation when kInputDimensions is not a
+    // multiple of 64. Note that this means that, for example, for
+    // kInputDimensions of 96 we fall back to AVX2 even though the
+    // first 64 elements could be processed with AVX512.
+    // This is caused by mixing the __m256 and __m512 variables
+    // required to handle that case better, and avoiding it would
+    // require handling more cases statically so as not to lose
+    // performance. This should be revisited if such input dimensions
+    // are to be considered.
+ [[maybe_unused]] const auto input_vector512 = reinterpret_cast<const __m512i*>(input);
+ [[maybe_unused]] const auto input_vector256 = reinterpret_cast<const __m256i*>(input);
+
+ // kOutputDimensions is either 1 or a multiple of kSimdWidth
+ // because then it is also an input dimension.
+ if constexpr (kOutputDimensions % 16 == 0 && kNumChunks256 == 1)
+ {
+ for (IndexType i = 0; i < kOutputDimensions; i += 16)
+ {
+ const IndexType offset01a = (i + 0) * kPaddedInputDimensions;
+ const IndexType offset23a = (i + 2) * kPaddedInputDimensions;
+ const IndexType offset45a = (i + 4) * kPaddedInputDimensions;
+ const IndexType offset67a = (i + 6) * kPaddedInputDimensions;
+ const IndexType offset01b = (i + 8) * kPaddedInputDimensions;
+ const IndexType offset23b = (i + 10) * kPaddedInputDimensions;
+ const IndexType offset45b = (i + 12) * kPaddedInputDimensions;
+ const IndexType offset67b = (i + 14) * kPaddedInputDimensions;
+
+ const __m512i bias = *reinterpret_cast<const __m512i*>(&biases_[i]);
+ __m512i* outptr = reinterpret_cast<__m512i*>(&output[i]);
+
+ const auto row01a = *reinterpret_cast<const __m512i*>(&weights_[offset01a]);
+ const auto row23a = *reinterpret_cast<const __m512i*>(&weights_[offset23a]);
+ const auto row45a = *reinterpret_cast<const __m512i*>(&weights_[offset45a]);
+ const auto row67a = *reinterpret_cast<const __m512i*>(&weights_[offset67a]);
+ const auto row01b = *reinterpret_cast<const __m512i*>(&weights_[offset01b]);
+ const auto row23b = *reinterpret_cast<const __m512i*>(&weights_[offset23b]);
+ const auto row45b = *reinterpret_cast<const __m512i*>(&weights_[offset45b]);
+ const auto row67b = *reinterpret_cast<const __m512i*>(&weights_[offset67b]);
+
+ const __m256i in256 = input_vector256[0];
+ const __m512i in = _mm512_inserti64x4(_mm512_castsi256_si512(in256), in256, 1);
+
+#if defined (USE_VNNI)
+ __m512i sum01a = _mm512_setzero_si512();
+ __m512i sum23a = _mm512_setzero_si512();
+ __m512i sum45a = _mm512_setzero_si512();
+ __m512i sum67a = _mm512_setzero_si512();
+ __m512i sum01b = _mm512_setzero_si512();
+ __m512i sum23b = _mm512_setzero_si512();
+ __m512i sum45b = _mm512_setzero_si512();
+ __m512i sum67b = _mm512_setzero_si512();
+
+ m512_add_dpbusd_epi32(sum01a, in, row01a);
+ m512_add_dpbusd_epi32(sum23a, in, row23a);
+ m512_add_dpbusd_epi32(sum45a, in, row45a);
+ m512_add_dpbusd_epi32(sum67a, in, row67a);
+ m512_add_dpbusd_epi32(sum01b, in, row01b);
+ m512_add_dpbusd_epi32(sum23b, in, row23b);
+ m512_add_dpbusd_epi32(sum45b, in, row45b);
+ m512_add_dpbusd_epi32(sum67b, in, row67b);
+#else
+ __m512i sum01a = m512_dpbusd_epi32(in, row01a);
+ __m512i sum23a = m512_dpbusd_epi32(in, row23a);
+ __m512i sum45a = m512_dpbusd_epi32(in, row45a);
+ __m512i sum67a = m512_dpbusd_epi32(in, row67a);
+ __m512i sum01b = m512_dpbusd_epi32(in, row01b);
+ __m512i sum23b = m512_dpbusd_epi32(in, row23b);
+ __m512i sum45b = m512_dpbusd_epi32(in, row45b);
+ __m512i sum67b = m512_dpbusd_epi32(in, row67b);
+#endif
+
+ *outptr = m512_hadd256x16(
+ sum01a, sum23a, sum45a, sum67a,
+ sum01b, sum23b, sum45b, sum67b, bias);
+ }
+ }
+ else if constexpr (kOutputDimensions % 4 == 0)
+ {
+ for (IndexType i = 0; i < kOutputDimensions; i += 4)
+ {
+ const IndexType offset0 = (i + 0) * kPaddedInputDimensions;
+ const IndexType offset1 = (i + 1) * kPaddedInputDimensions;
+ const IndexType offset2 = (i + 2) * kPaddedInputDimensions;
+ const IndexType offset3 = (i + 3) * kPaddedInputDimensions;
+
+ const __m128i bias = *reinterpret_cast<const __m128i*>(&biases_[i]);
+ __m128i* outptr = reinterpret_cast<__m128i*>(&output[i]);
+
+ if constexpr (kPaddedInputDimensions % (kSimdWidth * 2) == 0)
+ {
+ const auto row0 = reinterpret_cast<const __m512i*>(&weights_[offset0]);
+ const auto row1 = reinterpret_cast<const __m512i*>(&weights_[offset1]);
+ const auto row2 = reinterpret_cast<const __m512i*>(&weights_[offset2]);
+ const auto row3 = reinterpret_cast<const __m512i*>(&weights_[offset3]);
+
+#if defined (USE_VNNI)
+ __m512i sum0 = _mm512_setzero_si512();
+ __m512i sum1 = _mm512_setzero_si512();
+ __m512i sum2 = _mm512_setzero_si512();
+ __m512i sum3 = _mm512_setzero_si512();
+ const IndexType kStart = 0;
+#else
+ __m512i sum0 = m512_dpbusd_epi32(input_vector512[0], row0[0]);
+ __m512i sum1 = m512_dpbusd_epi32(input_vector512[0], row1[0]);
+ __m512i sum2 = m512_dpbusd_epi32(input_vector512[0], row2[0]);
+ __m512i sum3 = m512_dpbusd_epi32(input_vector512[0], row3[0]);
+ const IndexType kStart = 1;
+#endif
+
+ for (IndexType j = kStart; j < kNumChunks512; ++j)
+ {
+ const __m512i in = input_vector512[j];
+
+#if defined (USE_VNNI)
+ m512_add_dpbusd_epi32(sum0, in, row0[j]);
+ m512_add_dpbusd_epi32(sum1, in, row1[j]);
+ m512_add_dpbusd_epi32(sum2, in, row2[j]);
+ m512_add_dpbusd_epi32(sum3, in, row3[j]);
+#else
+ sum0 = _mm512_add_epi32(sum0, m512_dpbusd_epi32(in, row0[j]));
+ sum1 = _mm512_add_epi32(sum1, m512_dpbusd_epi32(in, row1[j]));
+ sum2 = _mm512_add_epi32(sum2, m512_dpbusd_epi32(in, row2[j]));
+ sum3 = _mm512_add_epi32(sum3, m512_dpbusd_epi32(in, row3[j]));
+#endif
+ }
+
+ *outptr = m512_haddx4(sum0, sum1, sum2, sum3, bias);
+ }
+ else
+ {
+ const auto row0 = reinterpret_cast<const __m256i*>(&weights_[offset0]);
+ const auto row1 = reinterpret_cast<const __m256i*>(&weights_[offset1]);
+ const auto row2 = reinterpret_cast<const __m256i*>(&weights_[offset2]);
+ const auto row3 = reinterpret_cast<const __m256i*>(&weights_[offset3]);
+
+#if defined (USE_VNNI)
+ __m256i sum0 = _mm256_setzero_si256();
+ __m256i sum1 = _mm256_setzero_si256();
+ __m256i sum2 = _mm256_setzero_si256();
+ __m256i sum3 = _mm256_setzero_si256();
+ const IndexType kStart = 0;
+#else
+ __m256i sum0 = m256_dpbusd_epi32(input_vector256[0], row0[0]);
+ __m256i sum1 = m256_dpbusd_epi32(input_vector256[0], row1[0]);
+ __m256i sum2 = m256_dpbusd_epi32(input_vector256[0], row2[0]);
+ __m256i sum3 = m256_dpbusd_epi32(input_vector256[0], row3[0]);
+ const IndexType kStart = 1;
+#endif
+
+ for (IndexType j = kStart; j < kNumChunks256; ++j)
+ {
+ const __m256i in = input_vector256[j];
+
+#if defined (USE_VNNI)
+ m256_add_dpbusd_epi32(sum0, in, row0[j]);
+ m256_add_dpbusd_epi32(sum1, in, row1[j]);
+ m256_add_dpbusd_epi32(sum2, in, row2[j]);
+ m256_add_dpbusd_epi32(sum3, in, row3[j]);
+#else
+ sum0 = _mm256_add_epi32(sum0, m256_dpbusd_epi32(in, row0[j]));
+ sum1 = _mm256_add_epi32(sum1, m256_dpbusd_epi32(in, row1[j]));
+ sum2 = _mm256_add_epi32(sum2, m256_dpbusd_epi32(in, row2[j]));
+ sum3 = _mm256_add_epi32(sum3, m256_dpbusd_epi32(in, row3[j]));
+#endif
+ }
+
+ *outptr = m256_haddx4(sum0, sum1, sum2, sum3, bias);
+ }
+ }
+ }
+ else if constexpr (kOutputDimensions == 1)
+ {
+ if constexpr (kPaddedInputDimensions % (kSimdWidth * 2) == 0)
+ {
+ const auto row0 = reinterpret_cast<const __m512i*>(&weights_[0]);
+
+#if defined (USE_VNNI)
+ __m512i sum0 = _mm512_setzero_si512();
+ const IndexType kStart = 0;
+#else
+ __m512i sum0 = m512_dpbusd_epi32(input_vector512[0], row0[0]);
+ const IndexType kStart = 1;
+#endif
+
+ for (IndexType j = kStart; j < kNumChunks512; ++j)
+ {
+ const __m512i in = input_vector512[j];
+
+#if defined (USE_VNNI)
+ m512_add_dpbusd_epi32(sum0, in, row0[j]);
+#else
+ sum0 = _mm512_add_epi32(sum0, m512_dpbusd_epi32(in, row0[j]));
+#endif
+ }
+
+ output[0] = m512_hadd(sum0, biases_[0]);
+ }
+ else
+ {
+ const auto row0 = reinterpret_cast<const __m256i*>(&weights_[0]);
+
+#if defined (USE_VNNI)
+ __m256i sum0 = _mm256_setzero_si256();
+ const IndexType kStart = 0;
+#else
+ __m256i sum0 = m256_dpbusd_epi32(input_vector256[0], row0[0]);
+ const IndexType kStart = 1;
+#endif
+
+ for (IndexType j = kStart; j < kNumChunks256; ++j)
+ {
+ const __m256i in = input_vector256[j];
+
+#if defined (USE_VNNI)
+ m256_add_dpbusd_epi32(sum0, in, row0[j]);
+#else
+ sum0 = _mm256_add_epi32(sum0, m256_dpbusd_epi32(in, row0[j]));
+#endif
+ }
+
+ output[0] = m256_hadd(sum0, biases_[0]);
+ }
+ }
+ else
+ {
+ // This case can never happen because kOutputDimensions
+ // is always 1 or a multiple of kSimdWidth.
+ assert(false);
+ }
+
+#elif defined (USE_AVX2)
- #elif defined(USE_AVX2)
constexpr IndexType kNumChunks = kPaddedInputDimensions / kSimdWidth;
+
+ const auto output = reinterpret_cast<OutputType*>(buffer);
const auto input_vector = reinterpret_cast<const __m256i*>(input);
- #if !defined(USE_VNNI)
- const __m256i kOnes = _mm256_set1_epi16(1);
- #endif
- #elif defined(USE_SSE2)
+ // kOutputDimensions is either 1 or a multiple of kSimdWidth
+ // because then it is also an input dimension.
+ if constexpr (kOutputDimensions % 4 == 0)
+ {
+ for (IndexType i = 0; i < kOutputDimensions; i += 4)
+ {
+ const IndexType offset0 = (i + 0) * kPaddedInputDimensions;
+ const IndexType offset1 = (i + 1) * kPaddedInputDimensions;
+ const IndexType offset2 = (i + 2) * kPaddedInputDimensions;
+ const IndexType offset3 = (i + 3) * kPaddedInputDimensions;
+
+ const __m128i bias = *reinterpret_cast<const __m128i*>(&biases_[i]);
+ __m128i* outptr = reinterpret_cast<__m128i*>(&output[i]);
+
+ const auto row0 = reinterpret_cast<const __m256i*>(&weights_[offset0]);
+ const auto row1 = reinterpret_cast<const __m256i*>(&weights_[offset1]);
+ const auto row2 = reinterpret_cast<const __m256i*>(&weights_[offset2]);
+ const auto row3 = reinterpret_cast<const __m256i*>(&weights_[offset3]);
+
+#if defined (USE_VNNI)
+ __m256i sum0 = _mm256_setzero_si256();
+ __m256i sum1 = _mm256_setzero_si256();
+ __m256i sum2 = _mm256_setzero_si256();
+ __m256i sum3 = _mm256_setzero_si256();
+ const IndexType kStart = 0;
+#else
+ __m256i sum0 = m256_dpbusd_epi32(input_vector[0], row0[0]);
+ __m256i sum1 = m256_dpbusd_epi32(input_vector[0], row1[0]);
+ __m256i sum2 = m256_dpbusd_epi32(input_vector[0], row2[0]);
+ __m256i sum3 = m256_dpbusd_epi32(input_vector[0], row3[0]);
+ const IndexType kStart = 1;
+#endif
+
+ for (IndexType j = kStart; j < kNumChunks; ++j)
+ {
+ const __m256i in = input_vector[j];
+
+#if defined (USE_VNNI)
+ m256_add_dpbusd_epi32(sum0, in, row0[j]);
+ m256_add_dpbusd_epi32(sum1, in, row1[j]);
+ m256_add_dpbusd_epi32(sum2, in, row2[j]);
+ m256_add_dpbusd_epi32(sum3, in, row3[j]);
+#else
+ sum0 = _mm256_add_epi32(sum0, m256_dpbusd_epi32(in, row0[j]));
+ sum1 = _mm256_add_epi32(sum1, m256_dpbusd_epi32(in, row1[j]));
+ sum2 = _mm256_add_epi32(sum2, m256_dpbusd_epi32(in, row2[j]));
+ sum3 = _mm256_add_epi32(sum3, m256_dpbusd_epi32(in, row3[j]));
+#endif
+ }
+
+ *outptr = m256_haddx4(sum0, sum1, sum2, sum3, bias);
+ }
+ }
+ else if constexpr (kOutputDimensions == 1)
+ {
+ const auto row0 = reinterpret_cast<const __m256i*>(&weights_[0]);
+
+#if defined (USE_VNNI)
+ __m256i sum0 = _mm256_setzero_si256();
+ const IndexType kStart = 0;
+#else
+ __m256i sum0 = m256_dpbusd_epi32(input_vector[0], row0[0]);
+ const IndexType kStart = 1;
+#endif
+
+ for (IndexType j = kStart; j < kNumChunks; ++j)
+ {
+ const __m256i in = input_vector[j];
+
+#if defined (USE_VNNI)
+ m256_add_dpbusd_epi32(sum0, in, row0[j]);
+#else
+ sum0 = _mm256_add_epi32(sum0, m256_dpbusd_epi32(in, row0[j]));
+#endif
+ }
+
+ output[0] = m256_hadd(sum0, biases_[0]);
+ }
+ else
+ {
+ // This case can never happen because kOutputDimensions
+ // is always 1 or a multiple of kSimdWidth.
+ assert(false);
+ }
+
+#elif defined (USE_SSSE3)
+
+ constexpr IndexType kNumChunks = kPaddedInputDimensions / kSimdWidth;
+
+ auto output = reinterpret_cast<OutputType*>(buffer);
+ const auto input_vector = reinterpret_cast<const __m128i*>(input);
+
+ // kOutputDimensions is either 1 or a multiple of kSimdWidth
+ // because then it is also an input dimension.
+ if constexpr (kOutputDimensions % 4 == 0)
+ {
+ for (IndexType i = 0; i < kOutputDimensions; i += 4)
+ {
+ const IndexType offset0 = (i + 0) * kPaddedInputDimensions;
+ const IndexType offset1 = (i + 1) * kPaddedInputDimensions;
+ const IndexType offset2 = (i + 2) * kPaddedInputDimensions;
+ const IndexType offset3 = (i + 3) * kPaddedInputDimensions;
+
+ const __m128i bias = *reinterpret_cast<const __m128i*>(&biases_[i]);
+ __m128i* outptr = reinterpret_cast<__m128i*>(&output[i]);
+
+ const auto row0 = reinterpret_cast<const __m128i*>(&weights_[offset0]);
+ const auto row1 = reinterpret_cast<const __m128i*>(&weights_[offset1]);
+ const auto row2 = reinterpret_cast<const __m128i*>(&weights_[offset2]);
+ const auto row3 = reinterpret_cast<const __m128i*>(&weights_[offset3]);
+
+ __m128i sum0 = m128_dpbusd_epi32(input_vector[0], row0[0]);
+ __m128i sum1 = m128_dpbusd_epi32(input_vector[0], row1[0]);
+ __m128i sum2 = m128_dpbusd_epi32(input_vector[0], row2[0]);
+ __m128i sum3 = m128_dpbusd_epi32(input_vector[0], row3[0]);
+
+ for (int j = 1; j < (int)kNumChunks; ++j)
+ {
+ const __m128i in = input_vector[j];
+
+ sum0 = _mm_add_epi32(sum0, m128_dpbusd_epi32(in, row0[j]));
+ sum1 = _mm_add_epi32(sum1, m128_dpbusd_epi32(in, row1[j]));
+ sum2 = _mm_add_epi32(sum2, m128_dpbusd_epi32(in, row2[j]));
+ sum3 = _mm_add_epi32(sum3, m128_dpbusd_epi32(in, row3[j]));
+ }
+
+ *outptr = m128_haddx4(sum0, sum1, sum2, sum3, bias);
+ }
+ }
+ else if constexpr (kOutputDimensions == 1)
+ {
+ const auto row0 = reinterpret_cast<const __m128i*>(&weights_[0]);
+
+ __m128i sum0 = m128_dpbusd_epi32(input_vector[0], row0[0]);
+
+ for (int j = 1; j < (int)kNumChunks; ++j)
+ sum0 = _mm_add_epi32(sum0, m128_dpbusd_epi32(input_vector[j], row0[j]));
+
+ output[0] = m128_hadd(sum0, biases_[0]);
+ }
+ else
+ {
+ // This case can never happen because kOutputDimensions
+ // is always 1 or a multiple of kSimdWidth.
+ assert(false);
+ }
+
+#else
+
+// Use the old implementation for the other architectures.
+
+ auto output = reinterpret_cast<OutputType*>(buffer);
+
+#if defined(USE_SSE2)
constexpr IndexType kNumChunks = kPaddedInputDimensions / kSimdWidth;
- #ifndef USE_SSSE3
+#ifndef USE_SSSE3
const __m128i kZeros = _mm_setzero_si128();
- #else
+#else
const __m128i kOnes = _mm_set1_epi16(1);
- #endif
+#endif
const auto input_vector = reinterpret_cast<const __m128i*>(input);
- #elif defined(USE_MMX)
+#elif defined(USE_MMX)
constexpr IndexType kNumChunks = kPaddedInputDimensions / kSimdWidth;
const __m64 kZeros = _mm_setzero_si64();
const auto input_vector = reinterpret_cast<const __m64*>(input);
- #elif defined(USE_NEON)
+#elif defined(USE_NEON)
constexpr IndexType kNumChunks = kPaddedInputDimensions / kSimdWidth;
const auto input_vector = reinterpret_cast<const int8x8_t*>(input);
- #endif
+#endif
for (IndexType i = 0; i < kOutputDimensions; ++i) {
const IndexType offset = i * kPaddedInputDimensions;
- #if defined(USE_AVX512)
- __m512i sum = _mm512_setzero_si512();
- const auto row = reinterpret_cast<const __m512i*>(&weights_[offset]);
- for (IndexType j = 0; j < kNumChunks; ++j) {
- #if defined(USE_VNNI)
- sum = _mm512_dpbusd_epi32(sum, _mm512_loadA_si512(&input_vector[j]), _mm512_load_si512(&row[j]));
- #else
- __m512i product = _mm512_maddubs_epi16(_mm512_loadA_si512(&input_vector[j]), _mm512_load_si512(&row[j]));
- product = _mm512_madd_epi16(product, kOnes);
- sum = _mm512_add_epi32(sum, product);
- #endif
- }
-
- // Note: Changing kMaxSimdWidth from 32 to 64 breaks loading existing networks.
- // As a result kPaddedInputDimensions may not be an even multiple of 64(512bit)
- // and we have to do one more 256bit chunk.
- if (kPaddedInputDimensions != kNumChunks * kSimdWidth * 2)
- {
- const auto iv256 = reinterpret_cast<const __m256i*>(&input_vector[kNumChunks]);
- const auto row256 = reinterpret_cast<const __m256i*>(&row[kNumChunks]);
- #if defined(USE_VNNI)
- __m256i product256 = _mm256_dpbusd_epi32(
- _mm512_castsi512_si256(sum), _mm256_loadA_si256(&iv256[0]), _mm256_load_si256(&row256[0]));
- sum = _mm512_inserti32x8(sum, product256, 0);
- #else
- __m256i product256 = _mm256_maddubs_epi16(_mm256_loadA_si256(&iv256[0]), _mm256_load_si256(&row256[0]));
- sum = _mm512_add_epi32(sum, _mm512_cvtepi16_epi32(product256));
- #endif
- }
- output[i] = _mm512_reduce_add_epi32(sum) + biases_[i];
-
- #elif defined(USE_AVX2)
- __m256i sum = _mm256_setzero_si256();
- const auto row = reinterpret_cast<const __m256i*>(&weights_[offset]);
- for (IndexType j = 0; j < kNumChunks; ++j) {
- #if defined(USE_VNNI)
- sum = _mm256_dpbusd_epi32(sum, _mm256_loadA_si256(&input_vector[j]), _mm256_load_si256(&row[j]));
- #else
- __m256i product = _mm256_maddubs_epi16(_mm256_loadA_si256(&input_vector[j]), _mm256_load_si256(&row[j]));
- product = _mm256_madd_epi16(product, kOnes);
- sum = _mm256_add_epi32(sum, product);
- #endif
- }
- __m128i sum128 = _mm_add_epi32(_mm256_castsi256_si128(sum), _mm256_extracti128_si256(sum, 1));
- sum128 = _mm_add_epi32(sum128, _mm_shuffle_epi32(sum128, _MM_PERM_BADC));
- sum128 = _mm_add_epi32(sum128, _mm_shuffle_epi32(sum128, _MM_PERM_CDAB));
- output[i] = _mm_cvtsi128_si32(sum128) + biases_[i];
-
- #elif defined(USE_SSSE3)
- __m128i sum = _mm_setzero_si128();
- const auto row = reinterpret_cast<const __m128i*>(&weights_[offset]);
- for (int j = 0; j < (int)kNumChunks - 1; j += 2) {
- __m128i product0 = _mm_maddubs_epi16(_mm_load_si128(&input_vector[j]), _mm_load_si128(&row[j]));
- product0 = _mm_madd_epi16(product0, kOnes);
- sum = _mm_add_epi32(sum, product0);
- __m128i product1 = _mm_maddubs_epi16(_mm_load_si128(&input_vector[j+1]), _mm_load_si128(&row[j+1]));
- product1 = _mm_madd_epi16(product1, kOnes);
- sum = _mm_add_epi32(sum, product1);
- }
- if (kNumChunks & 0x1) {
- __m128i product = _mm_maddubs_epi16(_mm_load_si128(&input_vector[kNumChunks-1]), _mm_load_si128(&row[kNumChunks-1]));
- product = _mm_madd_epi16(product, kOnes);
- sum = _mm_add_epi32(sum, product);
- }
- sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, 0x4E)); //_MM_PERM_BADC
- sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, 0xB1)); //_MM_PERM_CDAB
- output[i] = _mm_cvtsi128_si32(sum) + biases_[i];
-
- #elif defined(USE_SSE2)
+#if defined(USE_SSE2)
__m128i sum_lo = _mm_cvtsi32_si128(biases_[i]);
__m128i sum_hi = kZeros;
const auto row = reinterpret_cast<const __m128i*>(&weights_[offset]);
sum = _mm_add_epi32(sum, sum_second_32);
output[i] = _mm_cvtsi128_si32(sum);
- #elif defined(USE_MMX)
+#elif defined(USE_MMX)
__m64 sum_lo = _mm_cvtsi32_si64(biases_[i]);
__m64 sum_hi = kZeros;
const auto row = reinterpret_cast<const __m64*>(&weights_[offset]);
sum = _mm_add_pi32(sum, _mm_unpackhi_pi32(sum, sum));
output[i] = _mm_cvtsi64_si32(sum);
- #elif defined(USE_NEON)
+#elif defined(USE_NEON)
int32x4_t sum = {biases_[i]};
const auto row = reinterpret_cast<const int8x8_t*>(&weights_[offset]);
for (IndexType j = 0; j < kNumChunks; ++j) {
}
output[i] = sum[0] + sum[1] + sum[2] + sum[3];
- #else
+#else
OutputType sum = biases_[i];
for (IndexType j = 0; j < kInputDimensions; ++j) {
sum += weights_[offset + j] * input[j];
}
output[i] = sum;
- #endif
+#endif
}
- #if defined(USE_MMX)
+#if defined(USE_MMX)
_mm_empty();
- #endif
+#endif
+
+#endif
+
return output;
}
const auto out = reinterpret_cast<__m256i*>(output);
for (IndexType i = 0; i < kNumChunks; ++i) {
const __m256i words0 = _mm256_srai_epi16(_mm256_packs_epi32(
- _mm256_loadA_si256(&in[i * 4 + 0]),
- _mm256_loadA_si256(&in[i * 4 + 1])), kWeightScaleBits);
+ _mm256_load_si256(&in[i * 4 + 0]),
+ _mm256_load_si256(&in[i * 4 + 1])), kWeightScaleBits);
const __m256i words1 = _mm256_srai_epi16(_mm256_packs_epi32(
- _mm256_loadA_si256(&in[i * 4 + 2]),
- _mm256_loadA_si256(&in[i * 4 + 3])), kWeightScaleBits);
- _mm256_storeA_si256(&out[i], _mm256_permutevar8x32_epi32(_mm256_max_epi8(
+ _mm256_load_si256(&in[i * 4 + 2]),
+ _mm256_load_si256(&in[i * 4 + 3])), kWeightScaleBits);
+ _mm256_store_si256(&out[i], _mm256_permutevar8x32_epi32(_mm256_max_epi8(
_mm256_packs_epi16(words0, words1), kZero), kOffsets));
}
constexpr IndexType kStart = kNumChunks * kSimdWidth;
namespace Eval::NNUE {
+  // The accumulator of a StateInfo without a parent is set to the INIT state
+ enum AccumulatorState { EMPTY, COMPUTED, INIT };
+
// Class that holds the result of affine transformation of input features
struct alignas(kCacheLineSize) Accumulator {
std::int16_t
accumulation[2][kRefreshTriggers.size()][kTransformedFeatureDimensions];
- bool computed_accumulation;
+ AccumulatorState state[2];
};
} // namespace Eval::NNUE
#include <arm_neon.h>
#endif
-// HACK: Use _mm256_loadu_si256() instead of _mm256_load_si256. Otherwise a binary
-// compiled with older g++ crashes because the output memory is not aligned
-// even though alignas is specified.
-#if defined(USE_AVX2)
-#if defined(__GNUC__ ) && (__GNUC__ < 9) && defined(_WIN32) && !defined(__clang__)
-#define _mm256_loadA_si256 _mm256_loadu_si256
-#define _mm256_storeA_si256 _mm256_storeu_si256
-#else
-#define _mm256_loadA_si256 _mm256_load_si256
-#define _mm256_storeA_si256 _mm256_store_si256
-#endif
-#endif
-
-#if defined(USE_AVX512)
-#if defined(__GNUC__ ) && (__GNUC__ < 9) && defined(_WIN32) && !defined(__clang__)
-#define _mm512_loadA_si512 _mm512_loadu_si512
-#define _mm512_storeA_si512 _mm512_storeu_si512
-#else
-#define _mm512_loadA_si512 _mm512_load_si512
-#define _mm512_storeA_si512 _mm512_store_si512
-#endif
-#endif
-
namespace Eval::NNUE {
// Version of the evaluation file
PS_END2 = 12 * SQUARE_NB + 1
};
- extern uint32_t kpp_board_index[PIECE_NB][COLOR_NB];
+ constexpr uint32_t kpp_board_index[COLOR_NB][PIECE_NB] = {
+ // Convention: W - us, B - them
+ // Viewed from the other side, W and B are reversed
+ { PS_NONE, PS_W_PAWN, PS_W_KNIGHT, PS_W_BISHOP, PS_W_ROOK, PS_W_QUEEN, PS_W_KING, PS_NONE,
+ PS_NONE, PS_B_PAWN, PS_B_KNIGHT, PS_B_BISHOP, PS_B_ROOK, PS_B_QUEEN, PS_B_KING, PS_NONE },
+ { PS_NONE, PS_B_PAWN, PS_B_KNIGHT, PS_B_BISHOP, PS_B_ROOK, PS_B_QUEEN, PS_B_KING, PS_NONE,
+ PS_NONE, PS_W_PAWN, PS_W_KNIGHT, PS_W_BISHOP, PS_W_ROOK, PS_W_QUEEN, PS_W_KING, PS_NONE }
+ };
// Type of input feature after conversion
using TransformedFeatureType = std::uint8_t;
namespace Eval::NNUE {
+ // If vector instructions are enabled, we update and refresh the
+ // accumulator tile by tile such that each tile fits in the CPU's
+ // vector registers.
+ #define VECTOR
+
+ #ifdef USE_AVX512
+ typedef __m512i vec_t;
+ #define vec_load(a) _mm512_load_si512(a)
+ #define vec_store(a,b) _mm512_store_si512(a,b)
+ #define vec_add_16(a,b) _mm512_add_epi16(a,b)
+ #define vec_sub_16(a,b) _mm512_sub_epi16(a,b)
+ static constexpr IndexType kNumRegs = 8; // only 8 are needed
+
+ #elif USE_AVX2
+ typedef __m256i vec_t;
+ #define vec_load(a) _mm256_load_si256(a)
+ #define vec_store(a,b) _mm256_store_si256(a,b)
+ #define vec_add_16(a,b) _mm256_add_epi16(a,b)
+ #define vec_sub_16(a,b) _mm256_sub_epi16(a,b)
+ static constexpr IndexType kNumRegs = 16;
+
+ #elif USE_SSE2
+ typedef __m128i vec_t;
+ #define vec_load(a) (*(a))
+ #define vec_store(a,b) *(a)=(b)
+ #define vec_add_16(a,b) _mm_add_epi16(a,b)
+ #define vec_sub_16(a,b) _mm_sub_epi16(a,b)
+ static constexpr IndexType kNumRegs = Is64Bit ? 16 : 8;
+
+ #elif USE_MMX
+ typedef __m64 vec_t;
+ #define vec_load(a) (*(a))
+ #define vec_store(a,b) *(a)=(b)
+ #define vec_add_16(a,b) _mm_add_pi16(a,b)
+ #define vec_sub_16(a,b) _mm_sub_pi16(a,b)
+ static constexpr IndexType kNumRegs = 8;
+
+ #elif USE_NEON
+ typedef int16x8_t vec_t;
+ #define vec_load(a) (*(a))
+ #define vec_store(a,b) *(a)=(b)
+ #define vec_add_16(a,b) vaddq_s16(a,b)
+ #define vec_sub_16(a,b) vsubq_s16(a,b)
+ static constexpr IndexType kNumRegs = 16;
+
+ #else
+ #undef VECTOR
+
+ #endif
+
// Input feature converter
class FeatureTransformer {
// Number of output dimensions for one side
static constexpr IndexType kHalfDimensions = kTransformedFeatureDimensions;
+ #ifdef VECTOR
+ static constexpr IndexType kTileHeight = kNumRegs * sizeof(vec_t) / 2;
+ static_assert(kHalfDimensions % kTileHeight == 0, "kTileHeight must divide kHalfDimensions");
+ #endif
+
public:
// Output type
using OutputType = TransformedFeatureType;
return !stream.fail();
}
- // Proceed with the difference calculation if possible
- bool UpdateAccumulatorIfPossible(const Position& pos) const {
-
- const auto now = pos.state();
- if (now->accumulator.computed_accumulation)
- return true;
-
- const auto prev = now->previous;
- if (prev && prev->accumulator.computed_accumulation) {
- UpdateAccumulator(pos);
- return true;
- }
-
- return false;
- }
-
// Convert input features
void Transform(const Position& pos, OutputType* output) const {
- if (!UpdateAccumulatorIfPossible(pos))
- RefreshAccumulator(pos);
+ UpdateAccumulator(pos, WHITE);
+ UpdateAccumulator(pos, BLACK);
const auto& accumulation = pos.state()->accumulator.accumulation;
- #if defined(USE_AVX2)
+ #if defined(USE_AVX512)
+ constexpr IndexType kNumChunks = kHalfDimensions / (kSimdWidth * 2);
+ static_assert(kHalfDimensions % (kSimdWidth * 2) == 0);
+ const __m512i kControl = _mm512_setr_epi64(0, 2, 4, 6, 1, 3, 5, 7);
+ const __m512i kZero = _mm512_setzero_si512();
+
+ #elif defined(USE_AVX2)
constexpr IndexType kNumChunks = kHalfDimensions / kSimdWidth;
constexpr int kControl = 0b11011000;
const __m256i kZero = _mm256_setzero_si256();
for (IndexType p = 0; p < 2; ++p) {
const IndexType offset = kHalfDimensions * p;
- #if defined(USE_AVX2)
+ #if defined(USE_AVX512)
+ auto out = reinterpret_cast<__m512i*>(&output[offset]);
+ for (IndexType j = 0; j < kNumChunks; ++j) {
+ __m512i sum0 = _mm512_load_si512(
+ &reinterpret_cast<const __m512i*>(accumulation[perspectives[p]][0])[j * 2 + 0]);
+ __m512i sum1 = _mm512_load_si512(
+ &reinterpret_cast<const __m512i*>(accumulation[perspectives[p]][0])[j * 2 + 1]);
+ _mm512_store_si512(&out[j], _mm512_permutexvar_epi64(kControl,
+ _mm512_max_epi8(_mm512_packs_epi16(sum0, sum1), kZero)));
+ }
+
+ #elif defined(USE_AVX2)
auto out = reinterpret_cast<__m256i*>(&output[offset]);
for (IndexType j = 0; j < kNumChunks; ++j) {
- __m256i sum0 = _mm256_loadA_si256(
+ __m256i sum0 = _mm256_load_si256(
&reinterpret_cast<const __m256i*>(accumulation[perspectives[p]][0])[j * 2 + 0]);
- __m256i sum1 = _mm256_loadA_si256(
- &reinterpret_cast<const __m256i*>(accumulation[perspectives[p]][0])[j * 2 + 1]);
- _mm256_storeA_si256(&out[j], _mm256_permute4x64_epi64(_mm256_max_epi8(
+ __m256i sum1 = _mm256_load_si256(
+ &reinterpret_cast<const __m256i*>(accumulation[perspectives[p]][0])[j * 2 + 1]);
+ _mm256_store_si256(&out[j], _mm256_permute4x64_epi64(_mm256_max_epi8(
_mm256_packs_epi16(sum0, sum1), kZero), kControl));
}
_mm_store_si128(&out[j],
#ifdef USE_SSE41
- _mm_max_epi8(packedbytes, kZero)
+ _mm_max_epi8(packedbytes, kZero)
#else
- _mm_subs_epi8(_mm_adds_epi8(packedbytes, k0x80s), k0x80s)
+ _mm_subs_epi8(_mm_adds_epi8(packedbytes, k0x80s), k0x80s)
#endif
);
}
private:
- // Calculate cumulative value without using difference calculation
- void RefreshAccumulator(const Position& pos) const {
-
- auto& accumulator = pos.state()->accumulator;
- IndexType i = 0;
- Features::IndexList active_indices[2];
- RawFeatures::AppendActiveIndices(pos, kRefreshTriggers[i],
- active_indices);
- for (Color perspective : { WHITE, BLACK }) {
- std::memcpy(accumulator.accumulation[perspective][i], biases_,
- kHalfDimensions * sizeof(BiasType));
- for (const auto index : active_indices[perspective]) {
- const IndexType offset = kHalfDimensions * index;
- #if defined(USE_AVX512)
- auto accumulation = reinterpret_cast<__m512i*>(
- &accumulator.accumulation[perspective][i][0]);
- auto column = reinterpret_cast<const __m512i*>(&weights_[offset]);
- constexpr IndexType kNumChunks = kHalfDimensions / kSimdWidth;
- for (IndexType j = 0; j < kNumChunks; ++j)
- _mm512_storeA_si512(&accumulation[j], _mm512_add_epi16(_mm512_loadA_si512(&accumulation[j]), column[j]));
-
- #elif defined(USE_AVX2)
- auto accumulation = reinterpret_cast<__m256i*>(
- &accumulator.accumulation[perspective][i][0]);
- auto column = reinterpret_cast<const __m256i*>(&weights_[offset]);
- constexpr IndexType kNumChunks = kHalfDimensions / (kSimdWidth / 2);
- for (IndexType j = 0; j < kNumChunks; ++j)
- _mm256_storeA_si256(&accumulation[j], _mm256_add_epi16(_mm256_loadA_si256(&accumulation[j]), column[j]));
-
- #elif defined(USE_SSE2)
- auto accumulation = reinterpret_cast<__m128i*>(
- &accumulator.accumulation[perspective][i][0]);
- auto column = reinterpret_cast<const __m128i*>(&weights_[offset]);
- constexpr IndexType kNumChunks = kHalfDimensions / (kSimdWidth / 2);
- for (IndexType j = 0; j < kNumChunks; ++j)
- accumulation[j] = _mm_add_epi16(accumulation[j], column[j]);
-
- #elif defined(USE_MMX)
- auto accumulation = reinterpret_cast<__m64*>(
- &accumulator.accumulation[perspective][i][0]);
- auto column = reinterpret_cast<const __m64*>(&weights_[offset]);
- constexpr IndexType kNumChunks = kHalfDimensions / (kSimdWidth / 2);
- for (IndexType j = 0; j < kNumChunks; ++j)
- accumulation[j] = _mm_add_pi16(accumulation[j], column[j]);
-
- #elif defined(USE_NEON)
- auto accumulation = reinterpret_cast<int16x8_t*>(
- &accumulator.accumulation[perspective][i][0]);
- auto column = reinterpret_cast<const int16x8_t*>(&weights_[offset]);
- constexpr IndexType kNumChunks = kHalfDimensions / (kSimdWidth / 2);
- for (IndexType j = 0; j < kNumChunks; ++j)
- accumulation[j] = vaddq_s16(accumulation[j], column[j]);
+ void UpdateAccumulator(const Position& pos, const Color c) const {
- #else
- for (IndexType j = 0; j < kHalfDimensions; ++j)
- accumulator.accumulation[perspective][i][j] += weights_[offset + j];
+ #ifdef VECTOR
+ // Gcc-10.2 unnecessarily spills AVX2 registers if this array
+ // is defined in the VECTOR code below, once in each branch
+ vec_t acc[kNumRegs];
#endif
- }
+ // Look for a usable accumulator of an earlier position. We keep track
+ // of the estimated gain in terms of features to be added/subtracted.
+ StateInfo *st = pos.state(), *next = nullptr;
+ int gain = pos.count<ALL_PIECES>() - 2;
+ while (st->accumulator.state[c] == EMPTY)
+ {
+ auto& dp = st->dirtyPiece;
+ // The first condition tests whether an incremental update is
+ // possible at all: if this side's king has moved, it is not possible.
+ static_assert(std::is_same_v<RawFeatures::SortedTriggerSet,
+ Features::CompileTimeList<Features::TriggerEvent, Features::TriggerEvent::kFriendKingMoved>>,
+ "Current code assumes that only the kFriendKingMoved refresh trigger is being used.");
+ if ( dp.piece[0] == make_piece(c, KING)
+ || (gain -= dp.dirty_num + 1) < 0)
+ break;
+ next = st;
+ st = st->previous;
}
- #if defined(USE_MMX)
- _mm_empty();
- #endif
-
- accumulator.computed_accumulation = true;
- }
-
- // Calculate cumulative value using difference calculation
- void UpdateAccumulator(const Position& pos) const {
-
- const auto prev_accumulator = pos.state()->previous->accumulator;
- auto& accumulator = pos.state()->accumulator;
- IndexType i = 0;
- Features::IndexList removed_indices[2], added_indices[2];
- bool reset[2];
- RawFeatures::AppendChangedIndices(pos, kRefreshTriggers[i],
- removed_indices, added_indices, reset);
- for (Color perspective : { WHITE, BLACK }) {
- #if defined(USE_AVX2)
- constexpr IndexType kNumChunks = kHalfDimensions / (kSimdWidth / 2);
- auto accumulation = reinterpret_cast<__m256i*>(
- &accumulator.accumulation[perspective][i][0]);
-
- #elif defined(USE_SSE2)
- constexpr IndexType kNumChunks = kHalfDimensions / (kSimdWidth / 2);
- auto accumulation = reinterpret_cast<__m128i*>(
- &accumulator.accumulation[perspective][i][0]);
-
- #elif defined(USE_MMX)
- constexpr IndexType kNumChunks = kHalfDimensions / (kSimdWidth / 2);
- auto accumulation = reinterpret_cast<__m64*>(
- &accumulator.accumulation[perspective][i][0]);
+ if (st->accumulator.state[c] == COMPUTED)
+ {
+ if (next == nullptr)
+ return;
+
+ // Update incrementally in two steps. First, we update the "next"
+ // accumulator. Then, we update the current accumulator (pos.state()).
+
+ // Gather all features to be updated. This code assumes HalfKP features
+ // only and doesn't support refresh triggers.
+ static_assert(std::is_same_v<Features::FeatureSet<Features::HalfKP<Features::Side::kFriend>>,
+ RawFeatures>);
+ Features::IndexList removed[2], added[2];
+ Features::HalfKP<Features::Side::kFriend>::AppendChangedIndices(pos,
+ next->dirtyPiece, c, &removed[0], &added[0]);
+ for (StateInfo *st2 = pos.state(); st2 != next; st2 = st2->previous)
+ Features::HalfKP<Features::Side::kFriend>::AppendChangedIndices(pos,
+ st2->dirtyPiece, c, &removed[1], &added[1]);
+
+ // Mark the accumulators as computed.
+ next->accumulator.state[c] = COMPUTED;
+ pos.state()->accumulator.state[c] = COMPUTED;
+
+ // Now update the accumulators listed in info[], where the last element is a sentinel.
+ StateInfo *info[3] =
+ { next, next == pos.state() ? nullptr : pos.state(), nullptr };
+ #ifdef VECTOR
+ for (IndexType j = 0; j < kHalfDimensions / kTileHeight; ++j)
+ {
+ // Load accumulator
+ auto accTile = reinterpret_cast<vec_t*>(
+ &st->accumulator.accumulation[c][0][j * kTileHeight]);
+ for (IndexType k = 0; k < kNumRegs; ++k)
+ acc[k] = vec_load(&accTile[k]);
+
+ for (IndexType i = 0; info[i]; ++i)
+ {
+ // Difference calculation for the deactivated features
+ for (const auto index : removed[i])
+ {
+ const IndexType offset = kHalfDimensions * index + j * kTileHeight;
+ auto column = reinterpret_cast<const vec_t*>(&weights_[offset]);
+ for (IndexType k = 0; k < kNumRegs; ++k)
+ acc[k] = vec_sub_16(acc[k], column[k]);
+ }
+
+ // Difference calculation for the activated features
+ for (const auto index : added[i])
+ {
+ const IndexType offset = kHalfDimensions * index + j * kTileHeight;
+ auto column = reinterpret_cast<const vec_t*>(&weights_[offset]);
+ for (IndexType k = 0; k < kNumRegs; ++k)
+ acc[k] = vec_add_16(acc[k], column[k]);
+ }
+
+ // Store accumulator
+ accTile = reinterpret_cast<vec_t*>(
+ &info[i]->accumulator.accumulation[c][0][j * kTileHeight]);
+ for (IndexType k = 0; k < kNumRegs; ++k)
+ vec_store(&accTile[k], acc[k]);
+ }
+ }
- #elif defined(USE_NEON)
- constexpr IndexType kNumChunks = kHalfDimensions / (kSimdWidth / 2);
- auto accumulation = reinterpret_cast<int16x8_t*>(
- &accumulator.accumulation[perspective][i][0]);
- #endif
+ #else
+ for (IndexType i = 0; info[i]; ++i)
+ {
+ std::memcpy(info[i]->accumulator.accumulation[c][0],
+ st->accumulator.accumulation[c][0],
+ kHalfDimensions * sizeof(BiasType));
+ st = info[i];
- if (reset[perspective]) {
- std::memcpy(accumulator.accumulation[perspective][i], biases_,
- kHalfDimensions * sizeof(BiasType));
- } else {
- std::memcpy(accumulator.accumulation[perspective][i],
- prev_accumulator.accumulation[perspective][i],
- kHalfDimensions * sizeof(BiasType));
// Difference calculation for the deactivated features
- for (const auto index : removed_indices[perspective]) {
+ for (const auto index : removed[i])
+ {
const IndexType offset = kHalfDimensions * index;
- #if defined(USE_AVX2)
- auto column = reinterpret_cast<const __m256i*>(&weights_[offset]);
- for (IndexType j = 0; j < kNumChunks; ++j)
- accumulation[j] = _mm256_sub_epi16(accumulation[j], column[j]);
-
- #elif defined(USE_SSE2)
- auto column = reinterpret_cast<const __m128i*>(&weights_[offset]);
- for (IndexType j = 0; j < kNumChunks; ++j)
- accumulation[j] = _mm_sub_epi16(accumulation[j], column[j]);
-
- #elif defined(USE_MMX)
- auto column = reinterpret_cast<const __m64*>(&weights_[offset]);
- for (IndexType j = 0; j < kNumChunks; ++j)
- accumulation[j] = _mm_sub_pi16(accumulation[j], column[j]);
+ for (IndexType j = 0; j < kHalfDimensions; ++j)
+ st->accumulator.accumulation[c][0][j] -= weights_[offset + j];
+ }
- #elif defined(USE_NEON)
- auto column = reinterpret_cast<const int16x8_t*>(&weights_[offset]);
- for (IndexType j = 0; j < kNumChunks; ++j)
- accumulation[j] = vsubq_s16(accumulation[j], column[j]);
+ // Difference calculation for the activated features
+ for (const auto index : added[i])
+ {
+ const IndexType offset = kHalfDimensions * index;
- #else
for (IndexType j = 0; j < kHalfDimensions; ++j)
- accumulator.accumulation[perspective][i][j] -= weights_[offset + j];
- #endif
-
+ st->accumulator.accumulation[c][0][j] += weights_[offset + j];
}
}
- { // Difference calculation for the activated features
- for (const auto index : added_indices[perspective]) {
- const IndexType offset = kHalfDimensions * index;
-
- #if defined(USE_AVX2)
- auto column = reinterpret_cast<const __m256i*>(&weights_[offset]);
- for (IndexType j = 0; j < kNumChunks; ++j)
- accumulation[j] = _mm256_add_epi16(accumulation[j], column[j]);
-
- #elif defined(USE_SSE2)
- auto column = reinterpret_cast<const __m128i*>(&weights_[offset]);
- for (IndexType j = 0; j < kNumChunks; ++j)
- accumulation[j] = _mm_add_epi16(accumulation[j], column[j]);
-
- #elif defined(USE_MMX)
- auto column = reinterpret_cast<const __m64*>(&weights_[offset]);
- for (IndexType j = 0; j < kNumChunks; ++j)
- accumulation[j] = _mm_add_pi16(accumulation[j], column[j]);
+ #endif
+ }
+ else
+ {
+ // Refresh the accumulator
+ auto& accumulator = pos.state()->accumulator;
+ accumulator.state[c] = COMPUTED;
+ Features::IndexList active;
+ Features::HalfKP<Features::Side::kFriend>::AppendActiveIndices(pos, c, &active);
+
+ #ifdef VECTOR
+ for (IndexType j = 0; j < kHalfDimensions / kTileHeight; ++j)
+ {
+ auto biasesTile = reinterpret_cast<const vec_t*>(
+ &biases_[j * kTileHeight]);
+ for (IndexType k = 0; k < kNumRegs; ++k)
+ acc[k] = biasesTile[k];
+
+ for (const auto index : active)
+ {
+ const IndexType offset = kHalfDimensions * index + j * kTileHeight;
+ auto column = reinterpret_cast<const vec_t*>(&weights_[offset]);
+
+ for (unsigned k = 0; k < kNumRegs; ++k)
+ acc[k] = vec_add_16(acc[k], column[k]);
+ }
- #elif defined(USE_NEON)
- auto column = reinterpret_cast<const int16x8_t*>(&weights_[offset]);
- for (IndexType j = 0; j < kNumChunks; ++j)
- accumulation[j] = vaddq_s16(accumulation[j], column[j]);
+ auto accTile = reinterpret_cast<vec_t*>(
+ &accumulator.accumulation[c][0][j * kTileHeight]);
+ for (unsigned k = 0; k < kNumRegs; k++)
+ vec_store(&accTile[k], acc[k]);
+ }
#else
- for (IndexType j = 0; j < kHalfDimensions; ++j)
- accumulator.accumulation[perspective][i][j] += weights_[offset + j];
- #endif
+ std::memcpy(accumulator.accumulation[c][0], biases_,
+ kHalfDimensions * sizeof(BiasType));
- }
+ for (const auto index : active)
+ {
+ const IndexType offset = kHalfDimensions * index;
+
+ for (IndexType j = 0; j < kHalfDimensions; ++j)
+ accumulator.accumulation[c][0][j] += weights_[offset + j];
}
+ #endif
}
+
#if defined(USE_MMX)
_mm_empty();
#endif
-
- accumulator.computed_accumulation = true;
}
using BiasType = std::int16_t;
#define S(mg, eg) make_score(mg, eg)
// Pawn penalties
- constexpr Score Backward = S( 8, 27);
- constexpr Score Doubled = S(11, 55);
- constexpr Score Isolated = S( 5, 17);
- constexpr Score WeakLever = S( 2, 54);
- constexpr Score WeakUnopposed = S(15, 25);
+ constexpr Score Backward = S( 8, 25);
+ constexpr Score Doubled = S(10, 55);
+ constexpr Score Isolated = S( 3, 15);
+ constexpr Score WeakLever = S( 3, 55);
+ constexpr Score WeakUnopposed = S(13, 25);
// Bonus for blocked pawns at 5th or 6th rank
- constexpr Score BlockedPawn[2] = { S(-13, -4), S(-4, 3) };
+ constexpr Score BlockedPawn[2] = { S(-13, -4), S(-5, 2) };
constexpr Score BlockedStorm[RANK_NB] = {
S(0, 0), S(0, 0), S(76, 78), S(-10, 15), S(-7, 10), S(-4, 6), S(-1, 2)
};
// Connected pawn bonus
- constexpr int Connected[RANK_NB] = { 0, 7, 8, 11, 24, 45, 85 };
+ constexpr int Connected[RANK_NB] = { 0, 5, 7, 11, 24, 48, 86 };
// Strength of pawn shelter for our king by [distance from edge][rank].
// RANK_1 = 0 is used for files where we have no pawn, or pawn is behind our king.
constexpr Value ShelterStrength[int(FILE_NB) / 2][RANK_NB] = {
- { V( -6), V( 81), V( 93), V( 58), V( 39), V( 18), V( 25) },
- { V(-43), V( 61), V( 35), V(-49), V(-29), V(-11), V( -63) },
- { V(-10), V( 75), V( 23), V( -2), V( 32), V( 3), V( -45) },
- { V(-39), V(-13), V(-29), V(-52), V(-48), V(-67), V(-166) }
+ { V( -5), V( 82), V( 92), V( 54), V( 36), V( 22), V( 28) },
+ { V(-44), V( 63), V( 33), V(-50), V(-30), V(-12), V( -62) },
+ { V(-11), V( 77), V( 22), V( -6), V( 31), V( 8), V( -45) },
+ { V(-39), V(-12), V(-29), V(-50), V(-43), V(-68), V(-164) }
};
// Danger of enemy pawns moving toward our king by [distance from edge][rank].
// RANK_1 = 0 is used for files where the opponent has no pawn, or their pawn
// is behind our king. Note that UnblockedStorm[0][1-2] accommodate opponent pawn
// on edge, likely blocked by our king.
constexpr Value UnblockedStorm[int(FILE_NB) / 2][RANK_NB] = {
- { V( 85), V(-289), V(-166), V(97), V(50), V( 45), V( 50) },
- { V( 46), V( -25), V( 122), V(45), V(37), V(-10), V( 20) },
- { V( -6), V( 51), V( 168), V(34), V(-2), V(-22), V(-14) },
- { V(-15), V( -11), V( 101), V( 4), V(11), V(-15), V(-29) }
+ { V( 87), V(-288), V(-168), V( 96), V( 47), V( 44), V( 46) },
+ { V( 42), V( -25), V( 120), V( 45), V( 34), V( -9), V( 24) },
+ { V( -8), V( 51), V( 167), V( 35), V( -4), V(-16), V(-12) },
+ { V(-17), V( -13), V( 100), V( 4), V( 9), V(-16), V(-31) }
};
+ // KingOnFile[semi-open Us][semi-open Them] contains bonuses/penalties
+ // for our king when it is on a semi-open or open file.
+ constexpr Score KingOnFile[2][2] = {{ S(-19,12), S(-6, 7) },
+ { S( 0, 2), S( 6,-5) }};
+
#undef S
#undef V
Square s;
bool backward, passed, doubled;
Score score = SCORE_ZERO;
- const Square* pl = pos.squares<PAWN>(Us);
+ Bitboard b = pos.pieces(Us, PAWN);
Bitboard ourPawns = pos.pieces( Us, PAWN);
Bitboard theirPawns = pos.pieces(Them, PAWN);
e->blockedCount += popcount(shift<Up>(ourPawns) & (theirPawns | doubleAttackThem));
// Loop through all pawns of the current color and score each pawn
- while ((s = *pl++) != SQ_NONE)
- {
+ while (b) {
+ s = pop_lsb(&b);
+
assert(pos.piece_on(s) == make_piece(Us, PAWN));
Rank r = relative_rank(Us, s);
if (support | phalanx)
{
int v = Connected[r] * (2 + bool(phalanx) - bool(opposed))
- + 21 * popcount(support);
+ + 22 * popcount(support);
score += make_score(v, v * (r - 2) / 4);
}
score -= Doubled * doubled
+ WeakLever * more_than_one(lever);
- if (blocked && r > RANK_4)
- score += BlockedPawn[r-4];
+ if (blocked && r >= RANK_5)
+ score += BlockedPawn[r - RANK_5];
}
return score;
bonus -= make_score(UnblockedStorm[d][theirRank], 0);
}
+ // King On File
+ bonus -= KingOnFile[pos.is_on_semiopen_file(Us, ksq)][pos.is_on_semiopen_file(Them, ksq)];
+
return bonus;
}
&& !pos.can_castle(ANY_CASTLING))
{
StateInfo st;
+ ASSERT_ALIGNED(&st, Eval::NNUE::kCacheLineSize);
+
Position p;
p.set(pos.fen(), pos.is_chess960(), &st, pos.this_thread());
Tablebases::ProbeState s1, s2;
std::memset(this, 0, sizeof(Position));
std::memset(si, 0, sizeof(StateInfo));
- std::fill_n(&pieceList[0][0], sizeof(pieceList) / sizeof(Square), SQ_NONE);
st = si;
ss >> std::noskipws;
chess960 = isChess960;
thisThread = th;
set_state(st);
+ st->accumulator.state[WHITE] = Eval::NNUE::INIT;
+ st->accumulator.state[BLACK] = Eval::NNUE::INIT;
return *this;
}
++st->pliesFromNull;
// Used by NNUE
- st->accumulator.computed_accumulation = false;
+ st->accumulator.state[WHITE] = Eval::NNUE::EMPTY;
+ st->accumulator.state[BLACK] = Eval::NNUE::EMPTY;
auto& dp = st->dirtyPiece;
dp.dirty_num = 1;
assert(!checkers());
assert(&newSt != st);
- if (Eval::useNNUE)
- {
- std::memcpy(&newSt, st, sizeof(StateInfo));
- }
- else
- std::memcpy(&newSt, st, offsetof(StateInfo, accumulator));
+ std::memcpy(&newSt, st, offsetof(StateInfo, accumulator));
newSt.previous = st;
st = &newSt;
+ st->dirtyPiece.dirty_num = 0;
+ st->dirtyPiece.piece[0] = NO_PIECE; // Avoid checks in UpdateAccumulator()
+ st->accumulator.state[WHITE] = Eval::NNUE::EMPTY;
+ st->accumulator.state[BLACK] = Eval::NNUE::EMPTY;
+
if (st->epSquare != SQ_NONE)
{
st->key ^= Zobrist::enpassant[file_of(st->epSquare)];
assert(0 && "pos_is_ok: Bitboards");
StateInfo si = *st;
+ ASSERT_ALIGNED(&si, Eval::NNUE::kCacheLineSize);
+
set_state(&si);
if (std::memcmp(&si, st, sizeof(StateInfo)))
assert(0 && "pos_is_ok: State");
for (Piece pc : Pieces)
- {
if ( pieceCount[pc] != popcount(pieces(color_of(pc), type_of(pc)))
|| pieceCount[pc] != std::count(board, board + SQUARE_NB, pc))
assert(0 && "pos_is_ok: Pieces");
- for (int i = 0; i < pieceCount[pc]; ++i)
- if (board[pieceList[pc][i]] != pc || index[pieceList[pc][i]] != i)
- assert(0 && "pos_is_ok: Index");
- }
-
for (Color c : { WHITE, BLACK })
for (CastlingRights cr : {c & KING_SIDE, c & QUEEN_SIDE})
{
bool empty(Square s) const;
template<PieceType Pt> int count(Color c) const;
template<PieceType Pt> int count() const;
- template<PieceType Pt> const Square* squares(Color c) const;
template<PieceType Pt> Square square(Color c) const;
bool is_on_semiopen_file(Color c, Square s) const;
Bitboard byTypeBB[PIECE_TYPE_NB];
Bitboard byColorBB[COLOR_NB];
int pieceCount[PIECE_NB];
- Square pieceList[PIECE_NB][16];
- int index[SQUARE_NB];
int castlingRightsMask[SQUARE_NB];
Square castlingRookSquare[CASTLING_RIGHT_NB];
Bitboard castlingPath[CASTLING_RIGHT_NB];
return count<Pt>(WHITE) + count<Pt>(BLACK);
}
-template<PieceType Pt> inline const Square* Position::squares(Color c) const {
- return pieceList[make_piece(c, Pt)];
-}
-
template<PieceType Pt> inline Square Position::square(Color c) const {
- assert(pieceCount[make_piece(c, Pt)] == 1);
- return squares<Pt>(c)[0];
+ assert(count<Pt>(c) == 1);
+ return lsb(pieces(c, Pt));
}
inline Square Position::ep_square() const {
board[s] = pc;
byTypeBB[ALL_PIECES] |= byTypeBB[type_of(pc)] |= s;
byColorBB[color_of(pc)] |= s;
- index[s] = pieceCount[pc]++;
- pieceList[pc][index[s]] = s;
+ pieceCount[pc]++;
pieceCount[make_piece(color_of(pc), ALL_PIECES)]++;
psq += PSQT::psq[pc][s];
}
inline void Position::remove_piece(Square s) {
- // WARNING: This is not a reversible operation. If we remove a piece in
- // do_move() and then replace it in undo_move() we will put it at the end of
- // the list and not in its original place, it means index[] and pieceList[]
- // are not invariant to a do_move() + undo_move() sequence.
Piece pc = board[s];
byTypeBB[ALL_PIECES] ^= s;
byTypeBB[type_of(pc)] ^= s;
byColorBB[color_of(pc)] ^= s;
/* board[s] = NO_PIECE; Not needed, overwritten by the capturing one */
- Square lastSquare = pieceList[pc][--pieceCount[pc]];
- index[lastSquare] = index[s];
- pieceList[pc][index[lastSquare]] = lastSquare;
- pieceList[pc][pieceCount[pc]] = SQ_NONE;
+ pieceCount[pc]--;
pieceCount[make_piece(color_of(pc), ALL_PIECES)]--;
psq -= PSQT::psq[pc][s];
}
inline void Position::move_piece(Square from, Square to) {
- // index[from] is not updated and becomes stale. This works as long as index[]
- // is accessed just by known occupied squares.
Piece pc = board[from];
Bitboard fromTo = from | to;
byTypeBB[ALL_PIECES] ^= fromTo;
byColorBB[color_of(pc)] ^= fromTo;
board[from] = NO_PIECE;
board[to] = pc;
- index[to] = index[from];
- pieceList[pc][index[to]] = to;
psq += PSQT::psq[pc][to] - PSQT::psq[pc][from];
}
// Razor and futility margins
constexpr int RazorMargin = 510;
Value futility_margin(Depth d, bool improving) {
- return Value(223 * (d - improving));
+ return Value(234 * (d - improving));
}
// Reductions lookup table, initialized at startup
Depth reduction(bool i, Depth d, int mn) {
int r = Reductions[d] * Reductions[mn];
- return (r + 509) / 1024 + (!i && r > 894);
+ return (r + 503) / 1024 + (!i && r > 915);
}
constexpr int futility_move_count(bool improving, Depth depth) {
uint64_t perft(Position& pos, Depth depth) {
StateInfo st;
+ ASSERT_ALIGNED(&st, Eval::NNUE::kCacheLineSize);
+
uint64_t cnt, nodes = 0;
const bool leaf = (depth == 2);
void Search::init() {
for (int i = 1; i < MAX_MOVES; ++i)
- Reductions[i] = int((22.0 + std::log(Threads.size())) * std::log(i));
+ Reductions[i] = int((21.3 + 2 * std::log(Threads.size())) * std::log(i + 0.25 * std::log(i)));
}
Time.init(Limits, us, rootPos.game_ply());
TT.new_search();
- Eval::verify_NNUE();
+ Eval::NNUE::verify();
if (rootMoves.empty())
{
beta = std::min(prev + delta, VALUE_INFINITE);
// Adjust contempt based on root move's previousScore (dynamic contempt)
- int dct = ct + (105 - ct / 2) * prev / (abs(prev) + 149);
+ int dct = ct + (113 - ct / 2) * prev / (abs(prev) + 147);
contempt = (us == WHITE ? make_score(dct, dct / 2)
: -make_score(dct, dct / 2));
// Start with a small aspiration window and, in the case of a fail
// high/low, re-search with a bigger window until we don't fail
// high/low anymore.
- int failedHighCnt = 0;
+ failedHighCnt = 0;
while (true)
{
Depth adjustedDepth = std::max(1, rootDepth - failedHighCnt - searchAgainCounter);
++failedHighCnt;
}
else
- {
- ++rootMoves[pvIdx].bestMoveCount;
break;
- }
delta += delta / 4 + 5;
totBestMoveChanges += th->bestMoveChanges;
th->bestMoveChanges = 0;
}
- double bestMoveInstability = 1 + totBestMoveChanges / Threads.size();
+ double bestMoveInstability = 1 + 2 * totBestMoveChanges / Threads.size();
- double totalTime = rootMoves.size() == 1 ? 0 :
- Time.optimum() * fallingEval * reduction * bestMoveInstability;
+ double totalTime = Time.optimum() * fallingEval * reduction * bestMoveInstability;
- // Stop the search if we have exceeded the totalTime, at least 1ms search
+ // Cap used time in case of a single legal move for a better viewer experience
+ // in tournaments, while still yielding correct scores and sufficiently fast moves.
+ if (rootMoves.size() == 1)
+ totalTime = std::min(500.0, totalTime);
+
+ // Stop the search if we have exceeded the totalTime
if (Time.elapsed() > totalTime)
{
// If we are allowed to ponder do not stop the search now but
constexpr bool PvNode = NT == PV;
const bool rootNode = PvNode && ss->ply == 0;
+ const Depth maxNextDepth = rootNode ? depth : depth + 1;
// Check if we have an upcoming move which draws by repetition, or
// if the opponent had an alternative move earlier to this position.
Move pv[MAX_PLY+1], capturesSearched[32], quietsSearched[64];
StateInfo st;
+ ASSERT_ALIGNED(&st, Eval::NNUE::kCacheLineSize);
+
TTEntry* tte;
Key posKey;
Move ttMove, move, excludedMove, bestMove;
// starts with statScore = 0. Later grandchildren start with the last calculated
// statScore of the previous grandchild. This influences the reduction rules in
// LMR which are based on the statScore of parent position.
- if (rootNode)
- (ss+4)->statScore = 0;
- else
+ if (!rootNode)
(ss+2)->statScore = 0;
// Step 4. Transposition table lookup. We don't want the score of a partial
&& (ss-1)->statScore < 22977
&& eval >= beta
&& eval >= ss->staticEval
- && ss->staticEval >= beta - 30 * depth - 28 * improving + 84 * ss->ttPv + 182
+ && ss->staticEval >= beta - 30 * depth - 28 * improving + 84 * ss->ttPv + 168
&& !excludedMove
&& pos.non_pawn_material(us)
&& (ss->ply >= thisThread->nmpMinPly || us != thisThread->nmpColor))
assert(eval - beta >= 0);
// Null move dynamic reduction based on depth and value
- Depth R = (817 + 71 * depth) / 213 + std::min(int(eval - beta) / 192, 3);
+ Depth R = (1015 + 85 * depth) / 256 + std::min(int(eval - beta) / 191, 3);
ss->currentMove = MOVE_NULL;
ss->continuationHistory = &thisThread->continuationHistory[0][0][NO_PIECE][0];
if (nullValue >= VALUE_TB_WIN_IN_MAX_PLY)
nullValue = beta;
- if (thisThread->nmpMinPly || (abs(beta) < VALUE_KNOWN_WIN && depth < 13))
+ if (thisThread->nmpMinPly || (abs(beta) < VALUE_KNOWN_WIN && depth < 14))
return nullValue;
assert(!thisThread->nmpMinPly); // Recursive verification is not allowed
}
}
- probCutBeta = beta + 176 - 49 * improving;
+ probCutBeta = beta + 183 - 49 * improving;
// Step 10. ProbCut (~10 Elo)
// If we have a good enough capture and a reduced search returns a value
// Futility pruning: parent node (~5 Elo)
if ( lmrDepth < 7
&& !ss->inCheck
- && ss->staticEval + 283 + 170 * lmrDepth <= alpha
+ && ss->staticEval + 266 + 170 * lmrDepth <= alpha
&& (*contHist[0])[movedPiece][to_sq(move)]
+ (*contHist[1])[movedPiece][to_sq(move)]
+ (*contHist[3])[movedPiece][to_sq(move)]
continue;
// Prune moves with negative SEE (~20 Elo)
- if (!pos.see_ge(move, Value(-(29 - std::min(lmrDepth, 18)) * lmrDepth * lmrDepth)))
+ if (!pos.see_ge(move, Value(-(30 - std::min(lmrDepth, 18)) * lmrDepth * lmrDepth)))
continue;
}
else
&& captureHistory[movedPiece][to_sq(move)][type_of(pos.piece_on(to_sq(move)))] < 0)
continue;
- // Futility pruning for captures
- if ( !givesCheck
- && lmrDepth < 6
- && !(PvNode && abs(bestValue) < 2)
- && PieceValue[MG][type_of(movedPiece)] >= PieceValue[MG][type_of(pos.piece_on(to_sq(move)))]
- && !ss->inCheck
- && ss->staticEval + 169 + 244 * lmrDepth
- + PieceValue[MG][type_of(pos.piece_on(to_sq(move)))] <= alpha)
- continue;
-
- // See based pruning
- if (!pos.see_ge(move, Value(-221) * depth)) // (~25 Elo)
+ // SEE based pruning
+ if (!pos.see_ge(move, Value(-213) * depth)) // (~25 Elo)
continue;
}
}
&& pos.non_pawn_material() <= 2 * RookValueMg)
extension = 1;
- // Castling extension
- if ( type_of(move) == CASTLING
- && popcount(pos.pieces(us) & ~pos.pieces(PAWN) & (to_sq(move) & KingSide ? KingSide : QueenSide)) <= 2)
- extension = 1;
-
// Late irreversible move extension
if ( move == ttMove
&& pos.rule50_count() > 80
// Step 16. Reduced depth search (LMR, ~200 Elo). If the move fails high it will be
// re-searched at full depth.
if ( depth >= 3
- && moveCount > 1 + 2 * rootNode + 2 * (PvNode && abs(bestValue) < 2)
+ && moveCount > 1 + 2 * rootNode
&& ( !captureOrPromotion
|| moveCountPruning
|| ss->staticEval + PieceValue[EG][pos.captured_piece()] <= alpha
|| cutNode
- || thisThread->ttHitAverage < 427 * TtHitAverageResolution * TtHitAverageWindow / 1024))
+ || thisThread->ttHitAverage < 432 * TtHitAverageResolution * TtHitAverageWindow / 1024))
{
Depth r = reduction(improving, depth, moveCount);
// Decrease reduction if the ttHit running average is large
- if (thisThread->ttHitAverage > 509 * TtHitAverageResolution * TtHitAverageWindow / 1024)
+ if (thisThread->ttHitAverage > 537 * TtHitAverageResolution * TtHitAverageWindow / 1024)
r--;
- // Reduction if other threads are searching this position
+ // Increase reduction if other threads are searching this position
if (th.marked())
r++;
if (ss->ttPv)
r -= 2;
+ // Increase reduction at root and non-PV nodes when the best move does not change frequently
+ if ((rootNode || !PvNode) && depth > 10 && thisThread->bestMoveChanges <= 2)
+ r++;
+
if (moveCountPruning && !formerPv)
r++;
if (ttCapture)
r++;
+ // Increase reduction at root if failing high
+ r += rootNode ? thisThread->failedHighCnt * thisThread->failedHighCnt * moveCount / 512 : 0;
+
// Increase reduction for cut nodes (~10 Elo)
if (cutNode)
r += 2;
- 5287;
// Decrease/increase reduction by comparing opponent's stat score (~10 Elo)
- if (ss->statScore >= -106 && (ss-1)->statScore < -104)
+ if (ss->statScore >= -105 && (ss-1)->statScore < -103)
r--;
- else if ((ss-1)->statScore >= -119 && ss->statScore < -140)
+ else if ((ss-1)->statScore >= -122 && ss->statScore < -129)
r++;
// Decrease/increase reduction for moves with a good/bad history (~30 Elo)
}
else
{
- // Increase reduction for captures/promotions if late move and at low depth
- if (depth < 8 && moveCount > 2)
- r++;
-
- // Unless giving check, this capture is likely bad
- if ( !givesCheck
- && ss->staticEval + PieceValue[EG][pos.captured_piece()] + 213 * depth <= alpha)
- r++;
+ // Unless giving check, this capture is likely bad
+ if ( !givesCheck
+ && ss->staticEval + PieceValue[EG][pos.captured_piece()] + 210 * depth <= alpha)
+ r++;
}
Depth d = std::clamp(newDepth - r, 1, newDepth);
int bonus = value > alpha ? stat_bonus(newDepth)
: -stat_bonus(newDepth);
- if (move == ss->killers[0])
- bonus += bonus / 4;
-
update_continuation_histories(ss, movedPiece, to_sq(move), bonus);
}
}
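The reduction adjustments above all feed a single reduction `r` that is finally clamped into `[1, newDepth]`. A small sketch of the newly added root fail-high term and of the clamp, assuming plain `int` counters (helper names are illustrative):

```cpp
#include <algorithm>
#include <cassert>

// Root fail-high term added in this diff: grows quadratically with the
// number of fail-highs at the root, scaled by the move number.
int root_fail_high_extra(int failedHighCnt, int moveCount) {
    return failedHighCnt * failedHighCnt * moveCount / 512;
}

// Final clamp of the reduced search depth, mirroring
// `Depth d = std::clamp(newDepth - r, 1, newDepth);`
int reduced_depth(int newDepth, int r) {
    return std::clamp(newDepth - r, 1, newDepth);
}
```

The clamp guarantees the reduced search never drops below one ply and never exceeds the unreduced depth, however large or negative the accumulated `r` becomes.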
(ss+1)->pv = pv;
(ss+1)->pv[0] = MOVE_NONE;
- value = -search<PV>(pos, ss+1, -beta, -alpha, newDepth, false);
+ value = -search<PV>(pos, ss+1, -beta, -alpha,
+ std::min(maxNextDepth, newDepth), false);
}
// Step 18. Undo move
Move pv[MAX_PLY+1];
StateInfo st;
+ ASSERT_ALIGNED(&st, Eval::NNUE::kCacheLineSize);
+
TTEntry* tte;
Key posKey;
Move ttMove, move, bestMove;
if (PvNode && bestValue > alpha)
alpha = bestValue;
- futilityBase = bestValue + 145;
+ futilityBase = bestValue + 155;
}
const PieceToHistory* contHist[] = { (ss-1)->continuationHistory, (ss-2)->continuationHistory,
moveCount++;
// Futility pruning
- if ( !ss->inCheck
+ if ( bestValue > VALUE_TB_LOSS_IN_MAX_PLY
&& !givesCheck
&& futilityBase > -VALUE_KNOWN_WIN
&& !pos.advanced_pawn_push(move))
}
// Do not search moves with negative SEE values
- if ( !ss->inCheck
- && !(givesCheck && pos.is_discovery_check_on_king(~pos.side_to_move(), move))
+ if ( bestValue > VALUE_TB_LOSS_IN_MAX_PLY
&& !pos.see_ge(move))
continue;
[pos.moved_piece(move)]
[to_sq(move)];
+ // CounterMove based pruning
if ( !captureOrPromotion
- && moveCount
+ && bestValue > VALUE_TB_LOSS_IN_MAX_PLY
&& (*contHist[0])[pos.moved_piece(move)][to_sq(move)] < CounterMovePruneThreshold
&& (*contHist[1])[pos.moved_piece(move)][to_sq(move)] < CounterMovePruneThreshold)
continue;
// All legal moves have been searched. A special case: if we're in check
// and no legal moves were found, it is checkmate.
if (ss->inCheck && bestValue == -VALUE_INFINITE)
+ {
+ assert(!MoveList<LEGAL>(pos).size());
+
return mated_in(ss->ply); // Plies to mate from the root
+ }
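`mated_in(ss->ply)` folds the distance from the root into the score itself, so shorter mates compare as better. A self-contained sketch using Stockfish's `VALUE_MATE` constant of 32000:

```cpp
#include <cassert>

// Mate scores encode plies from the root: being mated later (larger ply)
// is a less negative, i.e. better, score; mating sooner is a higher score.
constexpr int VALUE_MATE = 32000;

constexpr int mate_in(int ply)  { return  VALUE_MATE - ply; }
constexpr int mated_in(int ply) { return -VALUE_MATE + ply; }
```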
tte->save(posKey, value_to_tt(bestValue, ss->ply), pvHit,
bestValue >= beta ? BOUND_LOWER :
bool RootMove::extract_ponder_from_tt(Position& pos) {
StateInfo st;
+ ASSERT_ALIGNED(&st, Eval::NNUE::kCacheLineSize);
+
bool ttHit;
assert(pv.size() == 1);
Value previousScore = -VALUE_INFINITE;
int selDepth = 0;
int tbRank = 0;
- int bestMoveCount = 0;
Value tbScore;
std::vector<Move> pv;
};
votes[th->rootMoves[0].pv[0]] +=
(th->rootMoves[0].score - minScore + 14) * int(th->completedDepth);
- if (abs(bestThread->rootMoves[0].score) >= VALUE_TB_WIN_IN_MAX_PLY)
- {
- // Make sure we pick the shortest mate / TB conversion or stave off mate the longest
- if (th->rootMoves[0].score > bestThread->rootMoves[0].score)
- bestThread = th;
- }
- else if ( th->rootMoves[0].score >= VALUE_TB_WIN_IN_MAX_PLY
- || ( th->rootMoves[0].score > VALUE_TB_LOSS_IN_MAX_PLY
- && votes[th->rootMoves[0].pv[0]] > votes[bestThread->rootMoves[0].pv[0]]))
- bestThread = th;
+ if (abs(bestThread->rootMoves[0].score) >= VALUE_TB_WIN_IN_MAX_PLY)
+ {
+ // Make sure we pick the shortest mate / TB conversion or stave off mate the longest
+ if (th->rootMoves[0].score > bestThread->rootMoves[0].score)
+ bestThread = th;
+ }
+ else if ( th->rootMoves[0].score >= VALUE_TB_WIN_IN_MAX_PLY
+ || ( th->rootMoves[0].score > VALUE_TB_LOSS_IN_MAX_PLY
+ && votes[th->rootMoves[0].pv[0]] > votes[bestThread->rootMoves[0].pv[0]]))
+ bestThread = th;
}
return bestThread;
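For context, the vote each thread casts for its best root move (which the re-indented block above selects from) is a simple product of score margin and completed depth. A sketch with illustrative names:

```cpp
#include <cassert>

// Vote weight per thread: score advantage over the worst thread's score
// (plus a small constant offset of 14) times the depth that thread
// completed, so deeper and better-scoring threads carry more weight.
int vote_weight(int score, int minScore, int completedDepth) {
    return (score - minScore + 14) * completedDepth;
}
```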
CapturePieceToHistory captureHistory;
ContinuationHistory continuationHistory[2][2];
Score contempt;
+ int failedHighCnt;
};
// game time for the current move, so also cap to 20% of available game time.
if (limits.movestogo == 0)
{
- optScale = std::min(0.008 + std::pow(ply + 3.0, 0.5) / 250.0,
+ optScale = std::min(0.0084 + std::pow(ply + 3.0, 0.5) * 0.0042,
0.2 * limits.time[us] / double(timeLeft));
maxScale = std::min(7.0, 4.0 + ply / 12.0);
}
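The retuned sudden-death time scale can be sketched as a standalone function: both constants grew by 5% (0.008 to 0.0084, 1/250 to 0.0042), still capped at 20% of the remaining time. Parameter names below are stand-ins for the real `limits` fields:

```cpp
#include <algorithm>
#include <cmath>
#include <cassert>

// Fraction of the remaining time to allocate as optimum time when no
// moves-to-go is given: grows slowly with game ply, hard-capped at 20%.
double opt_scale(int ply, double myTime, double timeLeft) {
    return std::min(0.0084 + std::pow(ply + 3.0, 0.5) * 0.0042,
                    0.2 * myTime / timeLeft);
}
```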
Threads.main()->wait_for_search_finished();
- aligned_ttmem_free(mem);
+ aligned_large_pages_free(table);
clusterCount = mbSize * 1024 * 1024 / sizeof(Cluster);
- table = static_cast<Cluster*>(aligned_ttmem_alloc(clusterCount * sizeof(Cluster), mem));
- if (!mem)
+
+ table = static_cast<Cluster*>(aligned_large_pages_alloc(clusterCount * sizeof(Cluster)));
+ if (!table)
{
std::cerr << "Failed to allocate " << mbSize
<< "MB for transposition table." << std::endl;
static_assert(sizeof(Cluster) == 32, "Unexpected Cluster size");
public:
- ~TranspositionTable() { aligned_ttmem_free(mem); }
+ ~TranspositionTable() { aligned_large_pages_free(table); }
void new_search() { generation8 += 8; } // Lower 3 bits are used by PV flag and Bound
TTEntry* probe(const Key key, bool& found) const;
int hashfull() const;
size_t clusterCount;
Cluster* table;
- void* mem;
uint8_t generation8; // Size must be not bigger than TTEntry::genBound8
};
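The API change above replaces `aligned_ttmem_alloc` and its out-parameter `mem` with `aligned_large_pages_alloc`, which returns the usable pointer directly and is freed with `aligned_large_pages_free`. A minimal portable stand-in for that shape, assuming C++17 `std::aligned_alloc` and no real large-page support:

```cpp
#include <cstdlib>
#include <cstdint>
#include <cassert>

// Stand-in for aligned_large_pages_alloc()/free(): one pointer in, one
// pointer out, nullptr on failure; no separate `mem` handle to track.
void* aligned_pages_alloc_fallback(std::size_t size) {
    constexpr std::size_t alignment = 4096;            // page size
    size = (size + alignment - 1) & ~(alignment - 1);  // round up: aligned_alloc
                                                       // requires size % alignment == 0
    return std::aligned_alloc(alignment, size);
}

void aligned_pages_free_fallback(void* ptr) {
    std::free(ptr);  // free(nullptr) is a no-op, like the real API
}
```

Returning the table pointer directly is what lets the diff delete the `void* mem;` member and test `if (!table)` instead of `if (!mem)`.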
/// _WIN32 Building on Windows (any)
/// _WIN64 Building on Windows 64 bit
+#if defined(__GNUC__ ) && (__GNUC__ < 9 || (__GNUC__ == 9 && __GNUC_MINOR__ <= 2)) && defined(_WIN32) && !defined(__clang__)
+#define ALIGNAS_ON_STACK_VARIABLES_BROKEN
+#endif
+
+#define ASSERT_ALIGNED(ptr, alignment) assert(reinterpret_cast<uintptr_t>(ptr) % alignment == 0)
+
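`ASSERT_ALIGNED` as added above is a plain modulo check on the pointer value. A usage sketch on a stack variable, where 64 stands in for `Eval::NNUE::kCacheLineSize`:

```cpp
#include <cassert>
#include <cstdint>

// Reproduces the macro from the diff and exercises it on a variable
// whose alignment is forced with alignas.
#define ASSERT_ALIGNED(ptr, alignment) assert(reinterpret_cast<uintptr_t>(ptr) % alignment == 0)

void check_alignment_demo() {
    alignas(64) static char buffer[64];
    ASSERT_ALIGNED(&buffer, 64);  // holds: buffer is 64-byte aligned
}
```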
#if defined(_WIN64) && defined(_MSC_VER) // No Makefile used
# include <intrin.h> // Microsoft header for _BitScanForward64()
# define IS_64BIT
enum Piece {
NO_PIECE,
- W_PAWN = 1, W_KNIGHT, W_BISHOP, W_ROOK, W_QUEEN, W_KING,
- B_PAWN = 9, B_KNIGHT, B_BISHOP, B_ROOK, B_QUEEN, B_KING,
+ W_PAWN = PAWN, W_KNIGHT, W_BISHOP, W_ROOK, W_QUEEN, W_KING,
+ B_PAWN = PAWN + 8, B_KNIGHT, B_BISHOP, B_ROOK, B_QUEEN, B_KING,
PIECE_NB = 16
};
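The rewritten enum makes the long-standing encoding explicit: a `Piece` is its `PieceType` plus 8 for black, i.e. the color lives in bit 3. A self-contained sketch with a minimal `PieceType` (values matching Stockfish's):

```cpp
#include <cassert>

// Minimal PieceType so the rewritten Piece enum is self-contained;
// PAWN == 1 through KING == 6, as in Stockfish.
enum PieceType { NO_PIECE_TYPE, PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING };

enum Piece {
    NO_PIECE,
    W_PAWN = PAWN,     W_KNIGHT, W_BISHOP, W_ROOK, W_QUEEN, W_KING,
    B_PAWN = PAWN + 8, B_KNIGHT, B_BISHOP, B_ROOK, B_QUEEN, B_KING,
    PIECE_NB = 16
};
```

The numeric values are unchanged (`W_PAWN` is still 1, `B_PAWN` still 9); the rewrite only spells out where they come from, so `pc & 7` recovers the type and bit 3 the color.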
Position p;
p.set(pos.fen(), Options["UCI_Chess960"], &states->back(), Threads.main());
- Eval::verify_NNUE();
+ Eval::NNUE::verify();
sync_cout << "\n" << Eval::trace(p) << sync_endl;
}
if (token == "go" || token == "eval")
{
- cerr << "\nPosition: " << cnt++ << '/' << num << endl;
+ cerr << "\nPosition: " << cnt++ << '/' << num << " (" << pos.fen() << ")" << endl;
if (token == "go")
{
go(pos, is, states);
void on_logger(const Option& o) { start_logger(o); }
void on_threads(const Option& o) { Threads.set(size_t(o)); }
void on_tb_path(const Option& o) { Tablebases::init(o); }
-void on_use_NNUE(const Option& ) { Eval::init_NNUE(); }
-void on_eval_file(const Option& ) { Eval::init_NNUE(); }
+void on_use_NNUE(const Option& ) { Eval::NNUE::init(); }
+void on_eval_file(const Option& ) { Eval::NNUE::init(); }
void on_rpc_server_address(const Option& o) {
if (hash_probe_thread) {
hash_probe_thread->Shutdown();
--valgrind-thread)
echo "valgrind-thread testing started"
prefix=''
- exeprefix='valgrind --error-exitcode=42'
+ exeprefix='valgrind --fair-sched=try --error-exitcode=42'
postfix='1>/dev/null'
threads="2"
;;