--- /dev/null
+/*
+ Stockfish, a UCI chess playing engine derived from Glaurung 2.1
+ Copyright (C) 2004-2020 The Stockfish developers (see AUTHORS file)
+
+ Stockfish is free software: you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ Stockfish is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <http://www.gnu.org/licenses/>.
+*/
+
+#ifdef USE_MPI
+
+#include <algorithm>
+#include <array>
+#include <atomic>
+#include <cassert>
+#include <chrono>
+#include <cstddef>
+#include <cstdint>
+#include <cstdlib>
+#include <iostream>
+#include <istream>
+#include <map>
+#include <mpi.h>
+#include <mutex>
+#include <string>
+#include <thread>
+#include <vector>
+
+#include "cluster.h"
+#include "thread.h"
+#include "tt.h"
+#include "timeman.h"
+
+namespace Stockfish {
+namespace Cluster {
+
+// Total number of ranks and rank within the communicator
+static int world_rank = MPI_PROC_NULL;
+static int world_size = 0;
+
+// Ranks exchange basic info (signals) using a dedicated communicator
+static MPI_Comm signalsComm = MPI_COMM_NULL;
+static MPI_Request reqSignals = MPI_REQUEST_NULL;
+static uint64_t signalsCallCounter = 0;
+
+// Signals are the number of nodes searched, the stop flag, tablebase hits, and transposition table saves
+enum Signals : int { SIG_NODES = 0, SIG_STOP = 1, SIG_TB = 2, SIG_TTS = 3, SIG_NB = 4};
+static uint64_t signalsSend[SIG_NB] = {};
+static uint64_t signalsRecv[SIG_NB] = {};
+static uint64_t nodesSearchedOthers = 0;
+static uint64_t tbHitsOthers = 0;
+static uint64_t TTsavesOthers = 0;
+static uint64_t stopSignalsPosted = 0;
+
+// The UCI threads of the ranks exchange input using a dedicated communicator
+static MPI_Comm InputComm = MPI_COMM_NULL;
+
+// bestMove requires MoveInfo communicators and data types
+static MPI_Comm MoveComm = MPI_COMM_NULL;
+static MPI_Datatype MIDatatype = MPI_DATATYPE_NULL;
+
+// TT entries are communicated with a dedicated communicator.
+// Two send/receive buffers alternate and gather information from all ranks.
+// The TTCacheCounter tracks the number of local elements that are ready to be sent.
+static MPI_Comm TTComm = MPI_COMM_NULL;
+static std::array<std::vector<KeyedTTEntry>, 2> TTSendRecvBuffs;
+static std::array<MPI_Request, 2> reqsTTSendRecv = {MPI_REQUEST_NULL, MPI_REQUEST_NULL};
+static uint64_t sendRecvPosted = 0;
+static std::atomic<uint64_t> TTCacheCounter = {};
+
+/// Initialize MPI and associated data types. Note that the MPI library must be configured
+/// to support MPI_THREAD_MULTIPLE, since multiple threads access MPI simultaneously.
+void init() {
+
+ int thread_support;
+ MPI_Init_thread(nullptr, nullptr, MPI_THREAD_MULTIPLE, &thread_support);
+ if (thread_support < MPI_THREAD_MULTIPLE)
+ {
+ std::cerr << "Stockfish requires support for MPI_THREAD_MULTIPLE."
+ << std::endl;
+ std::exit(EXIT_FAILURE);
+ }
+
+ MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
+ MPI_Comm_size(MPI_COMM_WORLD, &world_size);
+
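+ // Build an MPI datatype that matches the layout of MoveInfo (five int fields),
+ // so that MoveInfo structs can be gathered and broadcast directly.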
+ const std::array<MPI_Aint, 5> MIdisps = {offsetof(MoveInfo, move),
+ offsetof(MoveInfo, ponder),
+ offsetof(MoveInfo, depth),
+ offsetof(MoveInfo, score),
+ offsetof(MoveInfo, rank)};
+ MPI_Type_create_hindexed_block(5, 1, MIdisps.data(), MPI_INT, &MIDatatype);
+ MPI_Type_commit(&MIDatatype);
+
+ MPI_Comm_dup(MPI_COMM_WORLD, &InputComm);
+ MPI_Comm_dup(MPI_COMM_WORLD, &TTComm);
+ MPI_Comm_dup(MPI_COMM_WORLD, &MoveComm);
+ MPI_Comm_dup(MPI_COMM_WORLD, &signalsComm);
+}
+
+/// Finalize MPI and free the associated data types.
+void finalize() {
+
+ MPI_Type_free(&MIDatatype);
+
+ MPI_Comm_free(&InputComm);
+ MPI_Comm_free(&TTComm);
+ MPI_Comm_free(&MoveComm);
+ MPI_Comm_free(&signalsComm);
+
+ MPI_Finalize();
+}
+
+/// Return the total number of ranks
+int size() {
+
+ return world_size;
+}
+
+/// Return the rank (index) of the process
+int rank() {
+
+ return world_rank;
+}
+
+/// The send/receive buffers depend on the number of MPI ranks and threads, resize as needed
+void ttSendRecvBuff_resize(size_t nThreads) {
+
+ for (int i : {0, 1})
+ {
+ TTSendRecvBuffs[i].resize(TTCacheSize * world_size * nThreads);
+ std::fill(TTSendRecvBuffs[i].begin(), TTSendRecvBuffs[i].end(), KeyedTTEntry());
+ }
+}
+
+/// As input is only received by the root (rank 0) of the cluster, this input must be relayed
+/// to the UCI threads of all ranks, in order to set up the position, etc. We do this with a
+/// dedicated getline implementation, where the root broadcasts the received information to
+/// all other ranks.
+bool getline(std::istream& input, std::string& str) {
+
+ int size;
+ std::vector<char> vec;
+ int state;
+
+ if (is_root())
+ {
+ state = static_cast<bool>(std::getline(input, str));
+ vec.assign(str.begin(), str.end());
+ size = vec.size();
+ }
+
+ // Some MPI implementations use busy-wait polling, while we need yielding: otherwise
+ // the UCI thread on the non-root ranks would keep consuming resources.
+ static MPI_Request reqInput = MPI_REQUEST_NULL;
+ MPI_Ibcast(&size, 1, MPI_INT, 0, InputComm, &reqInput);
+ if (is_root())
+ MPI_Wait(&reqInput, MPI_STATUS_IGNORE);
+ else
+ {
+ while (true)
+ {
+ int flag;
+ MPI_Test(&reqInput, &flag, MPI_STATUS_IGNORE);
+ if (flag)
+ break;
+ else
+ std::this_thread::sleep_for(std::chrono::milliseconds(10));
+ }
+ }
+
+ // Broadcast received string
+ if (!is_root())
+ vec.resize(size);
+ MPI_Bcast(vec.data(), size, MPI_CHAR, 0, InputComm);
+ if (!is_root())
+ str.assign(vec.begin(), vec.end());
+ MPI_Bcast(&state, 1, MPI_INT, 0, InputComm);
+
+ return state;
+}
+
+/// Sending part of the signal communication loop
+void signals_send() {
+
+ signalsSend[SIG_NODES] = Threads.nodes_searched();
+ signalsSend[SIG_TB] = Threads.tb_hits();
+ signalsSend[SIG_TTS] = Threads.TT_saves();
+ signalsSend[SIG_STOP] = Threads.stop;
+ MPI_Iallreduce(signalsSend, signalsRecv, SIG_NB, MPI_UINT64_T,
+ MPI_SUM, signalsComm, &reqSignals);
+ ++signalsCallCounter;
+}
+
+/// Processing part of the signal communication loop.
+/// For some counters (e.g. nodes) we only keep the sum over the other ranks,
+/// so local counters can be added at any time for more fine-grained progress reports.
+/// This is useful to indicate progress during early iterations, and to have
+/// node counts that exactly match the non-MPI code in the single-rank case.
+/// This call also propagates the stop signal between ranks.
+void signals_process() {
+
+ nodesSearchedOthers = signalsRecv[SIG_NODES] - signalsSend[SIG_NODES];
+ tbHitsOthers = signalsRecv[SIG_TB] - signalsSend[SIG_TB];
+ TTsavesOthers = signalsRecv[SIG_TTS] - signalsSend[SIG_TTS];
+ stopSignalsPosted = signalsRecv[SIG_STOP];
+ if (signalsRecv[SIG_STOP] > 0)
+ Threads.stop = true;
+}
+
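+/// Post the next round of the TT entry exchange. Entries travel around a ring: each rank
+/// receives into one buffer from its left neighbour while sending the previous round's
+/// buffer to its right neighbour. The two buffers alternate (indexed by sendRecvPosted % 2),
+/// so filling one buffer can overlap with communicating the other.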
+void sendrecv_post() {
+
+ ++sendRecvPosted;
+ MPI_Irecv(TTSendRecvBuffs[sendRecvPosted % 2].data(),
+ TTSendRecvBuffs[sendRecvPosted % 2].size() * sizeof(KeyedTTEntry), MPI_BYTE,
+ (rank() + size() - 1) % size(), 42, TTComm, &reqsTTSendRecv[0]);
+ MPI_Isend(TTSendRecvBuffs[(sendRecvPosted + 1) % 2].data(),
+ TTSendRecvBuffs[(sendRecvPosted + 1) % 2].size() * sizeof(KeyedTTEntry), MPI_BYTE,
+ (rank() + 1 ) % size(), 42, TTComm, &reqsTTSendRecv[1]);
+}
+
+/// During search, most message passing is asynchronous, but at the end of the
+/// search it makes sense to bring all outstanding messages to a common, finalized state.
+void signals_sync() {
+
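+ // Wait until a stop signal has been posted by every rank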
+ while (stopSignalsPosted < uint64_t(size()))
+ signals_poll();
+
+ // Finalize outstanding messages of the signal loops.
+ // We might have issued one call less than needed on some ranks.
+ uint64_t globalCounter;
+ MPI_Allreduce(&signalsCallCounter, &globalCounter, 1, MPI_UINT64_T, MPI_MAX, MoveComm);
+ if (signalsCallCounter < globalCounter)
+ {
+ MPI_Wait(&reqSignals, MPI_STATUS_IGNORE);
+ signals_send();
+ }
+ assert(signalsCallCounter == globalCounter);
+ MPI_Wait(&reqSignals, MPI_STATUS_IGNORE);
+ signals_process();
+
+ // Finalize outstanding messages in the sendRecv loop
+ MPI_Allreduce(&sendRecvPosted, &globalCounter, 1, MPI_UINT64_T, MPI_MAX, MoveComm);
+ while (sendRecvPosted < globalCounter)
+ {
+ MPI_Waitall(reqsTTSendRecv.size(), reqsTTSendRecv.data(), MPI_STATUSES_IGNORE);
+ sendrecv_post();
+ }
+ assert(sendRecvPosted == globalCounter);
+ MPI_Waitall(reqsTTSendRecv.size(), reqsTTSendRecv.data(), MPI_STATUSES_IGNORE);
+
+}
+
+/// Initialize signal counters to zero.
+void signals_init() {
+
+ stopSignalsPosted = tbHitsOthers = TTsavesOthers = nodesSearchedOthers = 0;
+
+ signalsSend[SIG_NODES] = signalsRecv[SIG_NODES] = 0;
+ signalsSend[SIG_TB] = signalsRecv[SIG_TB] = 0;
+ signalsSend[SIG_TTS] = signalsRecv[SIG_TTS] = 0;
+ signalsSend[SIG_STOP] = signalsRecv[SIG_STOP] = 0;
+
+}
+
+/// Poll the signal loop, and start next round as needed.
+void signals_poll() {
+
+ int flag;
+ MPI_Test(&reqSignals, &flag, MPI_STATUS_IGNORE);
+ if (flag)
+ {
+ signals_process();
+ signals_send();
+ }
+}
+
+/// Provide basic info related to cluster performance: the number of signals sent and
+/// signals per second (sps), the number of send/receives posted and the entries exchanged
+/// per second (srpps), and the number of TT saves and TT saves per second (TTSavesps).
+/// If srpps approximately equals TTSavesps, the exchange loop has enough bandwidth.
+void cluster_info(Depth depth) {
+
+ TimePoint elapsed = Time.elapsed() + 1;
+ uint64_t TTSaves = TT_saves();
+
+ sync_cout << "info depth " << depth << " cluster "
+ << " signals " << signalsCallCounter << " sps " << signalsCallCounter * 1000 / elapsed
+ << " sendRecvs " << sendRecvPosted << " srpps " << TTSendRecvBuffs[0].size() * sendRecvPosted * 1000 / elapsed
+ << " TTSaves " << TTSaves << " TTSavesps " << TTSaves * 1000 / elapsed
+ << sync_endl;
+}
+
+/// When a TT entry is saved, additional steps are taken if the entry is of sufficient depth.
+/// If sufficient entries have been collected, a communication is initiated.
+/// If a communication has been completed, the received results are saved to the TT.
+void save(Thread* thread, TTEntry* tte,
+ Key k, Value v, bool PvHit, Bound b, Depth d, Move m, Value ev) {
+
+ // Standard save to the TT
+ tte->save(k, v, PvHit, b, d, m, ev);
+
+ // If the entry is of sufficient depth to be worth communicating, take action.
+ if (d > 3)
+ {
+ // Count the TT saves for informational purposes: this number should be
+ // relatively similar to the number of entries we can send/recv.
+ thread->TTsaves.fetch_add(1, std::memory_order_relaxed);
+
+ // Add to thread's send buffer, the locking here avoids races when the master thread
+ // prepares the send buffer.
+ {
+ std::lock_guard<std::mutex> lk(thread->ttCache.mutex);
+ thread->ttCache.buffer.replace(KeyedTTEntry(k,*tte));
+ ++TTCacheCounter;
+ }
+
+ size_t recvBuffPerRankSize = Threads.size() * TTCacheSize;
+
+ // Communicate on the main search thread, as soon as the threads combined have
+ // collected sufficient data to fill the send buffers.
+ if (thread == Threads.main() && TTCacheCounter > recvBuffPerRankSize)
+ {
+ // Test communication status
+ int flag;
+ MPI_Testall(reqsTTSendRecv.size(), reqsTTSendRecv.data(), &flag, MPI_STATUSES_IGNORE);
+
+ // Current communication is complete
+ if (flag)
+ {
+ // Save all received entries to TT, and store our TTCaches, ready for the next round of communication
+ for (size_t irank = 0; irank < size_t(size()); ++irank)
+ {
+ if (irank == size_t(rank())) // this is our part, fill the part of the buffer for sending
+ {
+ // Copy from the thread caches to the right spot in the buffer
+ size_t i = irank * recvBuffPerRankSize;
+ for (auto&& th : Threads)
+ {
+ std::lock_guard<std::mutex> lk(th->ttCache.mutex);
+
+ for (auto&& e : th->ttCache.buffer)
+ TTSendRecvBuffs[sendRecvPosted % 2][i++] = e;
+
+ // Reset thread's send buffer
+ th->ttCache.buffer = {};
+ }
+
+ TTCacheCounter = 0;
+ }
+ else // process data received from the corresponding rank.
+ for (size_t i = irank * recvBuffPerRankSize; i < (irank + 1) * recvBuffPerRankSize; ++i)
+ {
+ auto&& e = TTSendRecvBuffs[sendRecvPosted % 2][i];
+ bool found;
+ TTEntry* replace_tte = TT.probe(e.first, found);
+ replace_tte->save(e.first, e.second.value(), e.second.is_pv(), e.second.bound(), e.second.depth(),
+ e.second.move(), e.second.eval());
+ }
+ }
+
+ // Start next communication
+ sendrecv_post();
+
+ // Force a time check on the next occasion; the above actions might have taken some time.
+ static_cast<MainThread*>(thread)->callsCnt = 0;
+
+ }
+ }
+ }
+}
+
+/// Picks the bestMove across ranks, and sends the associated info and PV to the root of the
+/// cluster. Note that this bestMove and PV must be output by the root, to guarantee proper
+/// ordering of the output.
+/// TODO update to the scheme in master.. can this use aggregation of votes?
+void pick_moves(MoveInfo& mi, std::string& PVLine) {
+
+ MoveInfo* pMoveInfo = NULL;
+ if (is_root())
+ {
+ pMoveInfo = (MoveInfo*)malloc(sizeof(MoveInfo) * size());
+ }
+ MPI_Gather(&mi, 1, MIDatatype, pMoveInfo, 1, MIDatatype, 0, MoveComm);
+
+ if (is_root())
+ {
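+ // Each rank's move receives a vote of (score - minScore + depth);
+ // the move with the highest total vote wins.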
+ std::map<int, int> votes;
+ int minScore = pMoveInfo[0].score;
+ for (int i = 0; i < size(); ++i)
+ {
+ minScore = std::min(minScore, pMoveInfo[i].score);
+ votes[pMoveInfo[i].move] = 0;
+ }
+ for (int i = 0; i < size(); ++i)
+ {
+ votes[pMoveInfo[i].move] += pMoveInfo[i].score - minScore + pMoveInfo[i].depth;
+ }
+ int bestVote = votes[pMoveInfo[0].move];
+ for (int i = 0; i < size(); ++i)
+ {
+ if (votes[pMoveInfo[i].move] > bestVote)
+ {
+ bestVote = votes[pMoveInfo[i].move];
+ mi = pMoveInfo[i];
+ }
+ }
+ free(pMoveInfo);
+ }
+
+ // Send around the final result
+ MPI_Bcast(&mi, 1, MIDatatype, 0, MoveComm);
+
+ // Send PV line to root as needed
+ if (mi.rank != 0 && mi.rank == rank())
+ {
+ int size;
+ std::vector<char> vec;
+ vec.assign(PVLine.begin(), PVLine.end());
+ size = vec.size();
+ MPI_Send(&size, 1, MPI_INT, 0, 42, MoveComm);
+ MPI_Send(vec.data(), size, MPI_CHAR, 0, 42, MoveComm);
+ }
+ if (mi.rank != 0 && is_root())
+ {
+ int size;
+ std::vector<char> vec;
+ MPI_Recv(&size, 1, MPI_INT, mi.rank, 42, MoveComm, MPI_STATUS_IGNORE);
+ vec.resize(size);
+ MPI_Recv(vec.data(), size, MPI_CHAR, mi.rank, 42, MoveComm, MPI_STATUS_IGNORE);
+ PVLine.assign(vec.begin(), vec.end());
+ }
+
+}
+
+/// Return nodes searched (lazily updated cluster wide in the signal loop)
+uint64_t nodes_searched() {
+
+ return nodesSearchedOthers + Threads.nodes_searched();
+}
+
+/// Return table base hits (lazily updated cluster wide in the signal loop)
+uint64_t tb_hits() {
+
+ return tbHitsOthers + Threads.tb_hits();
+}
+
+/// Return the number of saves to the TT buffers (lazily updated cluster wide in the signal loop)
+uint64_t TT_saves() {
+
+ return TTsavesOthers + Threads.TT_saves();
+}
+
+
+} // namespace Cluster
+} // namespace Stockfish
+
+#else
+
+#include "cluster.h"
+#include "thread.h"
+
+namespace Stockfish {
+namespace Cluster {
+
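+// Without MPI, the cluster counters are simply the local thread counters.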
+uint64_t nodes_searched() {
+
+ return Threads.nodes_searched();
+}
+
+uint64_t tb_hits() {
+
+ return Threads.tb_hits();
+}
+
+uint64_t TT_saves() {
+
+ return Threads.TT_saves();
+}
+
+} // namespace Cluster
+} // namespace Stockfish
+
+#endif // USE_MPI