This document provides a detailed description of the QA process. It is intended to be used by engineers reproducing the experimental setup for future tests of CometBFT.
The first iteration of the QA process, as described in the RELEASES.md document, was applied to version v0.34.x to obtain a set of results that act as a benchmarking baseline. Results obtained in later versions are then compared with this baseline.
Of the testnet-based test cases described in the releases document, we focused on two: the 200 Node Test and the Rotating Nodes Test.
This section explains how the 200 Node Test was carried out, for reproducibility purposes.
1. Follow the instructions in the README.md at the top of the testnet repository to configure Terraform and doctl.
2. Copy file testnets/testnet200.toml onto testnet.toml (do NOT commit this change).
3. Set the variable VERSION_TAG in the Makefile to the git hash that is to be tested.
   * For a homogeneous network (all nodes running the same version), make sure the makefile variable VERSION2_WEIGHT is set to 0.
   * For a mixed network, set the variable VERSION2_TAG to the other version you want deployed
     in the network.
     Then adjust the weight variables VERSION_WEIGHT and VERSION2_WEIGHT to configure the
     desired proportion of nodes running each of the two configured versions.
4. Follow the README.md to configure and start the 200 node testnet.
   * Do NOT forget to run make terraform-destroy as soon as you are done with the tests (see step 9).
5. As a sanity check, connect to the Prometheus node's web interface (port 9090)
   and check the graph for the cometbft_consensus_height metric. All nodes
   should be increasing their heights.
   * The Prometheus node's IP address can be found in ansible/hosts under section [prometheus].
   * The following URL will display the metrics cometbft_consensus_height and cometbft_mempool_size:

         http://<PROMETHEUS-NODE-IP>:9090/classic/graph?g0.range_input=1h&g0.expr=cometbft_consensus_height&g0.tab=0&g1.range_input=1h&g1.expr=cometbft_mempool_size&g1.tab=0
6. Start the load runner that will produce the transaction load.
   * Run make loadrunners-init. This will copy the loader scripts to the
     testnet-load-runner node and install the load tool.
   * Find the IP address of the testnet-load-runner node in
     ansible/hosts under section [loadrunners].
   * ssh into testnet-load-runner.
   * Edit the script /root/200-node-loadscript.sh in the load runner
     node to provide the IP address of a full node (for example,
     validator000). This node will receive all transactions from the
     load runner node.
   * Run /root/200-node-loadscript.sh from the load runner node.
     Consider running it inside tmux in case the ssh session breaks.
   * Alternatively, to apply a fixed load with make runload, set the makefile variables LOAD_CONNECTIONS and LOAD_TX_RATE to values that will produce the desired transaction load, and set LOAD_TOTAL_TIME to 90 (seconds).
7. Run make retrieve-data to gather all relevant data from the testnet into the orchestrating machine.
   * Alternatively, you may run make retrieve-prometheus-data and make retrieve-blockstore separately.
     The end result will be the same.
   * make retrieve-blockstore accepts the following values in the makefile variable RETRIEVE_TARGET_HOST:
     * any (the default): picks a full node and retrieves the blockstore from that node only.
     * all: retrieves the blockstore from all full nodes; this is extremely slow and consumes plenty of bandwidth,
       so use it with care.
     * the name of a particular full node (e.g., validator01): retrieves the blockstore from that node only.
8. Verify that the data was collected without errors. For extra care, run zip -T on the prometheus.zip file and (one of) the blockstore.db.zip file(s).
9. Run make terraform-destroy.
   * Don't forget to type yes! Otherwise you're in trouble.

The method for extracting the results described here is highly manual (and exploratory) at this stage. The CometBFT team should improve it at every iteration to increase the amount of automation.
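One candidate for that automation is the zip -T integrity check used when verifying the retrieved data. The following is a minimal sketch, assuming Python 3.8 or later. It relies only on the standard zipfile module, whose ZipFile.testzip() reads every member of an archive and returns the name of the first one with a bad CRC or header, or None. The script name check_archives.py is hypothetical.

```python
"""Sketch: verify retrieved archives, similar in spirit to `zip -T`."""
from __future__ import annotations

import sys
import zipfile
import zlib
from pathlib import Path


def check_archive(path: Path) -> str | None:
    """Return None if the archive is intact, otherwise a short error message."""
    try:
        with zipfile.ZipFile(path) as zf:
            bad_member = zf.testzip()  # reads every member, checking CRCs and headers
    except (OSError, EOFError, zlib.error, zipfile.BadZipFile) as exc:
        # missing file, not a zip archive, truncated or corrupt compressed data
        return f"{path}: {exc}"
    if bad_member is not None:
        return f"{path}: corrupt member {bad_member}"
    return None


def main(paths: list[str]) -> int:
    """Check every archive given on the command line; return a process exit code."""
    errors = [err for p in paths if (err := check_archive(Path(p))) is not None]
    for err in errors:
        print(err, file=sys.stderr)
    print("all archives OK" if not errors else f"{len(errors)} archive(s) failed")
    return 1 if errors else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g. python check_archives.py prometheus.zip blockstore.db.zip
    sys.exit(main(sys.argv[1:]))
```

The non-zero exit code makes it easy to chain the check before make terraform-destroy in a shell script, so that the testnet is not destroyed while the only copy of the data is corrupt.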
1. Unzip the blockstore and run the following commands from the directory containing the blockstore.db folder.
   Consider updating the version (@3003ef7) in the go run command to the latest possible.

       mkdir results
       go run github.com/cometbft/cometbft/test/loadtime/cmd/report@3003ef7 --database-type goleveldb --data-dir ./ > results/report.txt

2. File report.txt contains an unordered list of experiments with varying concurrent connections and transaction rate.
   You will need to separate data per experiment.
   * Create files report1.txt, report2.txt and report4.txt and, for each experiment in file report.txt,
     copy its related lines to the file that matches the number of connections, for example:

         for cnum in 1 2 4; do echo "$cnum"; grep "Connections: $cnum" results/report.txt -B 2 -A 10 > results/report$cnum.txt; done

   * Sort the experiments in report1.txt in ascending tx rate order. Likewise for report2.txt and report4.txt.
   * If a single report is enough for your purposes, just keep report.txt and skip the next step.
3. Generate file report_tabbed.txt by showing the contents of report1.txt, report2.txt and report4.txt side by side.
   * Replace tabs by spaces in each column file, for example (GNU sed):

         sed -i.bak 's/\t/ /g' results/report1.txt

   * Merge the column files into one:

         paste results/report1.txt results/report2.txt results/report4.txt | column -s $'\t' -t > report_tabbed.txt

4. Extract the raw latencies of all experiments into a CSV file:

       go run github.com/cometbft/cometbft/test/loadtime/cmd/report@3003ef7 --database-type goleveldb --data-dir ./ --csv results/raw.csv

5. Generate a latency vs throughput plot using the latency_throughput.py script.
   This plot is useful to visualize the saturation point.
   * Alternatively, use the latency_plotter.py script.
     This script generates a series of plots per experiment and configuration that may
     help with visualizing Latency vs Throughput variation.

To extract the Prometheus metrics:
1. Stop the local Prometheus server if it is running as a service (e.g., a systemd unit), and load into it the Prometheus database retrieved from the testnet (prometheus.zip).
2. Run the prometheus_plotter.py script for the time window you want to plot.

This section explains how the Rotating Nodes Test was carried out, for reproducibility purposes.
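In this test, the load is set with make runload LOAD_CONNECTIONS=X LOAD_TX_RATE=Y LOAD_TOTAL_TIME=Z (see the steps below). Before starting a two-hour run, it can help to estimate the load that these values offer to the network. The sketch below is not taken from the QA scripts. It assumes that LOAD_TX_RATE is a per-connection rate in transactions per second and that all connections target a single endpoint, so please confirm both against the load tool you deploy. The X and Y values in the example are illustrative, not measured results.

```python
"""Sketch: estimate the load offered by make runload before starting a run.

Assumptions (not taken from the QA scripts): LOAD_TX_RATE is a per-connection
rate in tx/s, and all connections target a single endpoint.
"""
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadConfig:
    connections: int   # LOAD_CONNECTIONS (X)
    tx_rate: int       # LOAD_TX_RATE (Y), assumed tx/s per connection
    total_time_s: int  # LOAD_TOTAL_TIME (Z), in seconds

    def offered_tps(self) -> int:
        """Transactions per second offered to the network."""
        return self.connections * self.tx_rate

    def total_txs(self) -> int:
        """Nominal number of transactions over the run, if the load runner keeps up."""
        return self.offered_tps() * self.total_time_s

    def make_args(self) -> str:
        """Makefile variable assignments to append to `make runload`."""
        return (f"LOAD_CONNECTIONS={self.connections} "
                f"LOAD_TX_RATE={self.tx_rate} "
                f"LOAD_TOTAL_TIME={self.total_time_s}")


if __name__ == "__main__":
    # Illustrative X and Y only; Z = 7200 s (2 hours) as suggested in the steps.
    cfg = LoadConfig(connections=2, tx_rate=200, total_time_s=7200)
    print("make runload", cfg.make_args())
    print(f"offered load: {cfg.offered_tps()} tx/s, total: {cfg.total_txs()} txs")
```

Compare the offered load with the saturation point found in the 200 node test and keep it below that point.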
1. Follow the instructions in the README.md at the top of the testnet repository to configure Terraform and doctl.
2. Copy file testnet_rotating.toml onto testnet.toml (do NOT commit this change).
3. Set the variable VERSION_TAG to the git hash that is to be tested.
4. Run make terraform-apply EPHEMERAL_SIZE=25.
   * Do NOT forget to run make terraform-destroy as soon as you are done with the tests.
5. Follow the README.md to configure and start the "stable" part of the rotating node testnet.
6. As a sanity check, connect to the Prometheus node's web interface and check the graph for the tendermint_consensus_height metric.
   All nodes should be increasing their heights.
7. Run make runload LOAD_CONNECTIONS=X LOAD_TX_RATE=Y LOAD_TOTAL_TIME=Z.
   * X and Y should reflect a load below the saturation point (the latency vs throughput plot of the 200 node test helps locate it).
   * Z (in seconds) should be big enough to keep the load running throughout the test, until we manually stop it in step 9.
     In principle, a good value for Z is 7200 (2 hours).
8. Run make rotate to start the script that creates the ephemeral nodes, and kills them when they are caught up.
9. Stop the make runload script.
10. Stop make rotate.
11. Run make stop-network.
12. Run make retrieve-data to gather all relevant data from the testnet into the orchestrating machine.
13. Verify that the data was collected without errors. For extra care, run zip -T on the prometheus.zip file and (one of) the blockstore.db.zip file(s).
14. Run make terraform-destroy.

Steps 8 to 10 are highly manual at the moment and will be improved in next iterations.
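While these manual steps are carried out, the chain's progress can also be followed from a script instead of the Prometheus web interface. The sketch below assumes the standard Prometheus HTTP API instant-query endpoint (GET /api/v1/query) on the Prometheus node's port 9090. It also assumes that each scraped node is identified by its instance label. It takes two snapshots of a height metric (tendermint_consensus_height in this test, cometbft_consensus_height in the 200 node test) and reports nodes whose height did not increase. The script name check_heights.py is hypothetical.

```python
"""Sketch: check that every node's consensus height increases over time,
using the Prometheus HTTP API instead of the web interface.
"""
from __future__ import annotations

import json
import sys
import time
import urllib.parse
import urllib.request


def parse_vector(payload: dict) -> dict[str, float]:
    """Extract {instance label: value} from an instant-query JSON response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    return {
        sample["metric"].get("instance", "?"): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


def query_heights(prometheus_url: str, metric: str) -> dict[str, float]:
    """Run an instant query (GET /api/v1/query) and return {instance: height}."""
    url = f"{prometheus_url}/api/v1/query?" + urllib.parse.urlencode({"query": metric})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_vector(json.load(resp))


def stalled_nodes(before: dict[str, float], after: dict[str, float]) -> list[str]:
    """Nodes whose height did not increase, or that vanished from the results."""
    return sorted(node for node, h in before.items() if after.get(node, h) <= h)


if __name__ == "__main__" and len(sys.argv) == 3:
    # e.g. python check_heights.py http://<PROMETHEUS-NODE-IP>:9090 tendermint_consensus_height
    base_url, metric = sys.argv[1], sys.argv[2]
    first = query_heights(base_url, metric)
    time.sleep(30)  # let the chain produce a few blocks between snapshots
    second = query_heights(base_url, metric)
    print("stalled nodes:", stalled_nodes(first, second) or "none")
```

Ephemeral nodes that are still catching up, or that were just killed by make rotate, can legitimately show up as stalled, so the output is a hint rather than a verdict.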
To obtain a latency plot, follow the instructions above for the 200 node experiment,
but note that the report.txt file contains only one experiment.
As for Prometheus, the same method as for the 200 node experiment can be applied.
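Here the report holds a single experiment, so no splitting is needed. For the 200 node experiment, however, the grep loop and the manual sort by tx rate can be combined in one script. The sketch below mirrors the grep -B 2 -A 10 window used above. It assumes that the lines following each "Connections:" line include a "Rate: <number>" line, which is an assumption about the report layout that you should check against your report.txt. The script name split_report.py is hypothetical.

```python
"""Sketch: split report.txt per number of connections, sorted by tx rate.

Mirrors `grep "Connections: N" -B 2 -A 10`: each experiment is taken as the
2 lines before its "Connections:" line and the 10 lines after it.
Assumption: the lines after "Connections:" include a "Rate: <number>" line.
"""
from __future__ import annotations

import re
import sys
from collections import defaultdict
from pathlib import Path

CONN_RE = re.compile(r"\s*Connections:\s*(\d+)")
RATE_RE = re.compile(r"\s*Rate:\s*(\d+)")


def first_rate(lines: list[str]) -> int:
    """Rate of the first "Rate:" line found, or 0 if there is none."""
    for line in lines:
        match = RATE_RE.match(line)
        if match:
            return int(match.group(1))
    return 0


def split_report(text: str) -> dict[int, list[str]]:
    """Return {connections: experiment blocks in ascending tx rate order}."""
    lines = text.splitlines()
    groups: dict[int, list[tuple[int, str]]] = defaultdict(list)
    for i, line in enumerate(lines):
        conn = CONN_RE.match(line)
        if conn is None:
            continue
        block = lines[max(0, i - 2): i + 11]  # same window as grep -B 2 -A 10
        rate = first_rate(lines[i + 1: i + 11])
        groups[int(conn.group(1))].append((rate, "\n".join(block)))
    return {c: [b for _, b in sorted(g, key=lambda item: item[0])]
            for c, g in sorted(groups.items())}


def write_groups(groups: dict[int, list[str]], out_dir: Path) -> None:
    """Write one report<N>.txt file per number of connections."""
    for conn, blocks in groups.items():
        (out_dir / f"report{conn}.txt").write_text("\n\n".join(blocks) + "\n")


if __name__ == "__main__" and len(sys.argv) == 2:
    # e.g. python split_report.py results/report.txt
    report = Path(sys.argv[1])
    write_groups(split_report(report.read_text()), report.parent)
```

The output file names follow the report$cnum.txt pattern of the grep loop, so the sed and paste commands can be applied to them unchanged.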
This section explains how the tests varying the vote extension size were carried out, for reproducibility purposes.
1. Follow the instructions in the README.md at the top of the testnet repository to configure Terraform and doctl.
2. Copy file varyVESize.toml onto testnet.toml (do NOT commit this change).
3. Set the variable VERSION_TAG in the Makefile to the git hash that is to be tested.
4. Follow the README.md to configure and start the testnet.
   * Do NOT forget to run make terraform-destroy as soon as you are done with the tests.
5. Configure the load to be produced:
   * set the makefile variables ROTATE_CONNECTIONS and ROTATE_TX_RATE to values that will produce the desired transaction load;
   * set ROTATE_TOTAL_TIME to 150 (seconds);
   * set ITERATIONS to the number of iterations that each configuration should run for.
6. Execute steps 5-10 of the README.md file at the testnet repository.
7. Repeat the following steps for each desired vote_extension_size:
   1. Update the configuration (needed only when vote_extension_size changes):
      * update vote_extension_size in testnet.toml to the desired value;
      * run make configgen;
      * run

            ANSIBLE_SSH_RETRIES=10 ansible-playbook ./ansible/re-init-testapp.yaml -u root -i ./ansible/hosts --limit=validators -e "testnet_dir=testnet" -f 20

      * run make restart.
   2. Run make runload.
      This will repeat the tests ITERATIONS times every time it is invoked.
   3. Run make retrieve-data.
      This gathers all relevant data from the testnet into the orchestrating machine, inside folder experiments.
      Two subfolders are created: one with the blockstore DB of a CometBFT validator and one with the Prometheus DB data.
      Verify that the data was collected without errors by running zip -T on the prometheus.zip file and (one of) the blockstore.db.zip file(s).
8. Run make terraform-destroy; don't forget that you need to type yes for it to complete.

To obtain a latency plot, follow the instructions above for the 200 node experiment, but:
* the report.txt file contains only one experiment;
* therefore, there is no need for the for loops.

As for Prometheus, the same method as for the 200 node experiment can be applied.
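Because each report here holds a single experiment, a quick numeric summary of the raw latencies can also be computed directly from the CSV file produced with the --csv flag of the report tool. This Python sketch assumes that raw.csv has a header row containing an experiment_id column and a duration_ns column, the latter holding each transaction's latency in nanoseconds. These column names are an assumption, so check them against the header of your file. The 95th percentile uses the simple nearest-rank method, and the script name summarize_latencies.py is hypothetical.

```python
"""Sketch: per-experiment latency summary from the report tool's --csv output.

Assumed columns (check the header of your raw.csv): experiment_id, and
duration_ns holding the latency of one transaction in nanoseconds.
"""
from __future__ import annotations

import csv
import statistics
import sys
from collections import defaultdict
from typing import Iterable


def summarize(rows: Iterable[dict[str, str]]) -> dict[str, dict[str, float]]:
    """Return {experiment_id: {count, mean_ms, p50_ms, p95_ms, max_ms}}."""
    latencies: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        latencies[row["experiment_id"]].append(int(row["duration_ns"]) / 1e6)
    summary = {}
    for exp, values in latencies.items():
        values.sort()
        n = len(values)
        p95_index = (95 * n + 99) // 100 - 1  # nearest-rank: ceil(0.95 * n) - 1
        summary[exp] = {
            "count": n,
            "mean_ms": statistics.fmean(values),
            "p50_ms": statistics.median(values),
            "p95_ms": values[p95_index],
            "max_ms": values[-1],
        }
    return summary


if __name__ == "__main__" and len(sys.argv) == 2:
    # e.g. python summarize_latencies.py results/raw.csv
    with open(sys.argv[1], newline="") as f:
        for exp, stats in summarize(csv.DictReader(f)).items():
            print(exp, {k: round(v, 2) for k, v in stats.items()})
```

Running it once per retrieved blockstore gives one line per vote extension size, which is convenient for a first comparison before producing the plots.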