This document provides a detailed description of the QA process. It is intended to be used by engineers reproducing the experimental setup for future tests of CometBFT.
The (first iteration of the) QA process, as described in the RELEASES.md document, was applied to version v0.34.x in order to obtain a set of results acting as a benchmarking baseline. This baseline is then compared with results obtained in later versions. See RELEASES.md for a description of the tests that we run in the QA process.
Out of the testnet-based test cases described in the releases document, we focused on two of them: the 200 Node Test and the Rotating Nodes Test.

## 200 Node Testnet

This test consists of spinning up 200 nodes (175 validators + 20 full nodes + 5 seed nodes) and performing two experiments:
The script `200-node-loadscript.sh` runs multiple transaction load instances with all possible combinations of the following parameters: number of websocket connections and transaction rate. Additionally, each load instance takes `120 + rate/60` seconds more.

### Running the test

This section explains how the tests were carried out, for reproducibility purposes.
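As a rough scheduling aid, the `120 + rate/60` duration formula above can be sketched in Python. The rates below are illustrative assumptions, not values taken from `200-node-loadscript.sh`:

```python
def instance_duration_seconds(rate: float) -> float:
    """Duration formula from the text: 120 + rate/60 seconds per load instance."""
    return 120 + rate / 60

# Illustrative transaction rates (tx/s); the script's actual values may differ.
for rate in (200, 1000, 4000):
    print(f"rate={rate:>4} tx/s -> {instance_duration_seconds(rate):.1f} s")
```

Summing this over all parameter combinations gives a rough lower bound on the total runtime of the script.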
1. Follow the `README.md` at the top of the testnet repository to configure Terraform and the DigitalOcean CLI (`doctl`).
2. In the `experiment.mk` file, set the following variables (do NOT commit these changes):
   - `MANIFEST` to point to the file `testnets/200-nodes-with-zones.toml`.
   - `VERSION_TAG` to the git hash that is to be tested.
   - To test a single version, make sure `VERSION2_WEIGHT` is set to 0. To test a mixed network, set `VERSION2_TAG` to the other version you want deployed in the network; then adjust the weight variables `VERSION_WEIGHT` and `VERSION2_WEIGHT` to configure the desired proportion of nodes running each of the two configured versions.
3. Follow the `README.md` to configure and start the 200 node testnet.
   - WARNING: Do not forget to run `make terraform-destroy`
     as soon as you are done with the tests (see the last step below).
4. As a sanity check, connect to the Prometheus node's web interface (port 9090) and check the graph for the `cometbft_consensus_height` metric. All nodes should be increasing their heights.
   - Run `ansible --list-hosts prometheus` to obtain the Prometheus node's IP address.
   - The following URL will display the metrics `cometbft_consensus_height` and `cometbft_mempool_size`:

     ```
     http://<PROMETHEUS-NODE-IP>:9090/classic/graph?g0.range_input=1h&g0.expr=cometbft_consensus_height&g0.tab=0&g1.range_input=1h&g1.expr=cometbft_mempool_size&g1.tab=0
     ```
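Besides the browser URL above, the same metrics can be pulled programmatically through Prometheus's standard HTTP API (`/api/v1/query`). The sketch below only builds the request URL; `<PROMETHEUS-NODE-IP>` is the same placeholder as above:

```python
from urllib.parse import urlencode

def prometheus_query_url(host: str, metric: str, port: int = 9090) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"http://{host}:{port}/api/v1/query?" + urlencode({"query": metric})

for metric in ("cometbft_consensus_height", "cometbft_mempool_size"):
    print(prometheus_query_url("<PROMETHEUS-NODE-IP>", metric))
```

Fetching one of these URLs (e.g., with `curl`) returns a JSON document with one sample per node, which can be handier than the graph view for scripted sanity checks.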
5. Run `make loadrunners-init`, in case the load runner is not yet initialised. This will copy the loader scripts to the `testnet-load-runner` node and install the load tool.
   - Run `ansible --list-hosts loadrunners` to find the IP address of the `testnet-load-runner` node.
6. `ssh` into `testnet-load-runner`.
   - Start a `tmux` session in case the ssh session breaks.
   - `tmux` quick cheat sheet: `ctrl-b a` to attach to an existing session; `ctrl-b %` to split the current pane vertically; `ctrl-b ;` to toggle to the last active pane.
7. Select a full node to receive the transaction load (e.g., `validator000`). This node will receive all transactions from the load runner node.
8. Run `/root/200-node-loadscript.sh <INTERNAL_IP>` from the load runner node, where `<INTERNAL_IP>` is the internal IP address of a full node.
   - The aggregated results of these load instances will end up in the file `report_tabbed.txt` (see the result extraction section below).
9. Set `LOAD_CONNECTIONS` and `LOAD_TX_RATE` to values that will produce the desired transaction load.
10. Set `LOAD_TOTAL_TIME` to 91 (seconds). The extra second is because the last transaction batch coincides with the end of the experiment and is thus not sent.
11. Run `make runload` and wait for it to complete. You may want to run this several times so the data from different runs can be compared.
12. Run `make retrieve-data`
    to gather all relevant data from the testnet into the orchestrating machine.
    - Alternatively, you may run `make retrieve-prometheus-data` and `make retrieve-blockstore` separately; the end result will be the same.
    - `make retrieve-blockstore` accepts the following values in the makefile variable `RETRIEVE_TARGET_HOST`:
      - `any` (the default): picks a full node and retrieves the blockstore from that node only.
      - `all`: retrieves the blockstore from all full nodes; this is extremely slow and consumes plenty of bandwidth, so use it with care.
      - the name of a particular node (e.g., `validator01`): retrieves the blockstore from that node only.
13. Verify that the data was collected without errors: run `zip -T` on the `prometheus.zip` file and (one of) the `blockstore.db.zip` file(s).
14. Run `make terraform-destroy`.
    - Don't forget to type `yes`! Otherwise you're in trouble.

### Result Extraction

The method for extracting the results described here is highly manual (and exploratory) at this stage. The CometBFT team should improve it at every iteration to increase the amount of automation.
For identifying the saturation point, run from the `qa-infra` repository:

```bash
./script/reports/saturation-gen-table.sh <experiments-blockstore-dir>
```

where `<experiments-blockstore-dir>` is the directory where the results of the experiments were downloaded. This directory should contain the file `blockstore.db.zip`. The script will automatically:

- Unzip `blockstore.db.zip`, if not already unzipped.
- Run the tool `test/loadtime/cmd/report` to extract data for all instances with different transaction load.
- Generate the file `report.txt`, which contains an unordered list of experiment results with varying concurrent connections and transaction rates.
- Generate the file `report_tabbed.txt`, with results formatted as a matrix, where rows are a particular tx rate and columns are a particular number of websocket connections.
- Generate the file `saturation_table.tsv`, which just contains columns with the number of processed transactions; this is handy to create a Markdown table for the report.

For generating images on latency, run from the `qa-infra` repository:

```bash
./script/reports/latencies-gen-images.sh <experiments-blockstore-dir>
```

As above, `<experiments-blockstore-dir>` should contain the file `blockstore.db.zip`. The script will automatically:

- Unzip `blockstore.db.zip`, if not already unzipped.
- Generate `results/raw.csv` using the tool `test/loadtime/cmd/report`.
- Generate plots using `latency_throughput.py`; this plot is useful to visualize the saturation point.
- Generate plots using `latency_plotter.py`; these plots may help with visualizing latency vs. throughput variation.

For processing the Prometheus data, run from the `qa-infra` repository:

```bash
./script/reports/prometheus-start-local.sh <experiments-prometheus-dir>
```

where `<experiments-prometheus-dir>` is the directory where the results of the experiments were downloaded. This directory should contain the file `prometheus.zip`. This script will:

- Start a local Prometheus server listening on `localhost:9090`, bootstrapping the downloaded data as its database.

Then run:

```bash
./script/reports/prometheus-gen-images.sh <experiments-prometheus-dir> <start-time> <duration> [<test-case>] [<release-name>]
```

where `<start-time>` is in the format `'%Y-%m-%dT%H:%M:%SZ'` and `<duration>` is in seconds. This will download and set up a Python virtual environment with the required dependencies, and execute the script `prometheus_plotter.py`. The optional parameter `<test-case>` is one of `200_nodes` (default), `rotating`, and `vote_extensions`; `<release-name>` is just used in the title of the plot.
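Getting `<start-time>` into the exact `'%Y-%m-%dT%H:%M:%SZ'` shape is easy to fumble. A small helper, assuming the timestamp should be expressed in UTC (hence the trailing `Z`), could look like:

```python
from datetime import datetime, timezone

def to_start_time(unix_ts: int) -> str:
    """Format a Unix timestamp as the <start-time> argument: %Y-%m-%dT%H:%M:%SZ (UTC)."""
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

print(to_start_time(1700000000))  # 2023-11-14T22:13:20Z
```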
## Rotating Node Testnet

### Running the test

This section explains how the tests were carried out, for reproducibility purposes.

1. Follow the `README.md` at the top of the testnet repository to configure Terraform and `doctl`.
2. In the `experiment.mk` file, set the following variables (do NOT commit these changes):
   - `MANIFEST` to point to the file `testnets/rotating.toml`.
   - `VERSION_TAG` to the git hash that is to be tested.
   - `EPHEMERAL_SIZE` to 25.
3. Follow the `README.md` to configure and start the rotating node testnet.
   - Do not forget to run `make terraform-destroy` as soon as you are done with the tests.
4. As a sanity check, connect to the Prometheus node's web interface and check the graph for the `cometbft_consensus_height` metric. All nodes should be increasing their heights.
5. Run `make loadrunners-init` to initialize the load runner.
6. Run `make runload ITERATIONS=1 LOAD_CONNECTIONS=X LOAD_TX_RATE=Y LOAD_TOTAL_TIME=Z`.
   - `X` and `Y` should reflect a load below the saturation point (see, e.g., the discussion of the saturation point above for further info).
   - `Z` (in seconds) should be big enough to keep the load running throughout the test, until we manually stop it in a later step. In principle, a good value for `Z` is `7200` (2 hours).
7. Run `make rotate` to start the script that creates the ephemeral nodes, and kills them when they are caught up.
8. Once you consider the test complete, stop the `make runload` script.
9. Stop `make rotate`.
10. Run `make stop-network`.
11. Run `make retrieve-data` to gather all relevant data from the testnet into the orchestrating machine.
12. Verify that the data was collected without errors: run `zip -T` on the `prometheus.zip` file and (one of) the `blockstore.db.zip` file(s).
13. Run `make terraform-destroy`.
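As a sanity check on the amount of load injected, and assuming the batching behavior described for the 200 node test (one batch per second, with the batch coinciding with the end of the run left unsent), the expected transaction count can be estimated. Whether the rate applies per connection or in aggregate depends on the load tool, so treat this as a rough sketch:

```python
def expected_txs(rate: float, total_time: int, connections: int = 1) -> int:
    """Rough estimate: one batch of `rate` txs per second per connection,
    with the final batch (coinciding with the end of the run) not sent."""
    return int(rate * (total_time - 1) * connections)

# Illustrative values: LOAD_TX_RATE=500, LOAD_TOTAL_TIME=7200 (2 hours).
print(expected_txs(500, 7200))  # 3599500
```

Comparing this estimate against the number of transactions reported after the run can reveal dropped load early.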
### Result Extraction

The result extraction steps are highly manual at the moment and will be improved in future iterations.

In order to obtain a latency plot, follow the instructions above for the 200 node experiment, noting that the `results.txt` file contains only one experiment.

As for Prometheus, the same method as for the 200 node experiment can be applied.
## Vote Extensions Testnet

### Running the test

This section explains how the tests were carried out, for reproducibility purposes.

1. Follow the `README.md` at the top of the testnet repository to configure Terraform and `doctl`.
2. In the `experiment.mk` file, set the following variables (do NOT commit these changes):
   - `MANIFEST` to point to the file `testnets/varyVESize.toml`.
   - `VERSION_TAG` to the git hash that is to be tested.
3. Follow the `README.md` to configure and start the testnet.
   - Do not forget to run `make terraform-destroy` as soon as you are done with the tests.
4. Set `ROTATE_CONNECTIONS` and `ROTATE_TX_RATE` to values that will produce the desired transaction load.
5. Set `ROTATE_TOTAL_TIME` to 150 (seconds).
6. Set `ITERATIONS` to the number of iterations that each configuration should run for.
7. Follow the instructions in the `README.md`
   file at the testnet repository. Repeat the following steps for each value of `vote_extension_size` to be tested:
   1. Set `vote_extensions_size` in the `testnet.toml` to the desired value.
   2. Run `make configgen`.
   3. Run `ANSIBLE_SSH_RETRIES=10 ansible-playbook ./ansible/re-init-testapp.yaml -u root -i ./ansible/hosts --limit=validators -e "testnet_dir=testnet" -f 20`.
   4. Run `make restart`.
   5. Run `make runload`. This will repeat the tests `ITERATIONS` times every time it is invoked.
   6. Run `make retrieve-data` to gather all relevant data from the testnet into the orchestrating machine, inside folder `experiments`.
      Two subfolders are created: one with the blockstore DB of a CometBFT validator and one with the Prometheus DB data.
   7. Verify that the data was collected without errors: run `zip -T` on the `prometheus.zip` file and (one of) the `blockstore.db.zip` file(s).
8. Run `make terraform-destroy`; don't forget that you need to type `yes` for it to complete.

### Result Extraction

In order to obtain a latency plot, follow the instructions above for the 200 node experiment, but:

- the `results.txt` file contains only one experiment;
- the `for` loops (which iterate over multiple experiments) are thus not needed.

As for Prometheus, the same method as for the 200 node experiment can be applied.