Method

This document provides a detailed description of the QA process. It is intended to be used by engineers reproducing the experimental setup for future tests of CometBFT.

The (first iteration of the) QA process as described in the RELEASES.md document was applied to version v0.34.x in order to have a set of results acting as benchmarking baseline. This baseline is then compared with results obtained in later versions.

Of the testnet-based test cases described in the releases document, we focused on two: the 200 Node Test and the Rotating Nodes Test.

Software Dependencies

Infrastructure Requirements to Run the Tests

An account at Digital Ocean (DO), with a high droplet limit (>202)
The machine to orchestrate the tests should have the following installed:
- A clone of the testnet repository
  - This repository contains all the scripts mentioned in the remainder of this section
- Digital Ocean CLI
- Terraform CLI
- Ansible CLI
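
Before provisioning anything, it can help to confirm that the required command-line tools are on the PATH of the orchestrating machine. The following is a minimal sketch: the function name `missing_tools` is hypothetical, and the binary names in the usage comment are assumptions about a typical installation.

```shell
# Print every command from the argument list that is not found on PATH.
# Returns 0 only if all of them are present.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Usage (binary names are assumptions; adjust to your installation):
#   missing_tools git doctl terraform ansible-playbook || echo "install the tools listed above"
```

Running the check before `terraform` is invoked avoids discovering a missing tool halfway through the deployment, when droplets are already being billed.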

Requirements for Result Extraction

Matlab or Octave
Prometheus server installed
blockstore DB of one of the full nodes in the testnet
Prometheus DB

200 Node Testnet

Running the test

This section explains how the tests were carried out for reproducibility purposes.

[If you haven’t done it before] Follow steps 1-4 of the README.md at the top of the testnet repository to configure Terraform and doctl.
Copy file testnets/testnet200.toml onto testnet.toml (do NOT commit this change)
Set the variable VERSION_TAG in the Makefile to the git hash that is to be tested.
- If you are running the base test, which implies a homogeneous network (all nodes are running the same version), then make sure makefile variable VERSION2_WEIGHT is set to 0
- If you are running a mixed network, set the variable VERSION_TAG2 to the other version you want deployed in the network. Then, adjust the weight variables VERSION_WEIGHT and VERSION2_WEIGHT to configure the desired proportion of nodes running each of the two configured versions.
Follow steps 5-10 of the README.md to configure and start the 200 node testnet
- WARNING: Do NOT forget to run make terraform-destroy as soon as you are done with the tests (see step 9)
As a sanity check, connect to the Prometheus node’s web interface and check the graph for the COMETBFT_CONSENSUS_HEIGHT metric. All nodes should be increasing their heights.
You now need to start the load runner that will produce transaction load
- If you don’t know the saturation load of the version you are testing, you need to discover it.
  - ssh into the testnet-load-runner, then copy script script/200-node-loadscript.sh and run it from the load runner node.
  - Before running it, you need to edit the script to provide the IP address of a full node. This node will receive all transactions from the load runner node.
  - This script will take about 40 mins to run.
  - It runs 90-second experiments in a loop, each with a different load.
- If you already know the saturation load, you can simply run the test (several times) for 90 seconds with a load somewhat below saturation:
  - set makefile variables ROTATE_CONNECTIONS and ROTATE_TX_RATE to values that will produce the desired transaction load.
  - set ROTATE_TOTAL_TIME to 90 (seconds).
  - run “make runload” and wait for it to complete. You may want to run this several times so the data from different runs can be compared.
Run make retrieve-data to gather all relevant data from the testnet into the orchestrating machine
- Alternatively, you may want to run make retrieve-prometheus-data and make retrieve-blockstore separately. The end result will be the same.
- make retrieve-blockstore accepts the following values in makefile variable RETRIEVE_TARGET_HOST
  - any (the default): picks one full node and retrieves the blockstore from that node only.
  - all: retrieves the blockstore from all full nodes; this is extremely slow, and consumes plenty of bandwidth, so use it with care.
  - the name of a particular full node (e.g., validator01): retrieves the blockstore from that node only.
Verify that the data was collected without errors
- at least one blockstore DB for a CometBFT validator
- the Prometheus database from the Prometheus node
- for extra care, you can run zip -T on the prometheus.zip file and (one of) the blockstore.db.zip file(s)
Run make terraform-destroy
- Don’t forget to type yes! Otherwise you’re in trouble.
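
The verification step above can be partly scripted, ideally before the testnet is destroyed. This is a sketch under stated assumptions: the function names are hypothetical, the archives are assumed to be named prometheus.zip and blockstore.db.zip as mentioned above, and they are assumed to live somewhere below a single directory that you pass as an argument.

```shell
# Succeeds if a non-empty file with the given name exists below the directory.
has_archive() {
  f=$(find "$1" -type f -name "$2" | head -n 1)
  [ -n "$f" ] && [ -s "$f" ]
}

# Check that both kinds of archive were retrieved.
# Argument: directory holding the retrieved data (assumed layout).
have_artifacts() {
  has_archive "$1" prometheus.zip || { echo "missing: prometheus.zip" >&2; return 1; }
  has_archive "$1" blockstore.db.zip || { echo "missing: blockstore.db.zip" >&2; return 1; }
}

# Run `zip -T` on every archive below the directory; stops at the first failure.
test_archives() {
  find "$1" -type f -name '*.zip' | while IFS= read -r f; do
    zip -T "$f" || exit 1
  done
}

# Usage:
#   have_artifacts ./retrieved && test_archives ./retrieved
```

The presence check is cheap and catches an interrupted retrieval; the integrity check takes longer on large archives, so it is kept as a separate function.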

Result Extraction

The method for extracting the results described here is highly manual (and exploratory) at this stage. The CometBFT team should improve it at every iteration to increase the amount of automation.

Steps

Unzip the blockstore into a directory

Extract the latency report and the raw latencies for all the experiments. Run these commands from the directory containing the blockstore

 mkdir results
 go run github.com/cometbft/cometbft/test/loadtime/cmd/report@f1aaa436d --database-type goleveldb --data-dir ./ > results/report.txt
 go run github.com/cometbft/cometbft/test/loadtime/cmd/report@f1aaa436d --database-type goleveldb --data-dir ./ --csv results/raw.csv
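
As a quick check that the report was generated as expected, you can count the experiments it contains and list the connection counts that appear in it. This is a minimal sketch with hypothetical function names; it assumes the report uses the `Experiment ID: ` and `Connections: ` labels that the other commands in this section rely on.

```shell
# Print the number of experiments found in a loadtime report file.
count_experiments() {
  grep -c '^Experiment ID: ' "$1"
}

# Print the distinct connection counts that appear in the report, one per line.
list_connections() {
  sed -n 's/.*Connections: \([0-9][0-9]*\).*/\1/p' "$1" | sort -nu
}

# Usage:
#   count_experiments results/report.txt
#   list_connections results/report.txt
```

If the experiment count does not match the number of load runs you performed, the blockstore may have been retrieved before the last run was committed.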

File report.txt contains an unordered list of experiments with varying concurrent connections and transaction rate
- If you are looking for the saturation point
  - Create files report01.txt, report02.txt, report04.txt and, for each experiment in file report.txt, copy its related lines to the filename that matches the number of connections, for example
    for cnum in 1 2 4; do echo "$cnum"; grep "Connections: $cnum" results/report.txt -B 2 -A 10 > results/report0$cnum.txt; done
  - Sort the experiments in report01.txt in ascending tx rate order. Likewise for report02.txt and report04.txt.
- Otherwise just keep report.txt, and skip step 4.
Generate file report_tabbed.txt by showing the contents of report01.txt, report02.txt, report04.txt side by side
- This effectively creates a table where rows are a particular tx rate and columns are a particular number of websocket connections.
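
One way to place the files side by side is the standard `paste` utility, which joins the corresponding lines of its inputs with tabs. The sketch below is only one possible approach, and the function name is hypothetical. It assumes the per-connection files have already been sorted by tx rate and contain the same number of lines per experiment, since otherwise the rows will not line up.

```shell
# Join per-connection report files column-wise, separated by tabs.
# Arguments: output file, then the input files in column order.
make_tabbed_report() {
  out=$1
  shift
  paste "$@" > "$out"
}

# Usage:
#   make_tabbed_report report_tabbed.txt report01.txt report02.txt report04.txt
```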

Extract the raw latencies from file raw.csv using the following bash loop. This creates a .csv file and a .dat file per experiment. The format of the .dat files is amenable to loading them as matrices in Octave.

Adapt the values of the for loop variables according to the experiments that you ran (check report.txt).
Adapt report*.txt to the files you produced in step 3.

 # Bash arrays are zero-indexed, so the counter starts at 0.
 uuids=($(cat report01.txt report02.txt report04.txt | grep '^Experiment ID: ' | awk '{ print $3 }'))
 c=0
 rm -f *.dat
 for i in 01 02 04; do
   for j in 0025 0050 0100 0200; do
     echo $i $j $c "${uuids[$c]}"
     filename=c${i}_r${j}
     grep "${uuids[$c]}" raw.csv > ${filename}.csv
     cat ${filename}.csv | tr , ' ' | awk '{ print $2, $3 }' >> ${filename}.dat
     c=$(expr $c + 1)
   done
 done
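
To make the per-experiment conversion in the loop above easier to follow, here is the same transformation for a single experiment, written as a function. It is a sketch with a hypothetical function name: like the `grep` in the loop, it selects every CSV row that contains the experiment UUID, and like the `awk` call it keeps only the second and third comma-separated fields.

```shell
# Write the 2nd and 3rd CSV fields of every row mentioning the given
# experiment UUID to a space-separated file that Octave can load.
# Arguments: experiment UUID, raw CSV file, output .dat file.
extract_experiment() {
  uuid=$1
  csv=$2
  dat=$3
  awk -F, -v id="$uuid" 'index($0, id) { print $2, $3 }' "$csv" > "$dat"
}

# Usage (the UUID is a placeholder; take real ones from report.txt):
#   extract_experiment "<experiment-uuid>" raw.csv c01_r0025.dat
```

Each resulting .dat file has two numeric columns per line, which is the shape the Octave snippets in the following steps expect when they index `m(:,1)` and `m(:,2)`.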

Enter Octave

Load all .dat files generated in step 5 into matrices using this Octave code snippet

 conns =  { "01"; "02"; "04" };
 rates =  { "0025"; "0050"; "0100"; "0200" };
 for i = 1:length(conns)
   for j = 1:length(rates)
     filename = strcat("c", conns{i}, "_r", rates{j}, ".dat");
     load("-ascii", filename);
   endfor
 endfor

Set variable release to the current release undergoing QA
```
 release = "v0.34.x";
```

Generate a plot with all (or some) experiments, where the x axis is the experiment time and the y axis is the transaction latency. The following snippet plots all experiments.

 legends = {};
 hold off;
 for i = 1:length(conns)
   for j = 1:length(rates)
     data_name = strcat("c", conns{i}, "_r", rates{j});
     l = strcat("c=", conns{i}, " r=", rates{j});
     m = eval(data_name); plot((m(:,1) - min(m(:,1))) / 1e+9, m(:,2) / 1e+9, ".");
     hold on;
     legends(1, end+1) = l;
   endfor
 endfor
 legend(legends, "location", "northeastoutside");
 xlabel("experiment time (s)");
 ylabel("latency (s)");
 t = sprintf("200-node testnet - %s", release);
 title(t);

If you want to compare your results to the baseline, consider adjusting the axes, for instance:
```
axis([0, 100, 0, 30], "tic");
```
Use Octave’s GUI menu to save the plot (e.g. as .png)
Repeat steps 9 and 10 to obtain as many plots as deemed necessary.
To generate a latency vs throughput plot, using the raw CSV file generated in step 2, follow the instructions for the latency_throughput.py script. This plot is useful to visualize the saturation point.

Alternatively, follow the instructions for the latency_plotter.py script. This script generates a series of plots per experiment and configuration that may help with visualizing Latency vs Throughput variation.

Extracting Prometheus Metrics

Stop the Prometheus server if it is running as a service (e.g. a systemd unit).
Unzip the Prometheus database retrieved from the testnet, and move it to replace the local Prometheus database.
Start the Prometheus server and make sure no error logs appear at start up.
Identify the time window you want to plot in your graphs.
Execute the prometheus_plotter.py script for the time window.
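
The first three steps above can be sketched as a small helper that, by default, only prints the commands it would run. Everything specific in it is an assumption to adapt to your machine: the systemd unit name `prometheus`, the default data directory /var/lib/prometheus, and the idea that the archive unpacks directly into the data directory. The function name is hypothetical, and paths containing spaces are not handled.

```shell
# Print (DRY_RUN=1, the default) or execute (DRY_RUN=0) the commands that
# swap the retrieved Prometheus database into place.
# Arguments: path to the zip archive, optional data directory.
# Assumed: systemd unit "prometheus", data directory /var/lib/prometheus.
swap_prometheus_db() {
  zipfile=$1
  datadir=${2:-/var/lib/prometheus}
  dry=${DRY_RUN:-1}
  for cmd in \
    "systemctl stop prometheus" \
    "mv $datadir $datadir.bak" \
    "mkdir -p $datadir" \
    "unzip -q $zipfile -d $datadir" \
    "systemctl start prometheus"
  do
    if [ "$dry" = 1 ]; then
      echo "$cmd"
    else
      sh -c "$cmd" || return 1
    fi
  done
}

# Usage: review the printed commands first, then rerun with DRY_RUN=0.
#   swap_prometheus_db prometheus.zip
```

Keeping the previous data directory as a `.bak` copy makes it possible to restore the local database once the plots have been produced.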

Rotating Node Testnet

Running the test

This section explains how the tests were carried out for reproducibility purposes.

[If you haven’t done it before] Follow steps 1-4 of the README.md at the top of the testnet repository to configure Terraform and doctl.
Copy file testnet_rotating.toml onto testnet.toml (do NOT commit this change)
Set variable VERSION_TAG to the git hash that is to be tested.
Run make terraform-apply EPHEMERAL_SIZE=25
- WARNING: Do NOT forget to run make terraform-destroy as soon as you are done with the tests
Follow steps 6-10 of the README.md to configure and start the “stable” part of the rotating node testnet
As a sanity check, connect to the Prometheus node’s web interface and check the graph for the tendermint_consensus_height metric. All nodes should be increasing their heights.
On a different shell,
- run make runload ROTATE_CONNECTIONS=X ROTATE_TX_RATE=Y
- X and Y should reflect a load below the saturation point (see the discussion of the saturation load in the 200 Node Testnet section above)
Run make rotate to start the script that creates the ephemeral nodes, and kills them when they are caught up.
- WARNING: If you run this command from your laptop, the laptop needs to be up and connected for the full length of the experiment.
When the height of the chain reaches 3000, stop the make runload script
When the rotate script has made two iterations (i.e., all ephemeral nodes have caught up twice) after height 3000 was reached, stop make rotate
Run make retrieve-data to gather all relevant data from the testnet into the orchestrating machine
Verify that the data was collected without errors
- at least one blockstore DB for a CometBFT validator
- the Prometheus database from the Prometheus node
- for extra care, you can run zip -T on the prometheus.zip file and (one of) the blockstore.db.zip file(s)
Run make terraform-destroy

Steps 8 to 10 are highly manual at the moment and will be improved in future iterations.
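
As one possible step toward automating the wait for height 3000, the chain height can be polled through the Prometheus HTTP query API. This is only a sketch, not part of the testnet repository: the function names and the `PROM_URL` and `POLL_SECONDS` variables are hypothetical, the metric name is assumed to be the one used in the sanity check above, and `curl` and `jq` are assumed to be installed.

```shell
# Query Prometheus for the highest consensus height across all nodes.
# Assumed: PROM_URL points at the testnet's Prometheus server.
get_height() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode 'query=max(tendermint_consensus_height)' |
    jq -r '.data.result[0].value[1]'
}

# Block until get_height reports at least the target height.
wait_for_height() {
  target=$1
  while :; do
    h=$(get_height)
    case $h in
      '' | *[!0-9]*) ;;   # no usable sample yet; keep polling
      *)
        if [ "$h" -ge "$target" ]; then
          return 0
        fi
        ;;
    esac
    sleep "${POLL_SECONDS:-10}"
  done
}

# Usage:
#   PROM_URL=http://<prometheus-node-ip>:9090 wait_for_height 3000 && echo "height reached"
```

Separating the query from the polling loop keeps the loop independent of how the height is obtained, so `get_height` can be replaced by any other source, such as a node's RPC endpoint.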

Result Extraction

In order to obtain a latency plot, follow the instructions above for the 200 node experiment, but:

The report.txt file contains only one experiment
Therefore, no need for any for loops

As for Prometheus, the same method as for the 200 node experiment can be applied.