Tendermint Core QA Results v0.37.x

Issues discovered

During this iteration of the QA process, the following issues were found:

(critical, fixed) [#9533] - This bug caused full nodes to sometimes get stuck when blocksyncing, requiring a manual restart to unblock them. Importantly, this bug was also present in v0.34.x and the fix was also backported in [#9534].
(critical, fixed) [#9539] - loadtime is very likely to include more than one “=” character in transactions, with is rejected by the e2e application.
(critical, fixed) [#9581] - Absent prometheus label makes CometBFT crash when enabling Prometheus metric collection
(non-critical, not fixed) [#9548] - Full nodes can go over 50 connected peers, which is not intended by the default configuration.
(non-critical, not fixed) [#9537] - With the default mempool cache setting, duplicated transactions are not rejected when gossipped and eventually flood all mempools. The 200 node testnets were thus run with a value of 200000 (as opposed to the default 10000)

200 Node Testnet

Finding the Saturation Point

The first goal is to identify the saturation point and compare it with the baseline (v0.34.x). For further details, see this paragraph in the baseline version.

The following table summarizes the results for v0.37.x, for the different experiments (extracted from file v037_report_tabbed.txt).

The X axis of this table is c, the number of connections created by the load runner process to the target node. The Y axis of this table is r, the rate or number of transactions issued per second.

	c=1	c=2	c=4
r=25	2225	4450	8900
r=50	4450	8900	17800
r=100	8900	17800	35600
r=200	17800	35600	38660

For comparison, this is the table with the baseline version.

	c=1	c=2	c=4
r=25	2225	4450	8900
r=50	4450	8900	17800
r=100	8900	17800	35400
r=200	17800	35600	37358

The saturation point is beyond the diagonal:

r=200,c=2
r=100,c=4

which is at the same place as the baseline. For more details on the saturation point, see this paragraph in the baseline version.

The experiment chosen to examine Prometheus metrics is the same as in the baseline: r=200,c=2.

The load runner’s CPU load was negligible (near 0) when running r=200,c=2.

Examining latencies

The method described here allows us to plot the latencies of transactions for all experiments.

all-latencies

The data seen in the plot is similar to that of the baseline.

all-latencies-bl

Therefore, for further details on these plots, see this paragraph in the baseline version.

The following plot summarizes average latencies versus overall throughputs across different numbers of WebSocket connections to the node into which transactions are being loaded.

latency-vs-throughput

This is similar to that of the baseline plot:

latency-vs-throughput-bl

Prometheus Metrics on the Chosen Experiment

As mentioned above, the chosen experiment is r=200,c=2. This section further examines key metrics for this experiment extracted from Prometheus data.

Mempool Size

The mempool size, a count of the number of transactions in the mempool, was shown to be stable and homogeneous at all full nodes. It did not exhibit any unconstrained growth. The plot below shows the evolution over time of the cumulative number of transactions inside all full nodes’ mempools at a given time.

mempool-cumulative

The plot below shows evolution of the average over all full nodes, which oscillate between 1500 and 2000 outstanding transactions.

mempool-avg

The peaks observed coincide with the moments when some nodes reached round 1 of consensus (see below).

These plots yield similar results to the baseline:

mempool-cumulative-bl

mempool-avg-bl

Peers

The number of peers was stable at all nodes. It was higher for the seed nodes (around 140) than for the rest (between 16 and 78).

peers

Just as in the baseline, the fact that non-seed nodes reach more than 50 peers is due to #9548.

This plot yields similar results to the baseline:

peers-bl

Consensus Rounds per Height

Most heights took just one round, but some nodes needed to advance to round 1 at some point.

rounds

This plot yields slightly better results than the baseline:

rounds-bl

Blocks Produced per Minute, Transactions Processed per Minute

The blocks produced per minute are the gradient of this plot.

heights

Over a period of 2 minutes, the height goes from 477 to 524. This results in an average of 23.5 blocks produced per minute.

The transactions processed per minute are the gradient of this plot.

total-txs

Over a period of 2 minutes, the total goes from 64525 to 100125 transactions, resulting in 17800 transactions per minute. However, we can see in the plot that all transactions in the load are process long before the two minutes. If we adjust the time window when transactions are processed (approx. 90 seconds), we obtain 23733 transactions per minute.

These plots yield similar results to the baseline:

heights-bl

total-txs

Memory Resident Set Size

Resident Set Size of all monitored processes is plotted below.

rss

The average over all processes oscillates around 380 MiB and does not demonstrate unconstrained growth.

rss-avg

These plots yield similar results to the baseline:

rss-bl

rss-avg-bl

CPU utilization

The best metric from Prometheus to gauge CPU utilization in a Unix machine is load1, as it usually appears in the output of top.

load1

It is contained below 5 on most nodes.

This plot yields similar results to the baseline:

load1

Test Result

Result: PASS

Date: 2022-10-14

Version: 1cf9d8e276afe8595cba960b51cd056514965fd1

Rotating Node Testnet

We use the same load as in the baseline: c=4,r=800.

Just as in the baseline tests, the version of CometBFT used for these tests is affected by #9539. See this paragraph in the baseline report for further details. Finally, note that this setup allows for a fairer comparison between this version and the baseline.

Latencies

The plot of all latencies can be seen here.

rotating-all-latencies

Which is similar to the baseline.

rotating-all-latencies-bl

Note that we are comparing against the baseline plot with unique transactions. This is because the problem with duplicate transactions detected during the baseline experiment did not show up for v0.37, which is not proof that the problem is not present in v0.37.

Prometheus Metrics

The set of metrics shown here match those shown on the baseline (v0.34) for the same experiment. We also show the baseline results for comparison.

Blocks and Transactions per minute

The blocks produced per minute are the gradient of this plot.

rotating-heights

Over a period of 4446 seconds, the height goes from 5 to 3323. This results in an average of 45 blocks produced per minute, which is similar to the baseline, shown below.

rotating-heights-bl

The following two plots show only the heights reported by ephemeral nodes. The second plot is the baseline plot for comparison.

rotating-heights-ephe

rotating-heights-ephe-bl

By the length of the segments, we can see that ephemeral nodes in v0.37 catch up slightly faster.

The transactions processed per minute are the gradient of this plot.

rotating-total-txs

Over a period of 3852 seconds, the total goes from 597 to 267298 transactions in one of the validators, resulting in 4154 transactions per minute, which is slightly lower than the baseline, although the baseline had to deal with duplicate transactions.

For comparison, this is the baseline plot.

rotating-total-txs-bl

Peers

The plot below shows the evolution of the number of peers throughout the experiment.

rotating-peers

This is the baseline plot, for comparison.

rotating-peers-bl

The plotted values and their evolution are comparable in both plots.

For further details on these plots, see the baseline report.

Memory Resident Set Size

The average Resident Set Size (RSS) over all processes looks slightly more stable on v0.37 (first plot) than on the baseline (second plot).

rotating-rss-avg

rotating-rss-avg-bl

The memory taken by the validators and the ephemeral nodes when they are up is comparable (not shown in the plots), just as observed in the baseline.

CPU utilization

The plot shows metric load1 for all nodes.

rotating-load1

rotating-load1-bl

In both cases, it is contained under 5 most of the time, which is considered normal load. The green line in the v0.37 plot and the purple line in the baseline plot (v0.34) correspond to the validators receiving all transactions, via RPC, from the load runner process. In both cases, they oscillate around 5 (normal load). The main difference is that other nodes are generally less loaded in v0.37.

Test Result

Result: PASS

Date: 2022-10-10

Version: 155110007b9d8b83997a799016c1d0844c8efbaf