Evaluation

CanopyRS provides tools for finding optimal NMS parameters and benchmarking models on test datasets.

Finding optimal NMS parameters

To find the optimal NMS parameters for your model (nms_iou_threshold and nms_score_threshold), use the find_optimal_raster_nms.py tool script. This script runs a grid search over NMS parameters and evaluates results using COCO evaluation metrics at a chosen IoU threshold.
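Conceptually, the script evaluates every (nms_iou_threshold, nms_score_threshold) pair on a grid and keeps the best-scoring combination. A minimal sketch of the idea, where evaluate is a stand-in for the tool's actual COCO evaluation:

from itertools import product

def evaluate(nms_iou: float, nms_score: float) -> float:
    """Stand-in for the tool's COCO evaluation at the chosen IoU threshold."""
    ...

# Candidate grids; the programmatic examples below use the same 0.05-step grids.
grid = product([i / 20 for i in range(1, 21)], [i / 20 for i in range(1, 21)])
best_iou, best_score = max(grid, key=lambda pair: evaluate(*pair))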

IoU threshold options

  • --eval_iou_threshold 0.75 for RF1₇₅ (default)
  • --eval_iou_threshold 50:95 for a COCO-style sweep (RF1₅₀:₉₅)
  • Comma-separated lists are also accepted, e.g. --eval_iou_threshold 0.50,0.65,0.80 (see the parsing sketch below)
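For reference, here is a minimal sketch of how these threshold specifications could be parsed into a list of IoU values. The helper is illustrative only, not part of the CanopyRS API:

def parse_eval_iou_threshold(spec: str) -> list[float]:
    """Parse '0.75', '50:95', or '0.50,0.65,0.80' into a list of IoU values."""
    if ":" in spec:  # COCO-style sweep: '50:95' -> 0.50, 0.55, ..., 0.95
        lo, hi = (int(part) for part in spec.split(":"))
        return [i / 100 for i in range(lo, hi + 1, 5)]
    return [float(v) for v in spec.split(",")]  # single value or comma-separated list

print(parse_eval_iou_threshold("50:95"))  # [0.5, 0.55, ..., 0.95]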

Example: Finding NMS parameters for RF1₇₅

To find NMS parameters for the DINO Swin-L multi-NQOS detector on the validation set of SelvaBox and Detectree2:

Linux/macOS (bash):

python -m canopyrs.tools.detection.find_optimal_raster_nms \
  -c canopyrs/config/detectors/dino_swinL_multi_NQOS.yaml \
  -d SelvaBox Detectree2 \
  -r <DATA_ROOT> \
  -o <OUTPUT_PATH> \
  --n_workers 6 \
  --eval_iou_threshold 0.75

Windows (PowerShell):

python -m canopyrs.tools.detection.find_optimal_raster_nms `
  -c canopyrs/config/detectors/dino_swinL_multi_NQOS.yaml `
  -d SelvaBox Detectree2 `
  -r <DATA_ROOT> `
  -o <OUTPUT_PATH> `
  --n_workers 6 `
  --eval_iou_threshold 0.75

Example: Finding NMS parameters for RF1₅₀:₉₅

Linux/macOS (bash):

python -m canopyrs.tools.detection.find_optimal_raster_nms \
  -c canopyrs/config/detectors/dino_swinL_multi_NQOS.yaml \
  -d SelvaBox Detectree2 \
  -r <DATA_ROOT> \
  -o <OUTPUT_PATH> \
  --n_workers 6 \
  --eval_iou_threshold 50:95

Windows (PowerShell):

python -m canopyrs.tools.detection.find_optimal_raster_nms `
  -c canopyrs/config/detectors/dino_swinL_multi_NQOS.yaml `
  -d SelvaBox Detectree2 `
  -r <DATA_ROOT> `
  -o <OUTPUT_PATH> `
  --n_workers 6 `
  --eval_iou_threshold 50:95

Performance notes

Depending on how many rasters are in the selected datasets, the search can take anywhere from tens of minutes to a few hours. If you have many CPU cores, we recommend increasing --n_workers.

For more information on parameters:

python -m canopyrs.tools.detection.find_optimal_raster_nms --help

Benchmarking

To benchmark a model on the test or valid split of one or more datasets, use the benchmark.py tool script.

This script runs the model and evaluates results using tile-level COCO metrics (mAP and mAR).
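For reference, COCO detection metrics of this kind are conventionally computed with pycocotools. A minimal sketch with hypothetical file paths (CanopyRS runs its evaluation internally; this only illustrates what the reported metrics are):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("tile_annotations.json")        # hypothetical COCO ground-truth file
dt = gt.loadRes("tile_predictions.json")  # hypothetical COCO predictions file
ev = COCOeval(gt, dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()  # prints the standard mAP / mAR table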

Raster-level evaluation (RF1)

To run raster-level evaluation (RF1) in addition to tile-level, you must pass values for --nms_threshold and --score_threshold. To find these parameter values, run find_optimal_raster_nms.py on the validation set of one (or more) datasets, as described above.

The benchmark will then run a single raster-level aggregation with those values and report RF1 at the chosen IoU setting.

Important: Use consistent IoU thresholds

Always use the same --eval_iou_threshold value when finding NMS parameters and when running the final benchmark. If you optimize NMS for RF1₇₅ but benchmark with RF1₅₀:₉₅, your NMS parameters will not be optimal for that metric.

Example: Benchmarking with RF1₅₀:₉₅

To benchmark the DINO Swin-L multi-NQOS detector on the test set of SelvaBox and Detectree2:

Linux/macOS (bash):

python -m canopyrs.tools.detection.benchmark \
  -c canopyrs/config/detectors/dino_swinL_multi_NQOS.yaml \
  -d SelvaBox Detectree2 \
  -r <DATA_ROOT> \
  -o <OUTPUT_PATH> \
  --nms_threshold 0.7 \
  --score_threshold 0.5 \
  --eval_iou_threshold 50:95

Windows (PowerShell):

python -m canopyrs.tools.detection.benchmark `
  -c canopyrs/config/detectors/dino_swinL_multi_NQOS.yaml `
  -d SelvaBox Detectree2 `
  -r <DATA_ROOT> `
  -o <OUTPUT_PATH> `
  --nms_threshold 0.7 `
  --score_threshold 0.5 `
  --eval_iou_threshold 50:95

By default, evaluation is done on the test set.

For more information on parameters:

python -m canopyrs.tools.detection.benchmark --help

Programmatic usage

You can also use the benchmarker classes directly in Python for more control.

Detector example

from canopyrs.engine.benchmark import DetectorBenchmarker
from canopyrs.engine.config_parsers import DetectorConfig, AggregatorConfig

detector_config = DetectorConfig.from_yaml("canopyrs/config/detectors/dino_swinL_multi_NQOS.yaml")

# COCO-style IoU sweep for RF1_50:95
eval_ious = [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]

# Detector-only: all confidence comes from the detector
aggregator_base = AggregatorConfig(
    nms_algorithm="iou",
    detector_score_weight=1.0,
    segmenter_score_weight=0.0,
)

# Step 1: Find optimal NMS parameters on validation set
valid_benchmarker = DetectorBenchmarker(
    output_folder="./output/benchmark/valid",
    fold_name="valid",
    raw_data_root="/data/canopyrs",
    eval_iou_threshold=eval_ious,
)

best_aggregator = valid_benchmarker.find_optimal_nms_iou_threshold(
    detector_config=detector_config,
    base_aggregator_config=aggregator_base,
    dataset_names=["SelvaBox", "Detectree2"],
    nms_iou_thresholds=[i / 20 for i in range(1, 21)],    # grid of candidate NMS IoU thresholds to search
    nms_score_thresholds=[i / 20 for i in range(1, 21)],  # grid of candidate NMS score thresholds to search
    n_workers=6,
)
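
# (Illustrative) Log the selected parameters before the test run. This assumes
# AggregatorConfig exposes the optimized nms_iou_threshold and nms_score_threshold
# fields; adjust if the actual attribute names differ.
print(f"Selected NMS IoU threshold:   {best_aggregator.nms_iou_threshold}")
print(f"Selected NMS score threshold: {best_aggregator.nms_score_threshold}")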

# Step 2: Benchmark on test set using optimal NMS parameters
test_benchmarker = DetectorBenchmarker(
    output_folder="./output/benchmark/test",
    fold_name="test",
    raw_data_root="/data/canopyrs",
    eval_iou_threshold=eval_ious,
)

tile_metrics, raster_metrics = test_benchmarker.benchmark(
    detector_config=detector_config,
    aggregator_config=best_aggregator,
    dataset_names=["SelvaBox", "Detectree2"],
)

Segmenter example (end-to-end)

from canopyrs.engine.benchmark import SegmenterBenchmarker
from canopyrs.engine.config_parsers import SegmenterConfig, AggregatorConfig

segmenter_config = SegmenterConfig.from_yaml("canopyrs/config/segmenters/mask2former_swinL_multi_selvamask.yaml")

# COCO-style IoU sweep for RF1_50:95
eval_ious = [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]

# End-to-end segmenter: all confidence comes from the segmenter
aggregator_base = AggregatorConfig(
    nms_algorithm="iou",
    detector_score_weight=0.0,
    segmenter_score_weight=1.0,
)

# Step 1: Find optimal NMS parameters on validation set
valid_benchmarker = SegmenterBenchmarker(
    output_folder="./output/benchmark/valid",
    fold_name="valid",
    raw_data_root="/data/canopyrs",
    eval_iou_threshold=eval_ious,
)

best_aggregator = valid_benchmarker.find_optimal_nms_iou_threshold(
    segmenter_config=segmenter_config,
    base_aggregator_config=aggregator_base,
    dataset_names=["SelvaMask"],
    nms_iou_thresholds=[i / 20 for i in range(1, 21)],
    nms_score_thresholds=[i / 20 for i in range(1, 21)],
    n_workers=6,
)

# Step 2: Benchmark on test set using optimal NMS parameters
test_benchmarker = SegmenterBenchmarker(
    output_folder="./output/benchmark/test",
    fold_name="test",
    raw_data_root="/data/canopyrs",
    eval_iou_threshold=eval_ious,
)

tile_metrics, raster_metrics = test_benchmarker.benchmark(
    segmenter_config=segmenter_config,
    aggregator_config=best_aggregator,
    dataset_names=["SelvaMask"],
)

Segmenter example (detector + prompted SAM3)

from canopyrs.engine.benchmark import SegmenterBenchmarker
from canopyrs.engine.config_parsers import DetectorConfig, SegmenterConfig, AggregatorConfig

detector_config = DetectorConfig.from_yaml("canopyrs/config/detectors/dino_swinL_multi_NQOS_selvamask_FT.yaml")
segmenter_config = SegmenterConfig.from_yaml("canopyrs/config/segmenters/sam3_multi_selvamask_FT.yaml")

# COCO-style IoU sweep for RF1_50:95
eval_ious = [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]

# Prompted SAM3: final score is a weighted blend of detector and segmenter confidence.
# Equal weights (0.5/0.5) is a reasonable default; rebalance if one model is more reliable.
aggregator_base = AggregatorConfig(
    nms_algorithm="ioa-disambiguate",
    detector_score_weight=0.5,
    segmenter_score_weight=0.5,
)
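
# (Illustrative) With equal weights, a candidate scored 0.9 by the detector and
# 0.7 by the segmenter receives a blended confidence of 0.5 * 0.9 + 0.5 * 0.7 = 0.8,
# assuming a simple weighted average of the two scores.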

# Step 1: Find optimal NMS parameters on validation set
valid_benchmarker = SegmenterBenchmarker(
    output_folder="./output/benchmark/valid",
    fold_name="valid",
    raw_data_root="/data/canopyrs",
    eval_iou_threshold=eval_ious,
)

best_aggregator = valid_benchmarker.find_optimal_nms_iou_threshold(
    segmenter_config=segmenter_config,
    prompter_detector_config=detector_config,
    base_aggregator_config=aggregator_base,
    dataset_names=["SelvaMask"],
    nms_iou_thresholds=[i / 20 for i in range(1, 21)],    # grid of candidate NMS IoU thresholds to search
    nms_score_thresholds=[i / 20 for i in range(1, 21)],  # grid of candidate NMS score thresholds to search
    n_workers=6,
)

# Step 2: Benchmark on test set
test_benchmarker = SegmenterBenchmarker(
    output_folder="./output/benchmark/test",
    fold_name="test",
    raw_data_root="/data/canopyrs",
    eval_iou_threshold=eval_ious,
)

tile_metrics, raster_metrics = test_benchmarker.benchmark(
    segmenter_config=segmenter_config,
    prompter_detector_config=detector_config,
    aggregator_config=best_aggregator,
    dataset_names=["SelvaMask"],
)

Aggregating results across seeds

If you ran multiple training seeds, you can compute mean/std across runs for both tile-level and raster-level metrics, then merge them into a single table:

from canopyrs.engine.benchmark import DetectorBenchmarker

# Assuming you collected per-seed results:
# tile_metrics_list = [tile_metrics_seed1, tile_metrics_seed2, tile_metrics_seed3]
# raster_metrics_list = [raster_metrics_seed1, raster_metrics_seed2, raster_metrics_seed3]

summary_tile = DetectorBenchmarker.compute_mean_std_metric_tables(
    tile_metrics_list,
    output_csv="./output/benchmark/tile_summary.csv",
)

summary_raster = DetectorBenchmarker.compute_mean_std_metric_tables(
    raster_metrics_list,
    output_csv="./output/benchmark/raster_summary.csv",
)

# Merge tile and raster summaries into one table
combined = DetectorBenchmarker.merge_tile_and_raster_summaries(
    summary_tile,
    summary_raster,
    output_csv="./output/benchmark/combined_summary.csv",
    tile_prefix="tile",
    raster_prefix="raster",
)
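
If you benchmark each seed in a loop, the per-seed lists can be collected directly. A minimal sketch that reuses the test benchmarker and aggregator from the detector example above; the per-seed config paths are hypothetical:

from canopyrs.engine.config_parsers import DetectorConfig

tile_metrics_list, raster_metrics_list = [], []
for seed in (1, 2, 3):
    # Hypothetical per-seed checkpoints; substitute your own config paths.
    config = DetectorConfig.from_yaml(
        f"canopyrs/config/detectors/dino_swinL_multi_NQOS_seed{seed}.yaml"
    )
    tile_metrics, raster_metrics = test_benchmarker.benchmark(
        detector_config=config,
        aggregator_config=best_aggregator,
        dataset_names=["SelvaBox", "Detectree2"],
    )
    tile_metrics_list.append(tile_metrics)
    raster_metrics_list.append(raster_metrics)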