© 2023 The Author(s)

Assessing the classification performance of ordinal classifiers is a challenging problem under imbalanced data compositions. Because the choice of evaluation metric critically affects the choice of classifier, employing the most reliable metric is crucial. Although Cohen's kappa is widely used for performance assessment, other agreement measures perform better under certain configurations of ordinal confusion matrices. This research implements weighted agreement measures as evaluation metrics for ordinal classifiers. The applicability of agreement measures and mainstream performance metrics to various fields of practice under challenging data compositions is assessed, and the sensitivity of the metrics in detecting subtle distinctions between ordinal classifiers is analyzed. Five kappa-like agreement measures with six weighting schemes are employed as evaluation metrics, and their reliability and usefulness are compared with mainstream and recently proposed metrics, including the F1 score, the Matthews correlation coefficient, and informational agreement. The performance of 37 metrics is analyzed in two extensive numerical studies covering synthetic confusion matrices and real datasets. Promising metrics under practical circumstances are identified, and recommendations are made about the best metric for evaluating ordinal classifiers under different conditions. Overall, the weighted Scott's pi measure is found to be useful, sensitive to small differences in classification performance, and reliable under general conditions.
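To make the family of metrics concrete, the following is a minimal sketch of linearly weighted, chance-corrected agreement computed from an ordinal confusion matrix. It covers two of the measures named in the abstract, weighted Cohen's kappa and weighted Scott's pi; the function name, the choice of linear weights, and the restriction to these two measures are illustrative assumptions, not the paper's full implementation (which covers five measures and six weighting schemes).

```python
import numpy as np

def weighted_agreement(cm, kind="cohen"):
    """Linearly weighted chance-corrected agreement from a square confusion matrix.

    kind="cohen" -> weighted Cohen's kappa: expected disagreement is computed
                    from the product of the row and column marginals.
    kind="scott" -> weighted Scott's pi: expected disagreement is computed
                    from the averaged row/column marginals.
    Both equal 1 for perfect agreement and 0 for chance-level agreement.
    """
    cm = np.asarray(cm, dtype=float)
    k = cm.shape[0]
    p = cm / cm.sum()                     # observed joint proportions
    i, j = np.indices((k, k))
    w = np.abs(i - j) / (k - 1)           # linear disagreement weights
    row, col = p.sum(axis=1), p.sum(axis=0)
    if kind == "cohen":
        e = np.outer(row, col)            # chance agreement, Cohen-style
    else:
        m = (row + col) / 2               # averaged marginals, Scott-style
        e = np.outer(m, m)
    # 1 - (observed weighted disagreement / expected weighted disagreement)
    return 1 - (w * p).sum() / (w * e).sum()
```

On a purely diagonal confusion matrix both measures return 1, and on a symmetric matrix the two coincide because the row and column marginals are identical; they diverge exactly when the classifier's marginal distribution differs from the true one, which is where the choice of measure matters.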