Argument-Quality-Ranking
I evaluated how well several LLMs rank arguments by quality, and built my own RoBERTa model, which reaches accuracy similar to GPT-5.5.
I just completed the final project for my grad NLP class: an evaluation of how well LLMs can tell whether an argument is actually good or bad. Given two arguments on the same topic, the model predicts which one is higher quality. I trained and evaluated on argument pairs spanning multiple difficulty levels, and benchmarked against GPT-5.5, Llama 3, and Mistral to understand where small fine-tuned models stand relative to frontier LLMs on this task.

Results: RoBERTa v3 nearly matches GPT-5.5 on this task (0.657 vs 0.665). A 125M-parameter model fine-tuned locally is competitive with a frontier API model.

| Model | Link |
|---|---|
| RoBERTa v3 (best) | SambhavSBU/argument-quality-roberta-v3 |
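
To make the pairwise setup concrete, here is a minimal sketch of how a score like the ones above could be computed, assuming the metric is plain pairwise accuracy (the fraction of pairs where the predicted winner agrees with the label). The `Pair` layout, `pairwise_accuracy`, `longer_is_better`, and the toy pairs are all hypothetical names and data made up for illustration; they are not the project's dataset, code, or model.

```python
from typing import Callable, List, Tuple

# Hypothetical pair layout: (topic, argument_a, argument_b, label),
# where label is 0 if argument A is higher quality and 1 if argument B is.
Pair = Tuple[str, str, str, int]


def pairwise_accuracy(pairs: List[Pair], predict: Callable[[str, str, str], int]) -> float:
    """Fraction of pairs where predict(topic, a, b) equals the gold label."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for topic, a, b, label in pairs if predict(topic, a, b) == label)
    return correct / len(pairs)


def longer_is_better(topic: str, a: str, b: str) -> int:
    """Toy baseline (NOT the RoBERTa model): always pick the longer argument."""
    return 0 if len(a) >= len(b) else 1


# Made-up examples, only here so the sketch runs end to end.
toy_pairs: List[Pair] = [
    ("school uniforms", "Uniforms reduce peer pressure over clothing.", "Uniforms are bad.", 0),
    ("school uniforms", "No.", "Uniforms cost families money they may not have.", 1),
    ("remote work", "Remote work cuts commuting time, which frees hours each week.", "It is fine.", 0),
    ("remote work", "Offices are great, great, great, everyone knows it, obviously.", "Offices ease mentoring of new hires.", 1),
]

print(pairwise_accuracy(toy_pairs, longer_is_better))  # 0.75
```

Any predictor with the same signature can be swapped in for `longer_is_better`, for example a wrapper around a fine-tuned classifier or an API call to an LLM, so every model is scored on the same pairs in the same way. The last toy pair shows why a length heuristic alone falls short: the longer argument is the weaker one.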