A Benchmark and Comprehensive Study of Fallacy Detection and Classification

Chadi Helwe¹Tom Calamai²Pierre-Henri Paris¹Chloé Clavel²Fabian Suchanek¹

1: Télécom Paris     2: Inria


Performance: how do SOTA LLMs and humans fare on fallacy detection and classification?

  • Benchmark: a unification of existing public fallacy collections
  • Taxonomy: covers the existing datasets
  • Annotation Scheme: designed to tackle subjectivity

A fallacy is an erroneous or invalid way of reasoning.

“You must either support my presidential candidacy or be against America!”

This argument is a false dilemma fallacy: it wrongly assumes no other alternatives.

Why bother with fallacies?

First things first, some definitions


An argument consists of an assertion called the conclusion and one or more assertions called premises, where the premises are intended to establish the truth of the conclusion. Premises or conclusions can be implicit in an argument.


A fallacy is an argument where the premises do not entail the conclusion.

Taxonomies of Fallacies

Aristotle - Organon
Whately - Elements of Logic
Bennet - Logically Fallacious

What about taxonomies in current works on fallacy annotation, detection, and classification?

There is a lack of consistency across the different datasets:

  • different granularity
  • different coverage
  • some are hierarchical, others are not

Our taxonomy aims to systematize and classify those fallacies that are used in current works.

  • Level 0 is a binary classification
  • Level 1 groups fallacies into Aristotle’s categories:
    • Pathos (appeals to emotion),
    • Ethos (fallacies of credibility),
    • and Logos (fallacies of logic, relevance, or evidence).
  • Level 2 contains fine-grained fallacies within the broad categories of Level 1.

Accompanying the taxonomy, we provide both formal and informal definitions for each fallacy.
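As an illustration, the three levels can be represented as a nested mapping. This is only a sketch: the Level-2 fallacy names below are a small illustrative subset, not the full taxonomy.

```python
# A minimal sketch of the three-level taxonomy as a nested dict.
# Level 0 is binary (fallacy vs. no fallacy); Level 1 groups fallacies
# into Aristotle's categories; Level 2 lists fine-grained types.
# The Level-2 entries below are only a few examples, not the full list.
TAXONOMY = {
    "Pathos": ["appeal to ridicule", "appeal to fear"],  # appeals to emotion
    "Ethos": ["ad hominem"],                             # fallacies of credibility
    "Logos": ["false dilemma", "false causality",        # fallacies of logic,
              "causal oversimplification"],              # relevance, or evidence
}

def level1_of(fallacy: str) -> str:
    """Return the Level-1 category of a Level-2 fallacy name."""
    for category, fallacies in TAXONOMY.items():
        if fallacy in fallacies:
            return category
    return "no fallacy"  # Level 0: not a fallacy at all
```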

An example

“Why should I support the 2nd Amendment, do I look like a toothless hick?”

Appeal to ridicule

This fallacy occurs when an opponent's argument is portrayed as absurd or ridiculous with the intention of discrediting it.

$E_1$ claims $P$. $E_2$ makes $P$ look ridiculous by misrepresenting it as $P'$. Therefore, $\neg P$.

Disjunctive Annotation Scheme

In the last New Hampshire primary election, my favorite candidate won. Therefore, he will also win the next primary election.

Is it:

  • a false causality?
  • a causal oversimplification?

Legitimately differing opinions

  • One annotator may see implicit assertions that another annotator does not see.
    • Are you for America? Vote for me!
  • Annotators may also have different thresholds for fear or insults.
  • Different annotators have different background knowledge.
    • Use disinfectants or you will get Covid-19!

Let “$a\ b\ c\ d$” be a text where $a$, $b$, $c$, and $d$ are sentences.

$S = \{a\ b,\ d\}$, where “$a\ b$” has labels $\{l_1, l_2\}$ and “$d$” has label $\{l_3\}$.

In that case, $G = \{(a\ b,\ \{l_1, l_2\}),\ (d,\ \{l_3\})\}$

An example of prediction could be

$P = \{(a, l_1),\ (a, l_2),\ (b, l_3),\ (c, l_4),\ (d, l_1)\}$

A text is a sequence of sentences $st_1, \ldots, st_n$.

The span of a fallacy in a text is the smallest contiguous sequence of sentences that comprises the conclusion and the premises of the fallacy. If the span comprises a pronoun that refers to a premise or to the conclusion, that premise or conclusion is not included in the span.

Let $\mathcal{F}$ be the set of fallacy types and $\bot$ be a special label which means “no fallacy”. Given a text and its set of spans $S$ (possibly overlapping), then the gold standard $G$ of a text is the set of pairs of all $s \in S$, along with their respective alternative labels, represented as follows:

$G \subseteq S \times \left(\mathcal{P}(\mathcal{F}\cup\{\bot\})\setminus\{\emptyset, \{\bot\}\}\right)$

Given a text and its segmentation of (possibly overlapping) spans $S$, then a prediction $P$ of the text is a set of pairs of all $s \in S$ paired with exactly one predicted label, represented as follows:

$P \subseteq S \times (\mathcal{F}\cup\{\bot\})$
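These definitions translate directly into data structures. Below is a minimal sketch in which spans are frozensets of sentence identifiers; the label names l1–l4 and the `NO_FALLACY` marker for $\bot$ are illustrative choices, not part of the original formalization.

```python
# Gold entries carry a *set* of alternative labels; predictions carry
# exactly one label per pair.
NO_FALLACY = "⊥"

def valid_gold(G):
    """Check G ⊆ S × (P(F ∪ {⊥}) \\ {∅, {⊥}}): every gold span carries a
    non-empty label set that is not just {⊥}."""
    return all(labels and set(labels) != {NO_FALLACY} for _, labels in G)

# Running example: text "a b c d", gold spans "a b" and "d".
G = [(frozenset({"a", "b"}), {"l1", "l2"}),
     (frozenset({"d"}), {"l3"})]

# A prediction pairs each span with exactly one label; a span may appear
# in several pairs (here "a" is predicted as both l1 and l2).
P = [(frozenset({"a"}), "l1"), (frozenset({"a"}), "l2"),
     (frozenset({"b"}), "l3"), (frozenset({"c"}), "l4"),
     (frozenset({"d"}), "l1")]

assert valid_gold(G)
```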

Comparison score

$C(p, l_p, g, l_g, h) = \frac{|p \cap g|}{h} \times \delta(l_p, l_g)$

$\delta(x,y)=[x \cap y \neq \emptyset]$

$\textit{Precision}(P, G)=\frac{\sum\limits_{(p, l_p)\in P}\max\limits_{(g, l_g)\in G} C(p, l_p, g, l_g, |p|)}{|P|}$

If $|P|=0$, then $\textit{Precision}(P, G)=1$

$\textit{Recall}(P, G)=\frac{\sum\limits_{(g, l_g)\in G^-}\max\limits_{(p, l_p)\in P}C(p, l_p, g, l_g, |g|)}{|G^-|}$

If $|G^-|=0$, then $\textit{Recall}(P, G)=1$
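Putting the definitions together, here is a sketch of the comparison score and the span-level precision/recall. Two simplifying assumptions are made for this example: $G^-$ is taken to be the gold set $G$ itself, and a single predicted label is treated as a singleton set inside $\delta$.

```python
def delta(pred_label, gold_labels):
    # δ(x, y) = [x ∩ y ≠ ∅]; a single predicted label is treated as {x}
    return 1.0 if pred_label in gold_labels else 0.0

def comparison(p, lp, g, lg, h):
    # C(p, l_p, g, l_g, h) = |p ∩ g| / h · δ(l_p, l_g)
    return len(p & g) / h * delta(lp, lg)

def precision(P, G):
    if not P:  # no predictions: precision is defined as 1
        return 1.0
    return sum(max((comparison(p, lp, g, lg, len(p)) for g, lg in G),
                   default=0.0) for p, lp in P) / len(P)

def recall(P, G):
    if not G:  # empty gold standard: recall is defined as 1
        return 1.0
    return sum(max((comparison(p, lp, g, lg, len(g)) for p, lp in P),
                   default=0.0) for g, lg in G) / len(G)

# Running example: gold spans "a b" (labels {l1, l2}) and "d" ({l3});
# predicted single-sentence spans, each with one label.
G = [(frozenset({"a", "b"}), {"l1", "l2"}), (frozenset({"d"}), {"l3"})]
P = [(frozenset({"a"}), "l1"), (frozenset({"a"}), "l2"),
     (frozenset({"b"}), "l3"), (frozenset({"c"}), "l4"),
     (frozenset({"d"}), "l1")]

print(precision(P, G))  # → 0.4 (only the two predictions on "a" match)
print(recall(P, G))     # → 0.25 (each match covers half of span "a b")
```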

The MAFALDA dataset

Multi-level Annotated FALlacy DAtaset

English-language corpus of 9,745 texts:

  • News outlets: Da San Martino et al. (2019) [1]
    • γ = 0.26 $\rightarrow$ γ = 0.60
  • Reddit: Sahai et al. (2021) [2]
    • Cohen’s κ = 0.515
  • Online quizzes and climate-related texts from news outlets: Jin et al. (2022) [3]
  • American political debates: Goffredo et al. (2022) [4]
    • Krippendorff’s α = 0.5439

Disagreements are interpreted as noise, and are removed with various strategies.


Why re-annotate?

  • Lack of consistency between taxonomies
  • Annotation scheme variations
  • Variations in the way consensus is achieved

$\rightarrow$ We manually re-annotated, from scratch, 200 randomly selected texts from our merged corpus

  • Our sample mirrors the distribution of sources and original labels in the corpus
  • Neither LLMs nor crowd workers were involved (33%–46% of MTurk workers are estimated to use ChatGPT (Veselovsky et al., 2023))

Re-annotation process

The authors discussed each text together:

  1. identifying each argument in a text
  2. determining whether it is fallacious
  3. determining the span of the fallacy
  4. choosing the fallacy type(s)

Performance results

| Model               | F1 Level 0 | F1 Level 1 | F1 Level 2 |
|---------------------|------------|------------|------------|
| Baseline (random)   | 0.435      | 0.061      | 0.010      |
| LLaMA2 Chat 7B      | 0.572      | 0.114      | 0.068      |
| Mistral Instruct 7B | 0.536      | 0.144      | 0.069      |
| Mistral 7B          | 0.450      | 0.127      | 0.044      |
| Vicuna 7B           | 0.494      | 0.134      | 0.051      |
| WizardLM 7B         | 0.490      | 0.087      | 0.036      |
| Zephyr 7B           | 0.524      | 0.192      | 0.098      |
| LLaMA2 Chat 13B     | 0.549      | 0.160      | 0.096      |
| LLaMA2 13B          | 0.458      | 0.129      | 0.039      |
| Vicuna 13B          | 0.557      | 0.173      | 0.100      |
| WizardLM 13B        | 0.520      | 0.177      | 0.093      |
| GPT-3.5 175B        | 0.627      | 0.201      | 0.138      |
| Avg. Human          | 0.749      | 0.352      | 0.186      |

User annotations as gold standard

| Gold Standard | F1 Level 0 | F1 Level 1 | F1 Level 2 |
|---------------|------------|------------|------------|
| User 1        | 0.616      | 0.310      | 0.119      |
| User 2        | 0.649      | 0.304      | 0.098      |
| User 3        | 0.696      | 0.253      | 0.093      |
| User 4        | 0.649      | 0.277      | 0.144      |


Conclusion

  • MAFALDA is a unified dataset for fallacy detection and classification
  • A new taxonomy of fallacies
  • A new annotation scheme that embraces subjectivity
  • Evaluation of LLMs on zero-shot fallacy detection and classification at the span level
  • Human evaluation

Future work

  • Extension to few-shot settings
  • Advanced prompting techniques
  • Enriching the dataset
  • Model fine-tuning
  • Top-down approach
  1. Da San Martino, G., Yu, S., Barrón-Cedeño, A., Petrov, R., & Nakov, P. (2019). Fine-grained analysis of propaganda in news articles.

  2. Sahai, S., Balalau, O., & Horincar, R. (2021). Breaking down the invisible wall of informal fallacies in online discussions.

  3. Jin, Z., Lalwani, A., Vaidhya, T., Shen, X., Ding, Y., Lyu, Z., … & Schölkopf, B. (2022). Logical fallacy detection.

  4. Goffredo, P., Haddadan, S., Vorakitphan, V., Cabrio, E., & Villata, S. (2022). Fallacious argument classification in political debates.