INDEX
Explanations
facts or claims regarding events or situations
references to legal issues or allegations involving denial and confirmation
Negative Logits
ktop          -0.85
temptation    -0.81
venge         -0.77
tempt         -0.77
pleasures     -0.74
crave         -0.73
itch          -0.72
opian         -0.71
goodbye       -0.70
patience      -0.69
Positive Logits
facts          1.16
debunked       1.12
contradicted   1.09
testified      1.05
lied           1.04
untrue         1.01
Facts          1.00
contradicts    0.97
FALSE          0.97
Fact           0.97
Activations Density: 0.749%