INDEX
Explanations
claims related to political discourse and their credibility
Negative Logits
noDo            -0.51
oplayer         -0.43
imitating       -0.36
חוש             -0.35
Photocase       -0.35
Πηγές           -0.35
Anhalt          -0.34
inconspicuous   -0.34
tomation        -0.34
ilingual        -0.34
Positive Logits
fiction         0.79
fabrication     0.71
unsub           0.71
fantasy         0.71
fanciful        0.70
fic             0.66
unsupported     0.65
base            0.65
fabricated      0.65
fiction         0.63
Activations Density: 0.920%