INDEX
Explanations
phrases associated with deception or inaccuracies in claims
Negative Logits
aires       -0.15
McGr        -0.15
IDS         -0.15
ctors       -0.14
cctor       -0.14
Dog         -0.14
Westbrook   -0.13
flix        -0.13
gende       -0.13
Cou         -0.13
Positive Logits
hte         0.16
hatt        0.16
Pey         0.15
assa        0.15
stal        0.14
ล           0.14
avad        0.14
URN         0.14
urn         0.14
crash       0.13
Activations Density 0.168%