Explanations
phrases related to presenting or emphasizing a specific point or belief
expressions that indicate assertion or belief
Negative Logits
Token         Logit
ÂŃ            -0.75
ÃĤ            -0.66
withd         -0.57
Composite     -0.56
Vaugh         -0.55
(not shown)   -0.53
âĶ            -0.52
âĢł           -0.51
Azerb         -0.51
âĢ            -0.50
Positive Logits
Token         Logit
fallacy       0.85
lessly        0.73
doesnt        0.73
liness        0.70
implies       0.67
wolves        0.65
oneself       0.65
?),           0.64
hath          0.64
ndra          0.63
Activations Density 0.934%
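
The density figure can be read as the share of token positions on which the feature fires. The sketch below illustrates that reading under stated assumptions: the function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical, and the dashboard's exact definition (for example, its activation threshold or sample set) is not given in the source.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of values strictly greater than `threshold`.

    Assumption: a feature counts as "active" on a token when its
    activation exceeds the threshold (0.0 here). The source does not
    state which threshold the dashboard uses.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 1000 token positions, 9 of them active.
toy = [0.0] * 991 + [1.5] * 9
density = activation_density(toy)
print(f"{density:.3%}")
```

A density near 1% means the feature is sparse: it stays silent on the large majority of tokens.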