Explanations
negations or expressions of denial
Negative Logits
rape      -0.15
bee       -0.15
anych     -0.15
nist      -0.15
encers    -0.15
uche      -0.14
ikh       -0.14
ième      -0.14
itters    -0.14
cul       -0.14
Positive Logits
matter    0.53
matter    0.41
Matter    0.36
doubt     0.33
wonder    0.32
amount    0.27
sooner    0.24
offense   0.22
mater     0.22
Doub      0.22
Activations Density 0.040%