INDEX
Explanations
negations and expressions of denial or refusal
Negative Logits
endet    -0.15
å©Ĩ      -0.15
hurst    -0.15
stras    -0.15
onia     -0.15
ipar     -0.15
ë¡Ń      -0.14
akra     -0.14
astr     -0.14
oksen    -0.14
Positive Logits
tingham  0.18
te       0.15
ye       0.15
lus      0.15
ori      0.15
ches     0.15
rac      0.14
laz      0.14
Dess     0.14
cher     0.14
Activations Density 0.075%