INDEX
Explanations
negations and dismissive phrases
Negative Logits
roker     -0.15
éº        -0.14
ÑĤÑı      -0.14
кап       -0.14
Deng      -0.13
رÙĬب      -0.13
thro      -0.13
Fury      -0.13
Prompt    -0.13
modulo    -0.13
Positive Logits
ucz       0.18
yw        0.16
abus      0.15
yg        0.14
lijah     0.14
abb       0.14
domest    0.14
-hit      0.14
wcs       0.14
Clem      0.14
Activations Density: 0.021%