INDEX
Explanations
statements related to social or political commentary, particularly those involving moral or ethical concerns
Negative Logits
sac         -0.15
itol        -0.15
ingerprint  -0.14
ÑĢе         -0.14
nÃŃ         -0.14
_mirror     -0.14
Mor         -0.14
434         -0.13
né          -0.13
té          -0.13
Positive Logits
ÐŁÐŀ        0.17
elps        0.16
adas        0.16
.examples   0.15
éIJĺ         0.15
ìĹ¼         0.15
ibel        0.15
.Elements   0.14
èĪĴ         0.14
xis         0.14
Activations Density: 0.366%