INDEX
Explanations
statements expressing moral judgments or criticisms
Negative Logits
tamp      -0.15
ÏĢι       -0.15
erap      -0.15
illez     -0.14
ohn       -0.14
artz      -0.14
airo      -0.14
[to       -0.14
oce       -0.14
OMET      -0.14
Positive Logits
IVA       0.15
ứ         0.15
utter     0.15
olarity   0.15
orp       0.15
sorry     0.14
ivos      0.14
simply    0.14
ivial     0.14
kimse     0.14
Activations Density 0.380%
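
As a rough illustration of how a density figure like this can be read, the sketch below assumes that "activations density" means the fraction of sampled token positions on which the feature's activation is above zero. That definition, the function name `activation_density`, and the sample values are all assumptions made for illustration; none of them come from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: density is counted per token position, and a position is
    "active" when its activation is strictly greater than the threshold.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 1000 token positions, 4 of which activate the feature.
sample = [0.0] * 996 + [0.7, 1.2, 0.3, 2.1]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.400%
```

Under this reading, a density of 0.380% would mean the feature is active on roughly 38 of every 10,000 sampled tokens.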