INDEX
Explanations
concerns or conflicts related to ethics, positions of authority, or potential wrongdoings
Negative Logits
adv                 -0.56
Realms              -0.54
uster               -0.54
atical              -0.53
ionics              -0.53
viz                 -0.52
innocuous           -0.52
ãĥ                  -0.51
ibles               -0.51
isSpecialOrderable  -0.50
Positive Logits
similarly   0.70
similar     0.67
coni        0.63
these       0.59
velt        0.58
sylv        0.58
meanwhile   0.56
FontSize    0.55
unaffected  0.55
this        0.54
Activations Density: 1.213%