Explanations
instances of moral judgment or criticism of behavior
Negative Logits
stad      -0.17
iris      -0.15
INLINE    -0.15
馆        -0.14
Ore       -0.14
blade     -0.14
quine     -0.14
館        -0.14
帖        -0.14
ERE       -0.14
Positive Logits
idl       0.15
abor      0.15
شمالی     0.15
imas      0.15
ema       0.15
テル      0.14
ets       0.14
Idol      0.14
uib       0.14
Honest    0.13
Activations Density 0.035%