Explanations
negative judgment and insults
Negative Logits
Token         Value
unexpected    0.45
ảo            0.42
Lx            0.42
✨            0.42
is            0.41
negative      0.41
arci          0.41
有问题        0.41
COX           0.41
ALWAYS        0.41
Positive Logits
Token         Value
stupid        0.64
disgusting    0.59
stupidity     0.56
🤮            0.56
plut          0.55
insults       0.55
vulgar        0.54
lousy         0.54
まとめ        0.53
inept         0.53
Activations Density 0.105%
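
The page does not say how the activation density is computed. A common convention, assumed here, is the fraction of sampled token positions on which the feature's activation is above zero; under that reading, 0.105% means the feature fires on roughly one token in 950. The function name `activation_density`, the `threshold` parameter, and the sample data below are hypothetical and serve only as a minimal sketch of that convention.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature is active. The source page does not confirm this definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 2000 token positions, the feature fires on 2 of them.
sample = [0.0] * 1998 + [1.7, 0.3]
density = activation_density(sample)
print(f"{density:.3%}")
```

A sparse feature like this one is expected to have a very low density, since it should respond only to a narrow class of inputs (here, insults and negative judgments).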