Explanations
attribute assignments, key-value pairs
Negative Logits
Token        Logit
(+           0.71
*            0.69
(-)          0.69
(+)          0.69
🙏           0.66
worldview    0.65
             0.65
neurotic     0.64
overse       0.64
exh          0.63
Positive Logits
Token                 Logit
="                    1.12
='                    0.82
也是 (is also)         0.79
="$                   0.72
="${                  0.71
可以是 (can be)        0.69
设置为 (set to)        0.68
="+                   0.66
Type                  0.66
={{                   0.65
Activations Density 0.564%