Explanations
phrases related to intentions and morality
Negative Logits
cona       -0.20
roud       -0.16
Seks       -0.16
bable      -0.15
ellipse    -0.14
earch      -0.14
kop        -0.14
éĽij        -0.14
ese        -0.14
phere      -0.14
Positive Logits
McCart     0.15
usch       0.15
ather      0.14
.shtml     0.14
ubs        0.14
rippling   0.14
enus       0.14
402        0.14
á»ģ        0.14
羣         0.13
Activations Density: 0.247%