INDEX
Explanations
expressions related to criticism and social accountability
Negative Logits
dv      -0.16
orage   -0.15
pron    -0.15
Blank   -0.14
bolt    -0.14
メラ    -0.14
ーテ    -0.14
Blank   -0.14
拍      -0.14
ếp      -0.14
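Byte-level BPE vocabularies in the GPT-2 style store non-ASCII tokens as strings such as ãĥ¡ãĥ© rather than as readable text. The sketch below shows one way to turn such vocabulary strings back into text. It assumes this feature's tokenizer uses the GPT-2 byte-to-character table, which the dashboard does not state. The names byte_to_char_table and decode_vocab_token are hypothetical helpers, not part of any library.

```python
def byte_to_char_table():
    """Rebuild a GPT-2 style byte -> printable-character table (assumed tokenizer)."""
    # Bytes that are already printable map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    # Every remaining byte is shifted to code points 256, 257, ... in ascending order.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


# Inverse mapping: displayed character -> original byte value.
CHAR_TO_BYTE = {c: b for b, c in byte_to_char_table().items()}


def decode_vocab_token(token):
    """Map each displayed character back to its byte, then decode the bytes as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for token in ["ãĥ¡ãĥ©", "ãĥ¼ãĥĨ", "æĭį"]:
        print(token, "->", decode_vocab_token(token))
```

Under that assumption the three strings decode to メラ, ーテ, and 拍. A token such as ếp is already readable text; its first character is not in the table, so passing it to decode_vocab_token raises a KeyError.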
Positive Logits
stop    0.25
Stop    0.24
_stop   0.24
Stop    0.23
stop    0.22
quit    0.22
-stop   0.22
STOP    0.21
_STOP   0.21
STOP    0.21
Activations Density 0.263%
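One common reading of an activation density figure is the fraction of sampled tokens on which the feature fires at all. The sketch below illustrates that reading. Both the definition and the threshold of zero are assumptions, since the dashboard does not define the metric, and the sample values are synthetic rather than taken from this feature.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds threshold (assumed definition)."""
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic data for illustration only: 100,000 tokens, 263 of them active.
sample = [0.0] * 100_000
for i in range(263):
    sample[i * 380] = 1.0

if __name__ == "__main__":
    print(f"{activation_density(sample):.3%}")
```

A density this low would mean the feature is silent on the vast majority of tokens.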