Explanations
phrases that convey admonishment or demands for behavioral change
Negative Logits
.nlm        -0.19
.rf         -0.16
acen        -0.15
ÙĨاÙĨ      -0.15
Reuse       -0.15
holm        -0.15
dwarf       -0.15
inge        -0.14
arez        -0.14
Occurred    -0.14
Positive Logits
stop        0.25
quit        0.24
tough       0.23
stop        0.23
STOP        0.22
Stop        0.22
-stop       0.21
GT          0.20
Stop        0.20
_stop       0.20
Activations Density 0.211%
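
As a rough illustration of what an activation density figure measures, the sketch below computes the fraction of token positions on which a feature's activation is above zero. The function name `activation_density`, the zero threshold, and the toy activations are assumptions for illustration and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data (hypothetical): 1000 token positions, 2 of which fire.
toy = [0.0] * 998 + [1.7, 0.4]
density = activation_density(toy)
print(f"{density:.3%}")  # prints 0.200%
```

On this reading, a density of 0.211% means the feature fires on roughly 2 of every 1,000 tokens in the sampled text.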