Explanations
key words related to potential action or capability
Negative Logits
Token       Logit
('          -0.15
igh         -0.15
instinct    -0.14
(           -0.14
pu          -0.14
damned      -0.14
ication     -0.13
ÌĢ          -0.13
(@          -0.13
‘           -0.13
Positive Logits
Token       Logit
bic         0.15
chal        0.15
aroo        0.15
<Test       0.14
hud         0.14
ä¿¡         0.14
_rng        0.14
grily       0.13
ourg        0.13
ondere      0.13
Activations Density: 0.000%