Explanations
phrases indicating safety and security
Negative Logits
Token         Logit
loth          -0.18
polator       -0.16
ideon         -0.15
oles          -0.15
ÑĢава         -0.15
strcasecmp    -0.15
ogne          -0.14
-piece        -0.14
atre          -0.14
aceous        -0.14
Positive Logits
Token         Logit
safe          0.31
Safe          0.30
.safe         0.28
safe          0.26
Safe          0.24
-safe         0.22
unsafe        0.22
.Safe         0.21
safely        0.21
safer         0.20
Activations Density: 0.027%
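
The density figure can be read as the share of tokens on which the feature fires. The sketch below is illustrative only: it assumes "density" means the percentage of sampled tokens with an activation above zero, which the page itself does not state. The function name `activation_density` and the synthetic activations are hypothetical and not taken from any dashboard tooling.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: density = 100 * (active tokens) / (all sampled tokens).
    """
    if len(activations) == 0:
        raise ValueError("activations must not be empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Synthetic data (hypothetical): 27 active tokens out of 100,000 sampled.
sample = [0.0] * 99_973 + [1.5] * 27
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under this reading, a density of 0.027% corresponds to roughly 27 active tokens per 100,000 sampled.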