Explanations
concepts related to safety and security
Negative Logits
Token        Logit
iesen        -0.16
enstein      -0.16
ihan         -0.16
referrer     -0.15
igor         -0.14
анти         -0.14
rof          -0.14
203          -0.14
237          -0.14
Transient    -0.13
Positive Logits
Token        Logit
safety       0.65
Safety       0.54
Safety       0.52
safe         0.47
safer        0.45
安全         0.45
afety        0.44
protection   0.42
safe         0.41
saf          0.41
Activations Density 0.164%