INDEX
Explanations
concepts related to safety and security
Negative Logits
Token     Logit
Safety    -0.45
safety    -0.45
safely    -0.44
Safety    -0.42
safer     -0.39
safest    -0.36
saf       -0.35
safe      -0.35
Safe      -0.34
safe      -0.34
Positive Logits
Token     Logit
sound     0.22
Sound     0.21
Sound     0.20
Secure    0.18
Sec       0.17
sound     0.17
secure    0.17
sec       0.17
erville   0.17
SOUND     0.17
Activations Density: 0.028%