INDEX
Explanations
phrases related to safety and security
references to safety in various contexts
Negative Logits
dx        -0.72
yi        -0.69
Fiber     -0.68
ordan     -0.67
issance   -0.67
frey      -0.66
essee     -0.65
attr      -0.64
hour      -0.63
bender    -0.63
Positive Logits
safe         1.12
Safe         0.89
safe         0.84
havens       0.84
evacuation   0.80
safest       0.79
safely       0.79
safer        0.77
Haram        0.76
saf          0.76
Activations Density 0.013%