INDEX
Explanations
words related to safety
references to safety in various contexts
NEGATIVE LOGITS
token      logit
ovi        -0.75
opus       -0.69
txt        -0.67
gd         -0.66
zzo        -0.65
itus       -0.64
agne       -0.64
Married    -0.64
ago        -0.63
Saharan    -0.62
POSITIVE LOGITS
token      logit
safety     3.94
safety     3.34
Safety     2.93
Safety     2.92
safer      1.77
safe       1.69
afety      1.67
saf        1.60
SAF        1.58
safest     1.54
Activations Density 0.022%