INDEX
NEGATIVE LOGITS
_orientation  -0.07
Dou  -0.07
continuum  -0.07
decorated  -0.07
τολ  -0.07
َّ  -0.07
flourish  -0.06
Boulder  -0.06
دختر  -0.06
Chr  -0.06
POSITIVE LOGITS
safety  0.12
safe  0.11
Safe  0.10
safer  0.09
Safety  0.09
안전  0.08
unsafe  0.08
saf  0.08
afe  0.08
Safe  0.08
Activations Density: 0.034%