Negative Logits
safety        -0.63
safe          -0.61
Safe          -0.60
emergency     -0.60
Safety        -0.58
SAFE          -0.57
Safe          -0.56
Shar          -0.56
EnableWeb     -0.56
безопасности  -0.55
Positive Logits
défend       0.85
defended     0.79
Defend       0.78
defend       0.74
defends      0.74
défendre     0.74
DEFEND       0.72
defending    0.64
ArrowToggle  0.63
ąb           0.62
Activations Density 0.025%