Explanations
terms related to safety and risk assessment
Negative Logits
Token            Logit
complexContent   -0.88
="@+             -0.72
pró              -0.70
✨:               -0.69
Hooper           -0.69
__":             -0.68
MenuView         -0.67
Wiktionnaire     -0.66
chaun            -0.65
JMenu            -0.65
Positive Logits
Token            Logit
SAFE             1.50
Safe             1.50
SAFE             1.46
safe             1.44
Safe             1.38
safer            1.32
safe             1.26
SAFETY           1.26
safety           1.24
safest           1.24
Activations Density: 0.039%
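
The density figure can be read as the share of token positions on which the feature fires. The sketch below assumes that reading (activation strictly above a threshold of zero), which the page itself does not confirm. The function name `activation_density` and the sample data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions with a nonzero
    (above-threshold) activation; the dashboard may define it differently.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 4 of which activate the feature.
sample = [0.0] * 9996 + [0.7, 1.2, 0.3, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.040%
```

Under this reading, a density of 0.039% would mean the feature is active on roughly 4 of every 10,000 tokens.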