INDEX
Explanations
phrases relating to safety and priorities in various contexts
Negative Logits
lg             -0.16
ignet          -0.15
Grace          -0.14
usercontent    -0.14
ond            -0.14
grace          -0.14
abar           -0.14
γά             -0.13
cel            -0.13
porto          -0.13
Positive Logits
priority       0.46
priorities     0.41
priority       0.39
Priority       0.38
Priority       0.38
priorit        0.31
prioritize     0.31
_priority      0.30
.priority      0.29
(priority      0.27
Activations Density: 0.100%