INDEX
Explanations
phrases related to safety measures and instructions
instances of numerical values or quantities
Negative Logits
censored     -0.89
exiled       -0.80
drawn        -0.77
waged        -0.77
defending    -0.75
committed    -0.75
neighb       -0.74
outraged     -0.74
fleeing      -0.74
dubbed       -0.74
Positive Logits
If           1.65
Use          1.63
Example      1.63
Conclusion   1.60
Avoid        1.58
Important    1.57
Tip          1.57
Lastly       1.55
Examples     1.55
Finally      1.54
Activations Density: 0.293%