INDEX
    Explanations

    phrases related to safety measures and instructions

    instances of numerical values or quantities

    Negative Logits
     censored      -0.89
     exiled        -0.80
     drawn         -0.77
     waged         -0.77
     defending     -0.75
     committed     -0.75
     neighb        -0.74
     outraged      -0.74
     fleeing       -0.74
     dubbed        -0.74
    Positive Logits
    If              1.65
    Use             1.63
    Example         1.63
    Conclusion      1.60
    Avoid           1.58
    Important       1.57
    Tip             1.57
    Lastly          1.55
    Examples        1.55
    Finally         1.54
    Activation Density: 0.293%

    No Known Activations