INDEX
    Explanations

    mathematical expressions or formulas

    Negative Logits
    [toxicity=0]
    -0.81
    -0.78
     "
    -0.78
     }
    -0.77
    -0.73
    "
    -0.72
    -
    -0.72
      
    -0.72
    _
    -0.72
    <eos>
    -0.70
    Positive Logits
     myſelf     1.45
    ſelves      1.37
     itſelf     1.34
     Theſe      1.32
     Anſ        1.30
     himſelf    1.28
     Monfieur   1.23
     uſed       1.23
     raiſ       1.16
     ſind       1.16
    Act Density 0.567%

    No Known Activations