INDEX
    Explanations

    segments related to programming syntax, particularly variable names and function definitions

    Negative Logits
    Token             Logit
    "-"               -0.81
    "[toxicity=0]"    -0.60
    " p"              -0.60
    "/"               -0.60
    " “"              -0.57
    " a"              -0.56
    "dymyr"           -0.55
    " or"             -0.55
    " b"              -0.55
    "("               -0.55
    Positive Logits
    Token             Logit
    " itſelf"         1.64
    " myſelf"         1.60
    " himſelf"        1.55
    "ſelves"          1.47
    " Anſ"            1.47
    " themſelves"     1.45
    " Reſ"            1.39
    " Conſ"           1.37
    "ſelf"            1.36
    " Majefty"        1.34
    Act Density 0.477%

    No Known Activations