    Negative Logits
     So       -0.66
     Not      -0.60
     In       -0.59
     Such     -0.59
     And      -0.56
     To       -0.55
     Or       -0.55
     Be       -0.54
     Has      -0.53
     Worse    -0.53
    Positive Logits
     can                0.99
     Offisielt          0.80
    ftagPool            0.80
     might              0.78
     ſhall              0.77
     Majefty            0.77
    tagez               0.75
     Efq                0.75
     could              0.74
    tagHelperRunner     0.74
    Activation Density: 0.026%

    No Known Activations