INDEX
    Explanations

    informal language

    Negative Logits
    " y"      -1.04
    " Y"      -0.88
    "Y"       -0.76
    "way"     -0.70
    " yi"     -0.57
    " vy"     -0.56
    "WAY"     -0.56
    " yat"    -0.55
    " ya"     -0.53
    "𝑦"       -0.53
    Positive Logits
    " Efq"              0.88
    " itſelf"           0.85
    " myſelf"           0.83
    "ſelf"              0.80
    " whofe"            0.75
    "WriteTagHelper"    0.75
    " Shakspeare"       0.74
    "ſelves"            0.73
    " doubtnut"         0.73
    " ſche"             0.71
    Activation Density: 0.109%

    No Known Activations