INDEX
    Explanations
    Negative Logits
    y               -0.75
                    -0.68
    er              -0.68
     allow          -0.64
    allow           -0.62
     allowed        -0.59
    en              -0.57
    ®               -0.57
                    -0.57
                    -0.55
    Positive Logits
     Efq            1.29
     pleaſure       1.20
     himſelf        1.18
     myſelf         1.14
     Jefus          1.12
     Eſ             1.10
     itſelf         1.05
     Anſ            1.05
    extAlignment    1.05
     Diſ            1.04
    Activation Density: 4.128%

    No Known Activations