INDEX
    Explanations
    Negative Logits
    "ins"          -0.07
    "+y"           -0.06
    " Robinson"    -0.06
    " sons"        -0.06
    "Modified"     -0.06
    " aid"         -0.06
    "irling"       -0.06
    "av"           -0.06
    "oured"        -0.05
    " Norris"      -0.05
    Positive Logits
    "the"          0.12
    "-the"         0.11
    " THE"         0.09
    "/the"         0.09
    "_THE"         0.09
    "\tthe"        0.09
    "athe"         0.09
    " θε"          0.08
    "-The"         0.08
    "THE"          0.08
    Act Density 0.055%

    No Known Activations