    NEGATIVE LOGITS
    Token                 Logit
    ' Behaviors'          -1.25
    'Behavioral'          -1.09
    'behavioral'          -1.09
    ' behaviors'          -1.09
    ' Behavioral'         -1.02
    ' Behavioural'        -1.02
    ' AssemblyCulture'    -1.00
    ' behaviours'         -0.96
    'behaviors'           -0.95
    'haviours'            -0.93
    POSITIVE LOGITS
    Token                 Logit
    'ing'                 0.70
    'an'                  0.65
    ' the'                0.61
    'ed'                  0.55
    ' lining'             0.51
    'er'                  0.49
    'пад'                 0.47
    ':");\n'              0.43
    ' Ass'                0.43
    ' effect'             0.43
    Activation Density: 0.232%

    No Known Activations