INDEX
    Explanations
    No Explanations Found
    Negative Logits
    Token           Logit
    " welf"         -0.74
    " Dept"         -0.67
    " behav"        -0.65
    " Compare"      -0.64
    " Cf"           -0.64
    "quartered"     -0.64
    " regul"        -0.63
    "galitarian"    -0.61
    " helic"        -0.60
    " è£ıè"         -0.60
    Positive Logits
    Token           Logit
    "SN"            0.74
    "NL"            0.70
    "ability"       0.68
    "ola"           0.66
    "offensive"     0.65
    "net"           0.64
    "ical"          0.63
    "livion"        0.63
    "ader"          0.63
    "atta"          0.62
    Activation Density: 0.000%

    No Known Activations

    This feature has no known activations.