INDEX
    Negative Logits
        Token         Logit
        "kefeller"    -0.79
        "undo"        -0.79
        "ype"         -0.76
        "enhagen"     -0.72
        "ypes"        -0.72
        "raltar"      -0.71
        " sidx"       -0.71
        "ozy"         -0.69
        "vernment"    -0.69
        "emetery"     -0.68
    Positive Logits
        Token         Logit
        " herself"    1.71
        "pher"        1.15
        " Louise"     1.12
        " Anne"       1.10
        " Marie"      1.09
        " husband"    1.07
        " maid"       1.04
        " hijab"      1.03
        " vagina"     1.03
        "athed"       1.03
    Activation Density: 3.575%

    No Known Activations