INDEX
    NEGATIVE LOGITS
    Token             Logit
    "'"               -0.54
    " femininas"      -0.51
    " skydd"          -0.51
    " afectadas"      -0.47
    "Enders"          -0.47
    " volgt"          -0.47
    "works"           -0.47
    " oscura"         -0.46
    " falsas"         -0.46
    " debía"          -0.46
    POSITIVE LOGITS
    Token             Logit
    "QUENCE"          0.76
    " the"            0.73
    "Aholisi"         0.69
    "Eksterne"        0.68
    " Préférences"    0.68
    " redistribute"   0.65
    "Hauptartikel"    0.65
    " their"          0.64
    "wixt"            0.64
    " simulate"       0.63
    Activation Density: 0.080%

    No Known Activations