    Negative Logits
     immor          0.38
     prudence       0.37
     empath         0.35
     mivel          0.35
     vollständ      0.35
     immoral        0.34
     veracity       0.34
     asesin         0.34
     liberté        0.33
     koska          0.33
    Positive Logits
    able            0.50
    ing             0.47
    -               0.47
    izing           0.46
    적인            0.43
    ized            0.43
    ization         0.41
    化的            0.41
    式的            0.40
    ification       0.40
    Activation Density: 0.923%

    No Known Activations