INDEX
    Explanations

    words or symbols related to strong emotional expressions or reactions

    Negative Logits
    xual          -0.74
    ngth          -0.68
    jri           -0.68
    wark          -0.65
    wagen         -0.62
     laund        -0.62
    WARD          -0.61
     Sapphire     -0.61
     Butterfly    -0.59
     Seym         -0.59
    Positive Logits
    ļ     1.29
    Ĺ     1.24
    ij     1.24
    ŀ     1.23
    Ģ     1.20
    «     1.15
    ĥ     1.15
    Ī     1.14
    Ĩ     1.12
    ĺ     1.11
    Act Density 0.002%

    No Known Activations