    Explanations

    references to specific brands or products, especially in a negative context

    Negative Logits
    ÑĪÑĤ        -0.18
    ctors      -0.15
    ERO        -0.15
    ÑģÑĤиÑĤ     -0.15
    noteq      -0.15
    cloak      -0.15
    ÑģÑĤÑĢа     -0.15
    sher       -0.15
    amat       -0.14
    ξι         -0.14
    Positive Logits
     Hy        0.18
     pol       0.17
    Pol        0.17
     Pol       0.16
     hy        0.15
     hydr      0.15
    -pol       0.15
     Desired   0.14
    iesel      0.14
    ¬          0.14
    Activation Density: 0.025%

    No Known Activations