INDEX
    Explanations

    terms and phrases related to deception or manipulation

    Negative Logits
     adapta        -0.72
     inspira       -0.67
     сделали       -0.65
     нашли         -0.63
     делают        -0.62
     orienta       -0.62
     coinciden     -0.62
     representa    -0.61
     interpreta    -0.61
     combina       -0.60
    Positive Logits
     poffe          0.85
     raiſ           0.80
    MethodManager   0.74
     atsi           0.73
     deſt           0.72
    AndEndTag       0.70
    %";             0.69
    cknow           0.68
    etzal           0.68
     herum          0.67
    Activation Density: 1.154%

    No Known Activations