    Explanations

    inappropriate

    Negative Logits
    Token             Logit
    " gol"            -0.06
    " Revolution"     -0.06
    "щий"             -0.06
    "show"            -0.06
    "ün"              -0.06
    ".Aggressive"     -0.06
    " attack"         -0.06
    " boarding"       -0.06
    " triggered"      -0.06
    " AppDelegate"    -0.06
    Positive Logits
    Token             Logit
    " Raz"            0.07
    "[color"          0.07
    "&&"              0.07
    "ανα"             0.07
    " inappropriate"  0.07
    "πως"             0.07
    "重複"            0.06
                      0.06
    " рег"            0.06
    "atherine"        0.06
    Act Density 0.009%

    No Known Activations