    Explanations

    derogatory terms and expressions related to poor behavior or attitudes

    Negative Logits
    Token (␣ = leading space)    Logit
    Portail                      -0.58
    !';                          -0.55
    %";                          -0.53
    ]');                         -0.53
    '));                         -0.52
    )');                         -0.52
    ␣Helios                      -0.52
    >';                          -0.52
    ?}",                         -0.51
    __);                         -0.50
    Positive Logits
    Token (␣ = leading space)    Logit
    ␣Brat                        2.44
    Brat                         2.28
    ␣brat                        2.03
    brat                         1.66
    ␣Frat                        0.75
    rat                          0.71
    ratt                         0.68
    brata                        0.66
    Frat                         0.66
    ␣bral                        0.63
    Activation Density: 0.003%

    No Known Activations