INDEX
    Explanations

    instances of inclusive language and community references

    Negative Logits
    Injection     -0.15
    ãģĹãĤĥ        -0.15
    ÑĨип          -0.15
    ãĥ³ãĥĩãĤ£     -0.15
    ikan          -0.14
     injected     -0.14
    pd            -0.14
    slt           -0.14
    åī¯           -0.14
    acie          -0.14
    Positive Logits
    nof           0.15
    ammo          0.14
    vre           0.14
    .Aggressive   0.14
    erti          0.14
    _Handle       0.14
    yntax         0.14
    zy            0.13
     Tah          0.13
    anger         0.13
    Act Density 0.208%

    No Known Activations