INDEX
    Explanations

    phrases related to user rights and content moderation

    Negative Logits
    fjspx           -0.67
                    -0.44
    OGND            -0.43
     Pick           -0.40
     Dil            -0.40
                    -0.40
     nakalista      -0.39
    COMPAR          -0.38
     kend           -0.38
    тельству        -0.38
    Positive Logits
     alebo          0.47
     ogrodow        0.47
     typelib        0.47
     zupeł          0.47
     singola        0.45
     Komunikasi     0.44
     individuale    0.44
     or             0.44
     seduta         0.43
     creș           0.43
    Activation Density: 0.024%

    No Known Activations