INDEX
    Explanations

    phrases that label people or concepts as 'fake', 'dangerous', or otherwise derogatory, often using quotation marks to emphasize these characterizations

    Negative Logits
     HasFactory
    -0.66
    دانشنامهٔ
    -0.62
     ſtand
    -0.58
     autorytatywna
    -0.58
     raiſ
    -0.56
     nahilalakip
    -0.52
     houſe
    -0.51
    -0.50
    曖昧さ回避
    -0.50
    httphttps
    -0.50
    Positive Logits
    dcterms
    0.44
    fluoro
    0.44
    RestTemplate
    0.41
    賀状
    0.36
    ">:
    0.36
     gefü
    0.35
    Коммента
    0.35
     epit
    0.35
    unoz
    0.34
     Kün
    0.34
    Act Density 0.605%

    No Known Activations