INDEX
    Explanations

    instances of claims or statements related to societal issues

    Negative Logits
    Token                      Logit
    U+0306 (combining breve)   -0.16
    "elif"                     -0.15
    "(strtolower"              -0.14
    " espec"                   -0.14
    "ラック"                    -0.14
    "لیس"                      -0.13
    "ADM"                      -0.13
    "uden"                     -0.13
    "程度"                      -0.13
    "eps"                      -0.12
    Positive Logits
    Token         Logit
    " means"      0.81
    " Means"      0.72
    "means"       0.68
    "Means"       0.65
    " meaning"    0.63
    "meaning"     0.59
    " mean"       0.55
    "意味"         0.53
    "Mean"        0.52
    " Meaning"    0.52
    Act Density 0.285%

    No Known Activations