    Explanations

    phrases indicating skepticism or criticism towards institutional practices and beliefs

    Negative Logits
     (       -0.98
             -0.67
     الحره   -0.67
    .(       -0.63
    </h4>    -0.61
    .*")]    -0.57
    .        -0.57
     (       -0.54
    。(      -0.54
    Hentet   -0.52
    Positive Logits
    ?),      1.65
    ?).      1.63
    !),      1.60
    !).      1.56
    ),”      1.44
    ).</     1.41
    !)       1.39
    ?)       1.38
    )”.      1.38
    )".      1.33
    Activation Density: 1.044%

    No Known Activations