    Explanations

    statements related to politics or controversial figures

    phrases related to decision-making and consequences

    Negative Logits
     âĹ
    -0.83
     âĩ
    -0.80
    ゴン
    -0.75
     «
    -0.75
    ¶
    -0.75
    ™:
    -0.74
    âĹ
    -0.73
     âĸ
    -0.69
    ortium
    -0.69
    ワン
    -0.68
    Positive Logits
    .")
    1.70
    ,'"
    1.63
     ..."
    1.61
    !'"
    1.59
    ?'"
    1.59
    ',"
    1.55
     …"
    1.54
    .'"
    1.52
    ..."
    1.45
    )."
    1.39
    Act Density 1.014%

    No Known Activations