INDEX
    Explanations

    phrases expressing moral or ethical correctness

    Negative Logits
    Token          Logit
    "TestMethod"   -0.14
    "RefPtr"       -0.14
    "ÑĨей"         -0.14
    "nds"          -0.14
    "åŀĭ"          -0.13
    "cio"          -0.13
    "utters"       -0.13
    "ä¸ĢæŃ¥"       -0.13
    "оиÑĤ"         -0.13
    "ê´Ģ리ìŀIJ"    -0.13
    Positive Logits
    Token          Logit
    " thing"       0.98
    " things"      0.85
    " Thing"       0.80
    "thing"        0.77
    "Thing"        0.71
    " Things"      0.69
    "Things"       0.66
    "things"       0.66
    " cosas"       0.58
    " cosa"        0.57
    Act Density 0.242%

    No Known Activations