INDEX
    Explanations

    phrases that indicate falsehood or misinformation

    Negative Logits
    "ito"            -0.17
    " hypoc"         -0.16
    " anders"        -0.14
    " pun"           -0.14
    "ylon"           -0.14
    " Riv"           -0.14
    "rick"           -0.14
    "illiseconds"    -0.14
    " lovers"        -0.13
    "empo"           -0.13
    Positive Logits
    " accuracy"      0.19
    "accuracy"       0.19
    "ibold"          0.18
    " accurate"      0.18
    " Accuracy"      0.17
    " accur"         0.17
    "accur"          0.17
    "وغ"             0.15
    "Accuracy"       0.15
    " reality"       0.15
    Act Density 0.206%

    No Known Activations