INDEX
    Explanations
    NEGATIVE LOGITS
    Token            Logit
    " Clifford"      -0.07
    "(prog"          -0.07
    "ưng"            -0.07
    "*."             -0.07
    " Four"          -0.06
    " Kem"           -0.06
                     -0.06
    " militar"       -0.06
    "Flow"           -0.06
    " liberation"    -0.06
    POSITIVE LOGITS
    Token            Logit
    " honest"        0.19
    " honesty"       0.14
    " Honest"        0.11
    " honestly"      0.09
    " truthful"      0.08
    " dishonest"     0.08
    "onest"          0.07
    " Lös"           0.07
    "Honestly"       0.07
    " HK"            0.07
    Activation Density: 0.006%

    No Known Activations