INDEX
    Explanations

    references to the concept of hypocrisy

    Negative Logits
    Token        Logit
    rig          -0.17
    orns         -0.16
     Kaplan      -0.15
    اÙ쨹         -0.15
    礼            -0.15
    iffs         -0.15
     Lair        -0.15
    uge          -0.15
    yo           -0.14
    oons         -0.14
    Positive Logits
    Token        Logit
    .dy          0.15
    머ëĭĪ         0.14
    ody          0.14
    ικο          0.14
    .sy          0.14
     Hass        0.14
     slee        0.13
     OnTrigger   0.13
    ÄĽÅ¾         0.13
    lop          0.13
    Act Density 0.024%

    No Known Activations