INDEX
    Explanations

    references to deception, illusion, and misleading narratives

    Negative Logits
    ensch       -0.16
    alic        -0.16
    WISE        -0.15
    ãĤ¥         -0.15
     ç±         -0.15
    cess        -0.15
     ØŃÙĪ       -0.14
    "default    -0.14
    Writes      -0.14
     chamber    -0.14
    Positive Logits
    rzy         0.17
    urb         0.16
     char       0.15
    orraine     0.15
    izard       0.15
    illusion    0.15
    ünst        0.14
    itzer       0.14
    chet        0.14
    elden       0.14
    Act Density 0.112%

    No Known Activations