INDEX
    Explanations

    phrases that indicate reasoning or justification

    Negative Logits
    Token             Logit
    "allas"           -0.16
    " Josh"           -0.15
    "esto"            -0.14
    "mond"            -0.14
    "lication"        -0.14
    " Ske"            -0.14
    " Mind"           -0.14
    "uche"            -0.14
    " beef"           -0.13
    "Josh"            -0.13

    Positive Logits
    Token             Logit
    "озем"             0.16
    "ikon"             0.15
    "ihad"             0.15
    "ovny"             0.14
    "ople"             0.14
    "/Instruction"     0.14
    "apolis"           0.14
    "ãĥ³ãĥĹ"           0.14
    ".partial"         0.14
    "ssel"             0.14
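
The token ãĥ³ãĥĹ in the positive-logits list looks like the raw vocabulary form of a byte-level BPE token rather than readable text. The sketch below shows one way to decode such a string, assuming the model uses the GPT-2-style byte-to-character table; the page itself does not say which tokenizer is in use. The names `bytes_to_unicode`, `UNICODE_TO_BYTE`, and `decode_token` are illustrative, not part of any dashboard API.

```python
def bytes_to_unicode():
    """Build the byte -> printable-character table of GPT-2-style byte-level BPE.

    Assumption: the tokenizer behind this feature uses this mapping.
    """
    # Bytes that already correspond to printable characters map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte is assigned the next code point starting at 256.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, map(chr, code_points)))


# Inverse table: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Decode a raw byte-level token string into readable UTF-8 text.

    Only valid for tokens shown in raw vocabulary form; tokens that are
    already displayed as decoded text (for example Cyrillic) raise KeyError.
    """
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


print(decode_token("ãĥ³ãĥĹ"))
```

Under that assumption the token decodes to the katakana pair ンプ. Tokens that the page already shows in readable form need no such step.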
    Activation Density: 0.146%

    No Known Activations