INDEX
    Explanations

    References to the word "ant" and its variations, indicating a focus on that term across different contexts.

    Negative Logits
    Token    Logit
    rl       -0.17
    rig      -0.17
    rint     -0.16
    ra       -0.16
    strup    -0.16
    ryo      -0.15
    hib      -0.15
    ront     -0.15
    ract     -0.15
    riz      -0.15
    Positive Logits
    Token    Logit
    y         0.23
    ucket     0.22
    yne       0.22
    elope     0.20
    ing       0.20
    enna      0.19
    woord     0.19
    yre       0.19
    werp      0.18
    ech       0.17
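    The page does not say how these lists were produced. A common approach in sparse-autoencoder dashboards is to project the feature's decoder direction through the model's unembedding matrix and keep the highest- and lowest-scoring tokens. The sketch below assumes that method; `direct_logit_effects`, `top_and_bottom`, and all vectors and tokens in it are hypothetical toy values, not this feature's real weights.

```python
from typing import List, Sequence, Tuple


def direct_logit_effects(decoder_dir: Sequence[float],
                         unembed: Sequence[Sequence[float]]) -> List[float]:
    """Project a decoder direction (length d_model) through an unembedding
    matrix laid out as unembed[d_model][vocab]; returns one score per token.
    Assumption: this mirrors how a dashboard might derive its logit lists."""
    vocab_size = len(unembed[0])
    return [
        sum(decoder_dir[i] * unembed[i][t] for i in range(len(decoder_dir)))
        for t in range(vocab_size)
    ]


def top_and_bottom(tokens: Sequence[str], scores: Sequence[float],
                   k: int) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k highest-scoring and k lowest-scoring (token, score) pairs."""
    ranked = sorted(zip(tokens, scores), key=lambda pair: pair[1])
    bottom = ranked[:k]        # most negative first
    top = ranked[::-1][:k]     # most positive first
    return top, bottom


# Hypothetical toy example: d_model = 2, vocabulary of 5 tokens.
tokens = ["elope", "enna", "werp", "rl", "rig"]
decoder = [1.0, 0.5]
unembed = [
    [0.3, 0.2, 0.1, -0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, -0.2],
]
scores = direct_logit_effects(decoder, unembed)
top, bottom = top_and_bottom(tokens, scores, 2)
print("top:", [t for t, _ in top])        # tokens the feature would promote
print("bottom:", [t for t, _ in bottom])  # tokens the feature would suppress
```

    Sorting once and slicing both ends keeps the two lists consistent with each other; a real dashboard working over a large vocabulary would more likely use a partial top-k selection.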
    Activation Density: 0.033%
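    An activation density of 0.033% corresponds to roughly 33 active tokens per 100,000. The sketch below assumes density means the percentage of sampled tokens on which the feature's activation exceeds a threshold (here 0.0); the page does not define the term, and `activation_density` and the sample data are hypothetical.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation is strictly above `threshold`.
    Assumption: this matches the dashboard's definition of density."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 33 active tokens out of 100,000.
sample = [0.0] * 99_967 + [1.0] * 33
density = activation_density(sample)
print(f"{density:.3f}%")
```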

    No Known Activations