    Explanations

    elements related to novelty and new experiences

    Negative Logits
    Token               Logit
    "bot"               -0.14
    "気が"              -0.14
    " Anders"           -0.14
    "ека"               -0.14
    "asta"              -0.13
    "ente"              -0.13
    "014"               -0.13
    "¤í"                -0.13
    "617"               -0.13
    "Initialization"    -0.13
    Positive Logits
    Token               Logit
    " never"            0.61
    "never"             0.54
    " Never"            0.52
    "Never"             0.48
    " nunca"            0.47
    " NEVER"            0.45
    " hadn"             0.40
    " никогда"          0.40
    " haven"            0.39
    " jamais"           0.38
    Act Density 0.220%

    No Known Activations