INDEX
    Explanations

    punctuation marks and certain high-frequency function words

    Negative Logits
    ãģıãĤĮ         -0.17
    arel           -0.15
    oga            -0.15
    marvin         -0.15
    urge           -0.15
    von            -0.14
    inen           -0.14
    ãĤ¤ãĥ«         -0.14
    controllers    -0.14
    績             -0.14
    Positive Logits
    änn            0.15
     пÑĢид         0.15
    omik           0.14
    oppins         0.14
    emin           0.14
    κÏģα           0.14
    udden          0.14
     زر            0.14
    erval          0.13
    è͵            0.13
    Activation Density: 0.002%

    No Known Activations