INDEX
    Explanations

    phrases that express comparison or similarity

    Negative Logits
    iman
    -0.15
    堂
    -0.15
     aktu
    -0.15
    _SAFE
    -0.14
    obia
    -0.14
    iah
    -0.14
    函
    -0.14
    either
    -0.13
    ter
    -0.13
    nt
    -0.13
    Positive Logits
     many
    0.27
     other
    0.23
     any
    0.23
     most
    0.22
    many
    0.21
    许多
    0.21
     elsewhere
    0.21
    其他
    0.17
    任何
    0.17
     everywhere
    0.17
    Act Density 0.051%
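The density figure above can be read as the share of token positions on which the feature fires. The sketch below assumes that definition (activation strictly above a threshold, counted over all sampled tokens); the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means (number of active positions) / (total positions).
    """
    values = list(activations)
    if not values:
        return 0.0
    active = sum(1 for a in values if a > threshold)
    return active / len(values)


# Hypothetical data: 10,000 token positions, 5 of which activate the feature.
sample = [0.0] * 9995 + [1.2, 0.4, 3.1, 0.9, 2.2]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.051% would mean the feature is active on roughly 5 of every 10,000 tokens.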

    No Known Activations