    Explanations

    phrases indicating simplicity and goodness in various contexts

    Negative Logits
    odel
    -0.18
    wiki
    -0.13
     Zwe
    -0.13
    cop
    -0.13
    rex
    -0.13
     ボ
    -0.13
     wikipedia
    -0.13
    ッシュ
    -0.13
     Lesser
    -0.13
    ولا
    -0.13
    Positive Logits
     honest
    0.18
     understanding
    0.17
    romise
    0.17
     Honest
    0.17
    atro
    0.17
    诚
    0.16
     God
    0.15
     understand
    0.15
     Gentle
    0.15
     honesty
    0.15
    Activation Density: 0.114%

    No Known Activations