    Negative Logits
    " allegiance"       0.85
    " disbelief"        0.83
    " Thoughts"         0.75
    " thoughts"         0.73
    " appreciation"     0.69
    " pensamiento"      0.68
    " instinct"         0.68
    " understanding"    0.67
    " probably"         0.66
    " curiosity"        0.66
    Positive Logits
    " know"             0.95
    "Know"              0.91
    " Know"             0.91
    "know"              0.89
    " KNOW"             0.79
    " knows"            0.77
    "知道"              0.73
    " anses"            0.72
    "著名的"            0.71
    "Known"             0.71
    Activation Density: 0.073%

    No Known Activations