INDEX
    Explanations

    phrases starting with "Our"

    instances of the word "Our."

    Negative Logits
    CENT                -0.66
     ``(                -0.66
    liest               -0.64
    ambers              -0.64
     LSD                -0.63
    quote               -0.61
     externalToEVAOnly  -0.61
                        -0.61
    cum                 -0.60
    conom               -0.60
    Positive Logits
    selves              1.44
     own                0.99
    ¥ŀ                  0.98
    self                0.93
    anmar               0.92
     adversary          0.83
    cyclopedia          0.83
     adversaries        0.82
     ourselves          0.78
     fearless           0.77
    Act Density 0.049%

    No Known Activations