INDEX
    Explanations

    words and phrases associated with physical actions and their consequences

    Negative Logits
    Token      Logit
    " aup"     -0.15
    "-lfs"     -0.14
    "kud"      -0.13
    "Ups"      -0.13
    "/up"      -0.13
    "/down"    -0.13
    "رسی"      -0.13
    "aub"      -0.13
    "ufs"      -0.12
    "افت"      -0.12
    Positive Logits
    Token      Logit
    " out"     1.55
    "out"      1.02
    "-out"     1.01
    "出"       0.94
    " Out"     0.90
    "(out"     0.84
    "_out"     0.83
    "Out"      0.81
    "\tout"    0.80
    " OUT"     0.79
    Activation Density: 1.323%

    No Known Activations