INDEX
    Explanations

    phrases indicating explanation or justification

    expressions indicating justification or explanation

    Negative Logits
    Token           Logit
    "Blue"          -0.62
    "Dragon"        -0.60
    "ModLoader"     -0.60
    "TR"            -0.59
    "rab"           -0.58
    "Enough"        -0.57
    "cause"         -0.57
    "Tokens"        -0.57
    " procedural"   -0.55
    "VERSION"       -0.55
    Positive Logits
    Token           Logit
    "SPONSORED"     0.75
    " alone"        0.72
    "士"            0.71
    "idents"        0.66
    "phas"          0.65
    "zik"           0.63
    " we"           0.60
    "gha"           0.60
    "akedown"       0.59
    " contrasts"    0.59
    Activation Density: 0.080%

    No Known Activations