    Explanations

    adjectives representing intensity or scale

    concepts related to potential outcomes or consequences

    Negative Logits
    Token          Logit
    "nar"          -0.74
    "antic"        -0.70
    "stice"        -0.70
    " Centauri"    -0.69
    "scope"        -0.67
    "Canadian"     -0.66
    "Å"            -0.63
    "ses"          -0.63
    "ashi"         -0.63
    "anc"          -0.63
    Positive Logits
    Token          Logit
    " WHEN"         1.36
    " whenever"     1.08
    " if"           1.08
    " when"         1.04
    " unless"       0.96
    " BEFORE"       0.92
    "when"          0.90
    " WHERE"        0.85
    " When"         0.84
    " AFTER"        0.84
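    Assuming each token carries a single logit score for this feature, the two lists are simply the tokens with the highest and lowest scores. A minimal sketch of that ranking, using a hypothetical subset of the scores shown above (the dictionary `scores` and function `top_and_bottom` are illustrative, not part of any dashboard API):

```python
import heapq

# Hypothetical subset of the scores listed above.
# A leading space is part of the token string, so " when" and "when" differ.
scores = {
    " WHEN": 1.36, " whenever": 1.08, " if": 1.08, " when": 1.04,
    " unless": 0.96, "nar": -0.74, "antic": -0.70, " Centauri": -0.69,
}

def top_and_bottom(scores, k):
    """Return the k highest- and k lowest-scoring (token, score) pairs."""
    top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
    bottom = heapq.nsmallest(k, scores.items(), key=lambda kv: kv[1])
    return top, bottom

top, bottom = top_and_bottom(scores, 3)
print("positive:", top)
print("negative:", bottom)
```

    Ties such as " whenever" and " if" (both 1.08) keep their input order, because `heapq.nlargest` behaves like a stable sort.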
    Activation density: 0.177%

    No Known Activations