INDEX
    Explanations

    statements that emphasize factuality or certainty

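    Negative Logits
    (Tokens are quoted; a leading space inside the quotes is part of the token.)
    Token               Logit
    " Rosetta"          -0.65
    "دين"               -0.54
    " Ohne"             -0.54
    "}`}>"              -0.54
    " unknowns"         -0.54
    " underestimated"   -0.54
    " insensible"       -0.53
    "DIS"               -0.53
    "}}],"              -0.53
    " Boa"              -0.52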
    Positive Logits
    Token               Logit
    " fact"             1.12
    " indeed"           1.07
    "indeed"            1.03
    "Indeed"            0.91
    " Indeed"           0.90
    "事实上"             0.88   (Chinese: "in fact")
    "IntoConstraints"   0.86
    "fact"              0.84
    "Bahkan"            0.82   (Indonesian: "even")
    " Bahkan"           0.81
    Activation density: 0.115%

    No Known Activations