INDEX
    Explanations

    don't just blindly rely

    Negative Logits
    "没有什么"         0.46
    " purposefully"    0.41
    " intentionally"   0.40
    " precau"          0.39
    "🤨"               0.38
    " provoqu"         0.38
    " expres"          0.37
    " unmistak"        0.37
    " volut"           0.37
    " deliberately"    0.36

    Positive Logits
    " blindly"         1.21
    " rely"            1.11
    " relying"         1.07
    " Rely"            0.99
    " solely"          0.98
    " reliance"        0.90
    "reliance"         0.86
    " relied"          0.85
    " blind"           0.80
    " relies"          0.80
    Act Density 0.118%

    No Known Activations