INDEX
    Explanations

    parentheses and asterisks

    Negative Logits
    Token         Value
    }`,           1.00
    \",           0.97
    :",           0.97
    ...",         0.96
    ;",           0.93
     \...         0.92
    $",           0.92
    (not shown)   0.91
    ",            0.91
     эмне         0.89
    Positive Logits
    Token            Value
    ↵↵↵↵             1.90
    ↵↵↵↵↵            1.78
     Note            1.76
    ↵↵↵↵↵↵↵          1.67
    ↵↵↵              1.66
    ↵↵↵↵↵↵           1.66
     NOTE            1.58
    ↵↵↵↵↵↵↵↵↵        1.46
     Interestingly   1.43
     Bonus           1.42
    Activation Density 1.273%

    No Known Activations