INDEX
    Explanations
    Negative Logits
        \",     1.13
        .",     1.11
        .',     0.98
        :",     0.96
         \"     0.96
        \"",    0.95
        !”,     0.94
        }",     0.93
        。",     0.92
        !",     0.90
    Positive Logits
        ↵↵↵              3.28
        ↵↵↵↵             3.03
        ↵↵↵↵↵            2.70
        ↵↵↵↵↵↵           2.46
        ↵↵↵↵↵↵↵          2.45
        ↵↵↵↵↵↵↵↵↵        2.38
        ↵↵↵↵↵↵↵↵         2.23
        ↵↵↵↵↵↵↵↵↵↵       2.08
        ↵↵↵↵↵↵↵↵↵↵↵      2.08
        ↵↵↵↵↵↵↵↵↵↵↵↵↵    2.05
    Act Density 2.184%

    No Known Activations