    NEGATIVE LOGITS
    Token    Value
    .):      1.52
    ):       1.40
    ]:       1.34
    \":      1.34
    }$:      1.32
    .]:      1.32
    ":       1.31
    ()):     1.26
    "):      1.25
    »:       1.21
    POSITIVE LOGITS
    Token            Value
    (not shown)      3.09
    ↵↵               2.42
    ↵↵↵              2.06
    ↵↵↵↵             1.74
    ↵↵↵↵↵            1.69
     \\              1.54
    (not shown)      1.47
    </li>            1.46
    ↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵  1.45
    ↵↵↵↵↵↵↵          1.40
    Act Density 4.608%

    No Known Activations