INDEX
    Explanations

    instances of the start of a document or significant breakpoints in text

    Negative Logits
     autorytatywna    -1.37
    IUrlHelper        -1.16
    Autoritní         -1.08
     mergeFrom        -1.04
    rungsseite        -0.98
    afficheront       -0.97
     مرئيه            -0.95
     Савезне          -0.94
     newBuilder       -0.94
     Biôgrafia        -0.93
    Positive Logits
    ↵↵                1.02
                      0.83
    ↵↵↵               0.69
                      0.62
    ↵↵↵↵              0.62
    [toxicity=0]      0.53
    <eos>             0.51
    .                 0.50
    ↵↵↵↵↵↵            0.47
     parem            0.45
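
    Token/logit lists like the ones above are commonly produced by projecting a feature's output direction through the model's unembedding matrix and ranking tokens by the resulting score. The source does not say how these values were computed, so the following is only a minimal sketch under that assumption. The function name, the toy vectors, and the scores are all hypothetical and are not the values shown in this dashboard.

```python
def top_logit_effects(decoder_direction, unembedding, vocab, k=2):
    """Rank tokens by the dot product of a feature direction with each
    token's unembedding row. Returns (most negative, most positive)."""
    scores = []
    for token, row in zip(vocab, unembedding):
        score = sum(d * w for d, w in zip(decoder_direction, row))
        scores.append((token, score))
    ranked = sorted(scores, key=lambda pair: pair[1])
    negative = ranked[:k]          # lowest scores first
    positive = ranked[::-1][:k]    # highest scores first
    return negative, positive


# Toy, made-up numbers purely for illustration (2-dimensional model).
vocab = ["↵↵", ".", " mergeFrom", " newBuilder"]
unembedding = [
    [1.0, 0.2],
    [0.5, -0.1],
    [-1.0, 0.3],
    [-0.9, 0.0],
]
direction = [1.0, 0.0]

negative, positive = top_logit_effects(direction, unembedding, vocab)
print("negative:", negative)
print("positive:", positive)
```

    In this toy setup the paragraph-break token ranks highest and the code identifiers rank lowest, which mirrors the shape of the lists above without reproducing their values.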
    Act Density 0.149%

    No Known Activations
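
    The "Act Density" figure is usually read as the fraction of token positions on which the feature is active. Assuming that definition (the source does not state it), 0.149% corresponds to roughly 1.5 active positions per 1,000 tokens. A minimal sketch follows; `activation_density`, the threshold, and the sample data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions where the activation exceeds threshold.

    Assumes 'active' means strictly greater than the threshold; an empty
    input is treated as density 0.0.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 3 non-zero activations among 2000 token positions.
sample = [0.0] * 1997 + [0.4, 1.2, 0.7]
density = activation_density(sample)
print(f"{density:.3%}")
```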