    Negative Logits
        " Temper"       -0.08
        "hältnis"       -0.08
        " square"       -0.07
                        -0.07
        " daughters"    -0.07
        " relax"        -0.07
        "temper"        -0.07
        "wür"           -0.07
        "ca"            -0.07
        " territori"    -0.07
    Positive Logits
        " selective"    0.14
        " selet"        0.12
        "Selective"     0.12
        " selectively"  0.11
        " Filtering"    0.11
        " Filter"       0.11
        ".Filter"       0.11
        "(Filter"       0.11
        "过滤"          0.10
        " filtro"       0.10
    Act Density 0.011%

    No Known Activations