    Negative Logits
    Token            Logit
    "ZH"             -0.07
    "스트"           -0.07
    " Assertions"    -0.07
    " Nd"            -0.07
    "드리"           -0.07
    "Born"           -0.07
    "лок"            -0.07
    "WITHOUT"        -0.06
    "lical"          -0.06
    "Alignment"      -0.06

    Positive Logits
    Token            Logit
    "lvl"            0.06
    "	event"         0.06
    "-produ"         0.06
    " θέ"            0.06
    "‌شوند"           0.05
    " inflamm"       0.05
    "567"            0.05
    " України"       0.05
    "иль"            0.05
    "ैसल"            0.05

    Activation Density: 0.018%

    No Known Activations