INDEX
    Explanations

    model refusals
    model refusal
    model speaking
    model output

    Negative Logits
    Token         Value
    " KMnO"       0.41
                  0.40
    "Occ"         0.38
    " occ"        0.36
    " khấu"       0.36
    " निशान"      0.35
    " insinu"     0.35
    "Locks"       0.35
    " vistazo"    0.34
    "રો"          0.34
    Positive Logits
    Token         Value
    " Кан"        0.39
    " Dan"        0.39
    " Edel"       0.39
                  0.38
    " weiterhin"  0.37
    " Danh"       0.37
    "ভালো"        0.37
    "anch"        0.37
    " Angelo"     0.37
    "ajax"        0.36
    Activation Density: 0.050%

    No Known Activations