INDEX
    Explanations

    lists of items or qualities

    Negative Logits
     אני  0.40
    𝖐  0.39
    习惯  0.38
    andar  0.37
     заклад  0.37
     سمیت  0.37
    0.37
     tasmim  0.37
    학과  0.36
    地区的  0.36
    Positive Logits
     suggestion  0.41
    fairy  0.41
    Tea  0.39
     flow  0.38
     wing  0.38
    tea  0.37
     happy  0.37
     stethoscope  0.37
    ítő  0.37
    ugel  0.36
    Act Density 0.000%

    No Known Activations