    Explanations

    phrases or terms related to different types of characters or personal identities

    Negative Logits
    ъ        -0.24
    ья       -0.24
    으로     -0.21
    i        -0.20
    Ь        -0.20
    ью       -0.20
    ам       -0.19
    ами      -0.19
    ом       -0.18
    ье       -0.18
    Positive Logits
    нка          0.31
    й            0.29
    нд           0.27
    ◌́ (U+0301)   0.27
    нки          0.27
    м            0.26
    н            0.26
    нг           0.25
    ль           0.24
    нт           0.23
    Act Density 0.041%

    No Known Activations