INDEX
    Explanations

    references to religious texts and figures

    Negative Logits
    Token    Logit
    ?s       -0.20
    �s       -0.19
    ÂŃs      -0.17
    Âĸ       -0.17
    ÂĹ       -0.17
    >NN      -0.16
    $s       -0.15
    ÂŃn      -0.15
    ÂŃt      -0.15
     �       -0.15
    Positive Logits
    Token    Logit
    (blank)  0.45
    '        0.44
    Ê        0.41
    ÑĬ       0.30
    (blank)  0.30
    `        0.29
    Ь        0.28
    '\       0.27
    â̲       0.26
    ÑĮ       0.26
    Act Density 0.132%

    No Known Activations