INDEX
    Explanations

    academic disciplines and theories

    logical reasoning and philosophers

    Negative Logits
    I          0.88
    theta      0.61
    au         0.59
    s          0.59
    ong        0.57
    ных        0.57
    triggers   0.57
    enn        0.57
    telling    0.57
    de         0.57
    Positive Logits
    (blank)    0.65
     elementi  0.61
    ר          0.60
    ک          0.58
    (blank)    0.58
     imati     0.57
     analisi   0.56
     avevo     0.56
     edifici   0.56
     объ       0.56
    Act Density 0.802%

    No Known Activations