INDEX
    Explanations

    statements that the neuron perceives to be true or accurate

    phrases asserting the truthfulness of statements

    Negative Logits
    Token        Logit
    "uled"       -0.75
    "adish"      -0.72
    "rador"      -0.71
    "acent"      -0.68
    "aida"       -0.68
    "hens"       -0.68
    "asers"      -0.68
    "ADRA"       -0.68
    "onut"       -0.66
    " Citiz"     -0.65
    Positive Logits
    Token             Logit
    " believers"      0.86
    "hood"            0.84
    " regardless"     0.78
    " believer"       0.76
    " insofar"        0.72
    " irrespective"   0.70
    " portrayal"      0.69
    "terday"          0.68
    "izable"          0.68
    " everywhere"     0.68
    Act Density 0.023%
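
For readers unfamiliar with the density figure, here is a minimal sketch of how such a number could be computed. It assumes that activation density means the fraction of sampled token positions on which the neuron's activation exceeds a threshold; the page itself does not state the definition. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly above `threshold`.

    Assumption: "density" is the share of token positions where the
    neuron fires at all. The dashboard may define it differently.
    """
    values = list(activations)
    if not values:
        return 0.0  # no samples, so report zero rather than divide by zero
    fired = sum(1 for a in values if a > threshold)
    return fired / len(values)


# Hypothetical sample: 100,000 token positions, 23 of which fire.
acts = [1.5] * 23 + [0.0] * (100_000 - 23)
print(f"{activation_density(acts):.3%}")  # prints 0.023%
```

Under this reading, a density of 0.023% would correspond to the neuron firing on roughly 23 out of every 100,000 tokens.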

    No Known Activations