INDEX
    Explanations

    references to toxic substances and their effects

    Negative Logits
    Token          Logit
    "ufs"          -0.14
    "ijken"        -0.14
    "nego"         -0.14
    " herpes"      -0.14
    " sond"        -0.14
    ".bias"        -0.13
    " precious"    -0.13
    "ụ"            -0.13
    " neut"        -0.13
    "enefit"       -0.13
    Positive Logits
    Token          Logit
    " poisoning"   0.43
    " poison"      0.43
    " poisonous"   0.42
    " Poison"      0.42
    " toxic"       0.38
    " poisoned"    0.35
    " toxicity"    0.35
    "毒"           0.34
    " Toxic"       0.34
    " toxins"      0.32
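
Dashboards of this kind sometimes display non-ASCII tokens in their raw byte-level form, for example `æ¯Ĵ` for the character 毒 ("poison"). The sketch below shows one way to recover the readable text, assuming the vocabulary uses the GPT-2-style byte-to-unicode mapping; the source does not state which tokenizer is used. The helper names `bytes_to_unicode` and `decode_token` are illustrative and not part of any dashboard API.

```python
def bytes_to_unicode():
    """Build the byte -> printable-character table used by GPT-2-style
    byte-level BPE vocabularies (assumed here, not confirmed by the source)."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (control characters, space, etc.) are shifted to
    # code points starting at 256 so that every byte has a visible stand-in.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def decode_token(raw_token):
    """Convert a raw byte-level token string back to readable UTF-8 text."""
    char_to_byte = {ch: b for b, ch in bytes_to_unicode().items()}
    data = bytes(char_to_byte[ch] for ch in raw_token)
    # Tokens can end mid-character, so undecodable bytes are replaced
    # rather than raising an error.
    return data.decode("utf-8", errors="replace")


print(decode_token("æ¯Ĵ"))
print(repr(decode_token("Ġtoxic")))
```

Under this assumed mapping, the leading `Ġ` in a raw token stands for a space, which is why most tokens in the lists above begin with one.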
    Act Density 0.103%

    No Known Activations