INDEX
    Explanations

    physical environment, action, or harm

    Negative Logits
    ình    0.74
    8    0.72
    ”)    0.71
    ٦    0.71
    )$.    0.70
    minutes    0.70
    thirty    0.70
    versation    0.69
    prisoners    0.69
    loved    0.68
    Positive Logits
     physical    1.20
     fisik    1.05
     physically    1.02
     físico    0.98
    ع    0.96
    (whitespace token)    0.95
     fís    0.88
     física    0.80
     физи    0.77
     fysis    0.77
    Activation Density: 0.019%

    No Known Activations