INDEX
    Explanations

    phrases related to reasons and causation in various contexts

    Negative Logits
    Token             Logit
    "isa"             -0.17
    "pires"           -0.16
    " Trends"         -0.14
    "kenin"           -0.14
    " Formal"         -0.14
    "tz"              -0.13
    " Surveillance"   -0.13
    " kez"            -0.13
    " Nem"            -0.13
    " Worst"          -0.13
    Positive Logits
    Token             Logit
    " concerns"       0.21
    " safety"         0.20
    " technical"      0.19
    " lack"           0.18
    " concern"        0.18
    " too"            0.16
    "Safety"          0.16
    " Safety"         0.15
    " objections"     0.15
    " saf"            0.15
    Activation Density: 0.116%

    No Known Activations