    Explanations

    special characters or patterns

    patterns of content related to safety and security issues

    Negative Logits
    Token          Logit
    " illusion"    -0.76
    " footing"     -0.71
    " charm"       -0.68
    " whine"       -0.65
    " immortal"    -0.65
    " laughter"    -0.65
    "tons"         -0.64
    " wiser"       -0.63
    " ageing"      -0.62
    " empt"        -0.62
    Positive Logits
    Token                 Logit
    "Additionally"        1.03
    "Furthermore"         1.00
    "³³³³"                0.95
    "Conclusion"          0.94
    "³³³³³³³³"            0.94
    "³³³³³³³³³³³³³³³³"    0.92
    "Nevertheless"        0.90
    "Regardless"          0.89
    "Moreover"            0.88
    "Nonetheless"         0.87
    Activation Density: 0.394%

    No Known Activations