INDEX
Explanations
special characters or patterns
patterns of content related to safety and security issues
Negative Logits
illusion    -0.76
footing     -0.71
charm       -0.68
whine       -0.65
immortal    -0.65
laughter    -0.65
tons        -0.64
wiser       -0.63
ageing      -0.62
empt        -0.62
Positive Logits
Additionally      1.03
Furthermore       1.00
³³³³              0.95
Conclusion        0.94
³³³³³³³³          0.94
³³³³³³³³³³³³³³³³  0.92
Nevertheless      0.90
Regardless        0.89
Moreover          0.88
Nonetheless       0.87
Activations Density: 0.394%
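
The density figure can be read as the share of token positions on which the feature fires. The page does not say how it was computed, so the following is only a minimal sketch under that assumption: the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature is active; the dashboard itself does not define the term.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 1000 token positions, 4 of them active.
sample = [0.0] * 996 + [0.5, 1.2, 0.3, 2.0]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.394% would mean the feature is active on roughly 4 out of every 1,000 sampled tokens.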