INDEX
Explanations
email addresses in text
mentions of email addresses or social media handles
Negative Logits
Catal      -0.64
gambling   -0.60
acqu       -0.58
reward     -0.58
torn       -0.57
analges    -0.57
Ninth      -0.56
attempt    -0.56
decl       -0.56
Kidd       -0.55
Positive Logits
@          4.06
@          2.04
"@         1.84
@#         1.66
(@         1.49
#          1.27
://        1.26
=#         1.14
@@         1.11
           1.04
Activations Density 0.013%