INDEX
Explanations
phrases related to scams
instances of the word "spam."
Negative Logits
Token         Logit
REDACTED      -0.68
Vernon        -0.67
dimensions    -0.64
tremend       -0.63
Guinness      -0.62
spirits       -0.62
Sketch        -0.60
presence      -0.59
é»Ĵ           -0.59
Spirits       -0.59
Positive Logits
Token         Logit
pling         1.13
sterdam       1.04
nesty         1.03
ilial         1.00
ilies         0.99
ulet          0.95
azing         0.95
bitious       0.94
utation       0.93
essage        0.93
Activations Density: 0.016%
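
The page does not define the density figure. A common reading, assumed here, is the share of sampled tokens on which the feature activates at all. The sketch below illustrates that reading; the function name `activation_density`, the zero threshold, and the sample data are all hypothetical and not taken from the page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means fraction of tokens where the feature fires.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 2 of 12,500 tokens.
sample = [0.0] * 12_498 + [3.2, 1.7]
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.016%"
```

Under this reading, a density of 0.016% means the feature fires on roughly 1 token in 6,250, which would make it a sparse feature.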