INDEX
Explanations
phrases containing the word "Bet" and associated words
references to betting or gambling activities
Negative Logits
ĸļ           -0.99
anwhile      -0.82
obser        -0.77
eclipse      -0.74
VILLE        -0.74
allery       -0.69
artifacts    -0.67
士           -0.66
OPLE         -0.66
SPONSORED    -0.65
Positive Logits
ting         1.30
hesda        1.15
ray          1.06
rol          1.01
lehem        0.98
tery         0.96
ters         0.95
terness      0.91
rom          0.90
ron          0.90
Activations Density: 0.009%