INDEX
Explanations
advertisements within text, as indicated by the consistent high activations for the word "Advertisement."
various occurrences of advertisements
Negative Logits
wcs           -0.71
manif         -0.66
nonviolent    -0.64
integrity     -0.62
servicing     -0.61
overcoming    -0.61
bip           -0.59
vert          -0.59
thrill        -0.54
gra           -0.53
Positive Logits
theless         1.00
Advertisement   0.89
itto            0.65
olicy           0.64
RFC             0.64
}}              0.62
acters          0.62
ulhu            0.59
Reese           0.59
istani          0.59
Activations Density 0.030%
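
As a rough illustration of what an activation density figure like this could mean, the sketch below treats density as the fraction of token positions on which the feature's activation is above zero. That definition, the function name `activation_density`, the `threshold` parameter, and the sample data are all assumptions made for illustration; the source does not say how the dashboard computes its value.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumed definition for illustration; not confirmed by the source.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 3 of which activate the feature.
sample = [0.0] * 9997 + [1.2, 0.4, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.030%
```

Under this reading, a density of 0.030% would correspond to the feature firing on about 3 of every 10,000 tokens, which fits a feature tied to a single marker word such as "Advertisement".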