INDEX
Explanations
instances of the word "h" and variations with different activation values
Negative Logits
Eternity     -0.71
mosqu        -0.65
ducks        -0.64
Bots         -0.64
conclud      -0.64
destro       -0.61
eering       -0.61
wedge        -0.60
heartbeat    -0.60
Clicker      -0.60
Positive Logits
ulhu         1.24
orses        1.10
arma         1.05
atever       1.03
agen         1.03
arel         1.03
ilar         1.02
ollow        0.97
airy         0.95
onest        0.94
Activations Density 0.011%