INDEX
Explanations
The presence of specific letter combinations or patterns, particularly those starting with "th" or "oth" or containing repeated sequences.
Negative Logits
Token    Logit
ez       -0.22
er       -0.19
ER       -0.19
ease     -0.18
ech      -0.18
ee       -0.17
eh       -0.17
tero     -0.16
ei       -0.16
wner     -0.16
Positive Logits
Token        Logit
ttp          0.30
entication   0.27
ematics      0.24
ompson       0.23
edral        0.23
aniel        0.23
ousand       0.22
odoxy        0.22
ousands      0.22
्ठ           0.22
Activations Density 0.089%
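
As a rough illustration of what a density figure like this measures, the sketch below computes the percentage of tokens on which a feature fires. It assumes density means "share of tokens with activation above a threshold"; the source does not define it. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of tokens on which the feature is
    active. The source page does not state its exact definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: the feature fires on 89 of 100,000 tokens.
sample = [1.0] * 89 + [0.0] * 99_911
print(f"{activation_density(sample):.3f}%")  # prints 0.089%
```

Under this reading, a density of 0.089% means the feature is active on roughly 9 tokens in every 10,000.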