Explanations
repeated mentions of the word "each"
Negative Logits
Token     Logit
SUT       -0.71
er        -0.66
Goy       -0.66
Klo       -0.62
Cof       -0.60
Bons      -0.60
Lol       -0.59
Lol       -0.58
buster    -0.58
able      -0.57
Positive Logits
Token         Logit
EACH          1.50
EACH          1.36
each          1.24
each          1.21
Each          1.19
Each          1.15
BeforeEach    1.14
각            1.11
masing        1.10
Chaque        1.07
Activations Density: 0.065%