INDEX
Explanations
phrases with the word "two"
references to the number two
Negative Logits
ugu        -0.77
asta       -0.75
atown      -0.72
iggins     -0.68
annel      -0.68
ubs        -0.68
ushima     -0.67
yz         -0.66
renheit    -0.66
ovi        -0.66
Positive Logits
halves     1.23
thirds     1.19
fold       1.02
sexes      0.95
sides      0.89
dozen      0.88
teenth     0.85
Kore       0.84
main       0.80
hundred    0.79
Activations Density: 0.054%