Explanations
phrases emphasizing relationships and connections between individuals or groups
Negative Logits
itſelf      -1.13
ſtate       -1.02
themſelves  -0.99
pleaſure    -0.98
Jefus       -0.97
myſelf      -0.97
ſelf        -0.96
Majefty     -0.92
occaf       -0.92
Efq         -0.90
Positive Logits
two   1.10
two   1.09
Two   1.05
Two   1.01
TWO   1.00
TWO   1.00
שני   0.95
zwei  0.95
deux  0.89
δύο   0.89
Activations Density 0.086%