INDEX
Explanations
starts of titles or phrases
Negative Logits
новый         -1.09
новой         -1.00
nowy          -0.98
quarantined   -0.94
nový          -0.94
الخاص         -0.94
czerwony      -0.94
phê           -0.93
옐            -0.93
ového         -0.93
Positive Logits
Only          1.36
They          1.23
That          1.05
Those         1.04
Just          1.02
%',           1.02
Most          0.99
As            0.98
Both          0.98
Some          0.97
Activations Density 0.050%