INDEX
Explanations
references to the English language and related terms
Negative Logits
ici         -0.17
ulence      -0.16
izon        -0.15
ally        -0.14
aper        -0.14
ickle       -0.14
ethe        -0.14
nda         -0.14
Frontier    -0.14
tura        -0.14
Positive Logits
-speaking   0.21
-language   0.17
enment      0.17
man         0.17
ning        0.15
ALT         0.15
abez        0.15
ered        0.15
[token missing in source]  0.14
women       0.14
Activations Density 0.032%