INDEX
Explanations
phrases indicating reference or direction
Negative Logits
ff        -0.07
ck        -0.07
ament     -0.06
ez        -0.06
al        -0.06
oundary   -0.06
today     -0.06
nobody    -0.06
Brit      -0.05
irit      -0.05
Positive Logits
erview    0.08
ouncil    0.08
æĻĵ       0.07
engin     0.07
enville   0.07
illard    0.07
ÄŁÃ¼      0.07
abama     0.07
ACHER     0.07
çŁ        0.07
Activations Density 0.002%