Explanations
words and phrases indicating specific structural features or relationships
Negative Logits
Lester     -0.15
usta       -0.14
inski      -0.14
ppy        -0.14
hawks      -0.14
ych        -0.14
monic      -0.14
icont      -0.14
eer        -0.14
innacle    -0.14
POSITIVE LOGITS
ologne     0.16
ngth       0.15
之一       0.14
zeň        0.14
maz        0.14
noch       0.13
صاح        0.13
olie       0.13
atas       0.13
665        0.13
Activations Density 0.009%