Explanations
phrases that imply comparison or categorization of concepts
Negative Logits
olo      -0.17
ittel    -0.16
anken    -0.16
orz      -0.15
reon     -0.14
edu      -0.14
org      -0.14
ongs     -0.14
ntag     -0.14
Sloan    -0.14
Positive Logits
aggio    0.16
æ¸Ī      0.15
LEV      0.15
Satoshi  0.15
cant     0.15
Cant     0.14
chwitz   0.14
å¶       0.14
ckill    0.14
stery    0.14
Activations Density 0.280%