INDEX
Explanations
nouns associated with positive attributes or evaluations
Negative Logits
AGED     -0.15
Lyon     -0.15
langu    -0.15
qu       -0.14
Gle      -0.14
785      -0.14
inals    -0.14
/be      -0.14
utsch    -0.14
own      -0.14
Positive Logits
overall   0.19
overall   0.17
abela     0.15
ucch      0.15
Overall   0.15
anner     0.15
Overall   0.15
ュー      0.15
RIORITY   0.15
illis     0.15
Activations Density 0.149%