INDEX
Explanations
phrases indicating importance or significance
Negative Logits
zd            -0.17
rets          -0.17
usercontent   -0.16
inki          -0.15
zers          -0.15
ENDOR         -0.15
ertoire       -0.15
OPSIS         -0.15
OLUMN         -0.14
.hwp          -0.14
Positive Logits
acc           0.15
plier         0.15
airo          0.14
arda          0.14
rag           0.14
owi           0.14
null          0.14
acc           0.14
aller         0.13
æ¡IJ          0.13
Activations Density 0.076%