INDEX
Explanations
terms associated with knowledge or acknowledgment
Negative Logits
ritt         -0.16
rana         -0.15
orgh         -0.15
NCY          -0.14
GLOSS        -0.14
itler        -0.14
riba         -0.14
olina        -0.14
elerinden    -0.13
PIO          -0.13
Positive Logits
simply       0.26
collo        0.24
inform       0.24
s            0.23
col          0.22
popular      0.21
affection    0.20
loving       0.20
gener        0.19
various      0.19
Activations Density: 0.028%