INDEX
Explanations
words indicating emotional states or reflections on relationships
Negative Logits
either   -0.23
either   -0.20
Either   -0.20
instead  -0.17
asa      -0.17
Either   -0.17
ither    -0.17
377      -0.16
645      -0.15
205      -0.15
Positive Logits
そして  0.21
および  0.20
及  0.19
以及  0.18
AND  0.18
及び  0.17
lẫn  0.17
乃  0.17
及  0.17
및  0.17
Activations Density 0.021%