INDEX
Explanations
emphasizing phrases that highlight importance or significance
Negative Logits
ロー      -0.15
itin      -0.15
ares      -0.14
oyo       -0.14
è©        -0.14
ience     -0.14
eros      -0.14
orate     -0.14
alin      -0.14
opol      -0.14
Positive Logits
example      0.17
importantly  0.16
ensch        0.16
obvious      0.15
uka          0.15
/example     0.15
筒           0.14
utz          0.14
重要         0.14
£¼           0.14
Activations Density 0.274%