INDEX
Explanations
phrases that make general observations or assertions
Negative Logits
isman      -0.15
ide        -0.15
mitter     -0.15
onga       -0.15
apons      -0.14
igh        -0.14
atorial    -0.14
idan       -0.14
uture      -0.14
htar       -0.14
Positive Logits
why          0.26
why          0.23
incident     0.21
itself       0.18
INCIDENT     0.17
(utf         0.17
Incident     0.16
istrovství   0.16
为什么       0.16
btw          0.16
Activations Density 0.080%