INDEX
Explanations
phrases that indicate a generalization or conclusion
Negative Logits
orns     -0.17
inja     -0.16
idis     -0.15
ẻ        -0.15
nap      -0.15
ipur     -0.14
ngör     -0.14
essim    -0.14
edic     -0.14
inya     -0.14
Positive Logits
-called  0.20
exh      0.17
jaw      0.17
CKET     0.16
aft      0.15
far      0.15
aking    0.14
eff      0.14
il       0.14
benign   0.13
Activations Density 0.037%