Explanations
words associated with endings or conclusions
closing punctuation marks, indicating the end of sentences or segments
Negative Logits
ategory     -0.76
ppo         -0.73
IGHTS       -0.72
kaya        -0.67
kefeller    -0.66
absor       -0.65
alcohol     -0.65
PDATE       -0.64
underest    -0.64
terness     -0.62
Positive Logits
angered     1.02
orph        1.01
angering    0.95
ragon       0.94
lich        0.94
urance      0.93
erer        0.92
ering       0.88
ocrin       0.86
ulum        0.85
Activations Density 0.031%