INDEX
Explanations
phrases indicating explanation or justification
expressions indicating justification or explanation
Negative Logits
Token        Logit
Blue         -0.62
Dragon       -0.60
ModLoader    -0.60
TR           -0.59
rab          -0.58
Enough       -0.57
cause        -0.57
Tokens       -0.57
procedural   -0.55
VERSION      -0.55
Positive Logits
Token        Logit
SPONSORED    0.75
alone        0.72
士           0.71
idents       0.66
phas         0.65
zik          0.63
we           0.60
gha          0.60
akedown      0.59
contrasts    0.59
Activations Density: 0.080%