Explanations
AI safety refusals
Tokens associated with technical formatting, especially web links (protocols/domains), numerals, and punctuation-heavy code/markup fragments.
Negative Logits
,            0.19
、           0.18
vertices     0.18
yaw          0.16
spindles     0.16
initialize   0.16
ballots      0.16
refills      0.16
numerator    0.15
thighs       0.15
Positive Logits
Anyone        0.17
pesar         0.16
बेशक          0.16
situation     0.16
aventure      0.16
Historically  0.16
Anybody       0.15
Patrick       0.15
Generally     0.15
परिस्थिति      0.15
Activations Density 4.019%