INDEX
Explanations
short phrases related to current events or news headlines
sentence endings or punctuation marks in discussions of serious topics
Negative Logits
predec          -0.78
iste            -0.71
unprotected     -0.69
manif           -0.68
explan          -0.67
defe            -0.66
allocation      -0.65
mosqu           -0.65
shaving         -0.63
multiplication  -0.62
Positive Logits
Their           0.90
They            0.89
Besides         0.85
Its             0.82
*)              0.82
Such            0.82
These           0.81
Fortunately     0.81
Specifically    0.81
Additionally    0.81
Activations Density 0.487%