INDEX
Explanations
phrases expressing intent or meaning
phrases that assert disclaimers or qualifications
Negative Logits
ku            -0.77
awoken        -0.67
busters       -0.67
anded         -0.64
bonds         -0.63
knit          -0.63
dt            -0.62
Commands      -0.61
locked        -0.60
than          -0.58
Positive Logits
necessarily    0.82
nor            0.80
anymore        0.80
exaggeration   0.79
disrespect     0.77
hesda          0.76
anyone         0.76
lightly        0.75
discouraged    0.72
anybody        0.72
Activations Density 0.226%