INDEX
Explanations
phrases asking for specific information or clarification
queries that seek clarification or explanation about specific topics
Negative Logits
Runner   -0.75
mur      -0.74
bis      -0.72
gi       -0.71
oco      -0.71
adiq     -0.70
mates    -0.69
bart     -0.67
otos     -0.66
gio      -0.66
Positive Logits
constitutes     1.29
transpired      1.27
happened        1.15
happens         1.10
qualifies       1.04
distinguishes   0.96
entails         0.95
separates       0.91
motiv           0.88
bothers         0.83
Activations Density 0.073%