INDEX
Explanations
questions related to specific topics or entities
phrases that inquire about specific topics or concepts
Negative Logits
Token           Logit
ornings         -0.85
alde            -0.83
chairs          -0.82
adoes           -0.82
classes         -0.80
runs            -0.79
months          -0.78
aunts           -0.77
hops            -0.77
sheets          -0.76

Positive Logits
Token           Logit
difference      1.16
significance    1.14
Difference      1.00
takeaway        0.99
purpose         0.96
reperc          0.95
point           0.94
rationale       0.94
biggest         0.93
optimum         0.93
Activations Density: 0.074%