INDEX
Explanations
words related to choices or decision-making
references to the word "which" in various contexts
Negative Logits
swick     -0.89
renheit   -0.79
ISTORY    -0.78
bart      -0.76
ibaba     -0.76
wn        -0.76
emp       -0.74
̶         -0.74
shi       -0.73
yrinth    -0.73
Positive Logits
ones          1.18
side          1.16
direction     1.09
wavelengths   1.03
kinds         1.00
aspects       0.99
hemisphere    0.98
parts         0.94
subset        0.91
facets        0.90
Activations Density 0.044%