Explanations
phrases related to direction or orientation
references to directional movement or guidance
Negative Logits
esters   -0.74
aqu      -0.73
unker    -0.73
enty     -0.70
akov     -0.69
itted    -0.68
athered  -0.67
ighth    -0.66
iltr     -0.66
Surviv   -0.65
Positive Logits
direction   1.24
ality       1.09
directions  1.09
towards     0.95
toward      0.93
finding     0.85
finder      0.84
Direction   0.83
ward        0.79
ally        0.79
Activations Density: 0.038%
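
The page does not define how the density figure is computed. A common reading, assumed here, is the fraction of sampled token positions on which the feature's activation is nonzero. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means (active positions) / (all sampled positions).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 token positions, 4 of which activate the feature.
acts = [0.0] * 10_000
for position in (12, 480, 3071, 9999):
    acts[position] = 1.5

density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.040%
```

Under this reading, a density of 0.038% would correspond to the feature firing on roughly 38 of every 100,000 sampled tokens.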