INDEX
Explanations
phrases indicating pathways or methods toward achieving something
Head Attr Weights
0:0.01
1:0.02
2:0.07
3:0.06
4:0.25
5:0.01
6:0.03
7:0.35
8:0.01
9:0.03
10:0.05
11:0.06
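
A minimal sketch of how the head weights above could be read programmatically, assuming each line is a `head:weight` pair for one of 12 attention heads. The helper names `parse_head_weights` and `top_heads` are hypothetical and are not part of any dashboard API. The sketch ranks the heads by weight and sums the listed values.

```python
# Head attribution weights copied verbatim from the listing above.
# Assumption: each line is "<head index>:<weight>".
RAW = """0:0.01
1:0.02
2:0.07
3:0.06
4:0.25
5:0.01
6:0.03
7:0.35
8:0.01
9:0.03
10:0.05
11:0.06"""


def parse_head_weights(text):
    """Parse 'head:weight' lines into a {head_index: weight} dict."""
    weights = {}
    for line in text.splitlines():
        head, value = line.split(":")
        weights[int(head)] = float(value)
    return weights


def top_heads(weights, k=2):
    """Return the k head indices with the largest weights, largest first."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


weights = parse_head_weights(RAW)
print(top_heads(weights))               # [7, 4]
print(round(sum(weights.values()), 2))  # total of the listed (rounded) weights
```

Under this reading, heads 7 and 4 carry most of the attribution. The listed values are rounded to two decimals, so their total need not be exactly 1.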
Negative Logits
describ     -1.78
DEFENSE     -1.58
owe         -1.55
commit      -1.54
quartered   -1.52
IRED        -1.51
burse       -1.51
hillary     -1.48
waive       -1.48
enough      -1.44
Positive Logits
mush        1.69
Brill       1.47
Reality     1.45
tyranny     1.44
quot        1.43
mell        1.41
hordes      1.38
rama        1.36
reality     1.36
4090        1.32
Activations Density 0.001%