INDEX
Explanations
phrases related to rationalization and logic
Head Attr Weights
0:0.02
1:0.02
2:0.04
3:0.08
4:0.16
5:0.04
6:0.04
7:0.31
8:0.03
9:0.07
10:0.08
11:0.07
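
To make the head weights easier to compare, here is a minimal sketch that ranks them. It assumes each `head:weight` entry above is that attention head's share of the attribution for this feature, which the page does not state. The names `head_weights` and `rank_heads` are hypothetical and chosen only for illustration.

```python
# Head attribution weights copied from the list above (head index -> weight).
# Assumption: each value is one attention head's share of the attribution.
head_weights = {
    0: 0.02, 1: 0.02, 2: 0.04, 3: 0.08, 4: 0.16, 5: 0.04,
    6: 0.04, 7: 0.31, 8: 0.03, 9: 0.07, 10: 0.08, 11: 0.07,
}


def rank_heads(weights):
    """Return (head, weight) pairs sorted from largest to smallest weight."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)


ranked = rank_heads(head_weights)
top_head, top_weight = ranked[0]
total = sum(head_weights.values())

print(f"top head: {top_head} (weight {top_weight:.2f})")
print(f"sum of listed weights: {total:.2f}")
```

On the values listed, head 7 carries the largest weight and head 4 the second largest. The listed weights sum to slightly under 1, which may simply reflect rounding to two decimals.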
Negative Logits
ibaba       -2.06
auga        -1.64
psey        -1.64
phabet      -1.64
keyes       -1.61
title       -1.57
ailability  -1.56
ighth       -1.56
bley        -1.55
vals        -1.55
Positive Logits
guilt        1.86
inaction     1.66
selfish      1.59
greed        1.54
irrational   1.54
thinking     1.52
differently  1.48
fears        1.47
pleas        1.47
endlessly    1.45
Activations Density 0.001%