INDEX
Explanations
words and phrases indicating outcomes or results
Head Attr Weights
0:0.02
1:0.02
2:0.06
3:0.06
4:0.17
5:0.02
6:0.06
7:0.35
8:0.03
9:0.03
10:0.07
11:0.04
Negative Logits
Consent       -1.75
notes         -1.61
Pledge        -1.60
remembrance   -1.60
AES           -1.58
Hash          -1.56
mem           -1.54
ネ            -1.53
Memories      -1.53
Password      -1.52
Positive Logits
unfair        1.73
cheaper       1.70
ophobic       1.68
gloom         1.67
smoother      1.64
ONSORED       1.58
absurdity     1.56
worse         1.56
unprepared    1.52
inefficient   1.50
Activations Density: 0.001%