Explanations
phrases indicating a disconnection from reality or understanding
Head Attr Weights
0:0.02
1:0.01
2:0.08
3:0.09
4:0.14
5:0.03
6:0.03
7:0.36
8:0.03
9:0.03
10:0.06
11:0.08
Negative Logits
ansom      -1.83
etheus     -1.79
slideshow  -1.75
perse      -1.67
leck       -1.65
rones      -1.62
roundup    -1.60
ginx       -1.56
gins       -1.52
gallery    -1.48
Positive Logits
reality        1.92
realities      1.88
whence         1.52
stereotype     1.48
norm           1.44
Establishment  1.43
Saiyan         1.39
Dealer         1.38
sentiments     1.38
stereotypes    1.37
Activations Density 0.001%