INDEX
Explanations
arguments related to morality and hypocrisy
Negative Logits
olio     -0.17
oho      -0.17
als      -0.16
icz      -0.15
dan      -0.15
inte     -0.15
essen    -0.15
inho     -0.15
ounty    -0.15
unch     -0.14
Positive Logits
rather     0.34
nor        0.33
instead    0.32
merely     0.32
nor        0.31
Rather     0.30
rather     0.30
Nor        0.29
Rather     0.29
Instead    0.28
Activations Density 0.252%