INDEX
Explanations
words related to morality and ethical behavior
Negative Logits
ierrez      -0.65
izoph       -0.60
coasts      -0.60
ritch       -0.59
rooms       -0.57
awhile      -0.56
4090        -0.56
cooldown    -0.55
redo        -0.54
Estimated   -0.54
Positive Logits
iak         0.94
stru        0.89
line        0.85
bol         0.85
lines       0.84
ko          0.83
ais         0.83
ī           0.80
SHIP        0.79
opter       0.78
Activations Density: 0.029%