Explanations
phrases indicating awareness or realization of influence and consequence
Negative Logits
ommen       -0.15
_unused     -0.14
inha        -0.14
zl          -0.14
ивши        -0.14
.om         -0.14
uto         -0.13
DataView    -0.13
xic         -0.13
ixer        -0.13
Positive Logits
Aware       0.21
Conscious   0.21
conscious   0.20
Aware       0.20
awareness   0.18
Awareness   0.18
aware       0.17
aware       0.17
ampil       0.16
-aware      0.16
Activations Density 0.007%