INDEX
Explanations
words related to ethical and philosophical discussions
Negative Logits
Token              Logit
externalToEVAOnly  -0.65
seiz               -0.64
Mub                -0.61
olver              -0.60
stride             -0.59
ilogy              -0.59
dfx                -0.58
disg               -0.58
anchez             -0.56
submar             -0.56
Positive Logits
Token     Logit
ments     1.62
ment      1.46
MENT      1.33
Yourself  1.28
able      1.27
ables     1.25
ings      1.22
MENTS     1.14
ABLE      1.10
ability   1.10
Activations Density 0.178%
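
The density figure is presumably the fraction of sampled tokens on which the feature fires. The page does not state the exact definition, so the following is only a sketch under that assumption. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly above `threshold`.

    Assumption: "density" means the share of sampled tokens on which
    the feature is active. The dashboard may define it differently.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 sampled tokens, 18 of which activate the feature.
sample = [0.0] * 9982 + [1.5] * 18
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.178% would mean the feature is active on roughly 18 of every 10,000 sampled tokens.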