INDEX
Explanations
expressions related to moral or ethical correctness
doing the right thing
Negative Logits
InputBorder      -0.50
ſever            -0.49
raiſ             -0.46
deſt             -0.45
uſed             -0.44
Reſ              -0.43
miſ              -0.43
PerformLayout    -0.42
tranſ            -0.42
myſelf           -0.42
Positive Logits
فريبيس           0.54
<<<<<<<<<<<<<<   0.49
новниш           0.44
verwijspagina    0.44
afin             0.42
olin             0.41
ReusableCell     0.40
Right            0.39
ceği             0.38
SAFE             0.38
Activations Density 0.085%
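
The density figure above can be read as the share of token positions on which the feature fires at all. The page does not say how it is computed, so the following is only a minimal sketch under that assumption. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions where the feature
    is active (activation > 0). The source page does not confirm this.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 17 of 20,000 token positions.
sample = [0.0] * 19983 + [1.5] * 17
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.085%
```

Under this reading, a density of 0.085% corresponds to a sparse feature that is active on fewer than one in a thousand tokens.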