INDEX
Explanations
phrases indicating something is not socially or morally appropriate
words related to acceptability or unacceptability
Negative Logits
ynthesis      -0.83
ocket         -0.80
frey          -0.76
planes        -0.75
ilant         -0.73
lets          -0.72
dream         -0.72
helic         -0.69
berry         -0.69
wright        -0.68

Positive Logits
GoldMagikarp   0.80
srfAttach      0.75
CPC            0.75
lihood         0.73
deviations     0.72
ible           0.71
âĶĢâĶĢ         0.70
precedent      0.69
itable         0.69
Gi             0.68
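The two lists above appear to rank vocabulary tokens by a per-token logit score, most negative first and most positive first. A minimal sketch of that ranking follows, assuming the scores are available as a plain token-to-score mapping; the helper name `top_k` is hypothetical, and only a subset of the table's values is used.

```python
# Hypothetical helper: rank tokens by a per-token logit score.
# Assumption: scores are a plain dict of token -> float (a subset of the table above).
scores = {
    "ynthesis": -0.83, "ocket": -0.80, "frey": -0.76,
    "GoldMagikarp": 0.80, "srfAttach": 0.75, "CPC": 0.75,
}


def top_k(scores: dict[str, float], k: int, largest: bool) -> list[tuple[str, float]]:
    """Return the k tokens with the largest (or smallest) scores.

    Python's sort is stable, so tied scores keep their insertion order.
    """
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=largest)[:k]


print(top_k(scores, 2, largest=True))   # [('GoldMagikarp', 0.8), ('srfAttach', 0.75)]
print(top_k(scores, 2, largest=False))  # [('ynthesis', -0.83), ('ocket', -0.8)]
```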
Activation Density: 0.029%
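Activation density is commonly read as the share of token positions on which the feature fires. A minimal sketch follows, assuming a position counts as active when its activation is strictly greater than zero; the function name and the 100,000-token example are hypothetical.

```python
def activation_density(activations: list[float], threshold: float = 0.0) -> float:
    """Percentage of token positions whose activation exceeds `threshold`.

    Assumption: 'active' means strictly greater than the threshold
    (0.0 by default, as for a ReLU-style feature).
    """
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical example: 29 active positions out of 100,000 tokens.
acts = [0.0] * 100_000
for i in range(29):
    acts[i * 1000] = 1.3

print(f"{activation_density(acts):.3f}%")  # 0.029%
```

At this density the feature is silent on almost every token, which fits explanations that describe a narrow concept.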