INDEX
Explanations
references to official websites
Negative Logits
words         -0.77
teen          -0.67
DonaldTrump   -0.67
ciples        -0.61
scene         -0.58
trespass      -0.58
PLIED         -0.57
qus           -0.56
Fine          -0.56
downside      -0.56
Positive Logits
ensen         1.05
sky           1.04
enson         0.99
asms          0.98
roup          0.98
ues           0.94
ersen         0.94
uin           0.91
hetti         0.89
nette         0.88
Activations Density 0.016%