INDEX
Explanations
a variety of words related to observation or surveillance
Negative Logits
ãĥ´        -0.79
esan       -0.72
enture     -0.70
bably      -0.69
eno        -0.67
lishes     -0.66
xual       -0.64
afort      -0.64
ãĤ¨ãĥ«     -0.63
wealth     -0.62
Positive Logits
dog        1.32
tower      1.26
dogs       1.26
Watching   1.09
helpless   0.95
watching   0.95
attent     0.89
watch      0.89
closely    0.88
opes       0.88
Activations Density 1.672%
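
Two entries in the negative-logit list (ãĥ´ and ãĤ¨ãĥ«) look like encoding debris, but they have the shape of byte-level BPE token strings, where each raw byte is shown as a printable stand-in character. The sketch below assumes the vocabulary follows the GPT-2 style byte-to-character table, which the page itself does not confirm. The helper names `byte_to_unicode_table` and `decode_token` are illustrative and not part of any dashboard API.

```python
def byte_to_unicode_table():
    """Rebuild the GPT-2 style byte -> printable character table (assumed here)."""
    # Bytes that are already printable map to themselves.
    keep = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    table = {b: chr(b) for b in keep}
    # Every remaining byte is assigned the next code point starting at U+0100.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


# Inverse table: stand-in character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in byte_to_unicode_table().items()}


def decode_token(token: str) -> str:
    """Map a byte-level token string back to the UTF-8 text it encodes."""
    # Raises KeyError if the token contains a character outside the table.
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # Tokens may end mid-character, so undecodable bytes become U+FFFD.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["ãĥ´", "ãĤ¨ãĥ«", "watching"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

Under that assumption, `decode_token("ãĥ´")` returns `"ヴ"` and `decode_token("ãĤ¨ãĥ«")` returns `"エル"`, so both entries would be katakana tokens rather than corrupted text. Plain ASCII tokens such as `watching` pass through unchanged.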