INDEX
Explanations
comparisons between different entities or concepts
references to comparative structures or phrases
Negative Logits
activ      -0.66
bys        -0.64
ende       -0.63
aps        -0.62
livest     -0.62
urai       -0.61
orr        -0.60
istries    -0.59
lapt       -0.58
bucks      -0.57
Positive Logits
ordinary   0.86
oneself    0.80
icial      0.73
course     0.70
tains      0.67
lowly      0.65
stood      0.63
wered      0.63
ours       0.62
average    0.61
Activations Density: 0.043%