INDEX
Explanations
personal pronouns standing alone
Negative Logits
Token       Logit
theless     -0.78
rules       -0.69
rooms       -0.67
imentary    -0.66
taboola     -0.62
cov         -0.61
combustion  -0.59
ded         -0.59
ynamic      -0.58
spawn       -0.58
Positive Logits
Token       Logit
OUS         1.11
AMI         1.09
YA          1.08
KE          1.04
BILITY      1.04
ALLY        0.99
RECT        0.99
BA          0.98
ANS         0.97
WI          0.97
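The two lists above rank vocabulary tokens by how strongly this feature pushes their logits down or up. One common way to produce such lists is to project the feature's decoder direction onto the model's unembedding matrix and sort the resulting scores. The sketch below assumes exactly that; the function name `top_logit_effects`, its arguments, and the toy numbers are hypothetical, and the dashboard's actual procedure (for example, whether a final layer norm is folded in) is not stated here.

```python
def top_logit_effects(decoder_dir, unembed, vocab, k=10):
    """Rank tokens by the logit change a feature direction induces.

    decoder_dir: list of d_model floats (hypothetical feature decoder vector)
    unembed: d_model rows, each a list of len(vocab) floats (assumed W_U)
    vocab: token strings, one per column of unembed
    Returns (most_negative, most_positive), each a list of (token, score).
    """
    d_model = len(decoder_dir)
    # Score for token j is the dot product of the feature direction with column j of W_U.
    scores = [
        sum(decoder_dir[i] * unembed[i][j] for i in range(d_model))
        for j in range(len(vocab))
    ]
    # Sort ascending so the most negative effects come first.
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    most_negative = ranked[:k]
    most_positive = ranked[::-1][:k]
    return most_negative, most_positive


# Toy example with a 2-dimensional residual stream and 4 tokens (made-up values).
vocab = ["OUS", "rules", "AMI", "rooms"]
decoder_dir = [1.0, -0.5]
unembed = [
    [1.0, -0.4, 0.8, -0.6],
    [0.2, 0.6, -0.4, 0.0],
]
neg, pos = top_logit_effects(decoder_dir, unembed, vocab, k=2)
print("negative:", [t for t, _ in neg])
print("positive:", [t for t, _ in pos])
```

With real model weights, `k=10` would yield two ten-entry lists like the ones shown.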
Activation Density: 0.031%
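Activation density is typically read as the fraction of tokens in a sample on which the feature fires at all. The sketch below assumes that definition, with "fires" meaning an activation strictly above a threshold of 0; the function name `activation_density` and the synthetic activations are hypothetical, chosen only to reproduce a 0.031% figure.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose feature activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Synthetic sample: the feature fires on 31 of 100,000 tokens.
acts = [0.0] * 99_969 + [0.5] * 31
density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.031%
```

A density this low means the feature is silent on almost every token, which fits a narrow explanation such as the one above.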