INDEX
Explanations
patterns or elements indicative of scientific terminology and variables
Negative Logits
Token    Logit
ement    -0.54
Vere     -0.48
nent     -0.47
ame      -0.45
mani     -0.45
Ye       -0.45
TIS      -0.44
xis      -0.44
aren     -0.43
uket     -0.43
Positive Logits
Token        Logit
NameInMap    0.96
PW           0.94
MW           0.90
PW           0.89
AW           0.87
GW           0.85
LW           0.84
HW           0.81
BW           0.81
twimg        0.81
Activations Density: 0.385%