INDEX
Explanations
words related to existence or presence, particularly in a structured or formal context
Negative Logits
er      -0.34
r       -0.26
ar      -0.24
र       -0.24
erse    -0.22
rage    -0.21
ORE     -0.20
ان      -0.20
ract    -0.20
rne     -0.20
Positive Logits
hetics  0.27
hetic   0.25
ablish  0.24
ech     0.23
ev      0.22
eh      0.22
eb      0.21
ead     0.20
ee      0.20
ewart   0.20
Activations Density 0.041%