INDEX
Explanations
letters that are put together like 'Ċ'
instances of numerical data or metrics related to counts or references
Negative Logits
ĸļ        -0.96
ername    -0.94
iage      -0.78
terday    -0.75
oun       -0.74
terness   -0.74
uto       -0.73
hement    -0.73
lihood    -0.70
agy       -0.66
Positive Logits
Mand      0.85
Quotes    0.79
LESS      0.78
WIND      0.78
Viol      0.77
Trivia    0.76
Cla       0.76
OVER      0.76
Act       0.75
ORN       0.74
Activations Density: 0.499%