Explanations
abstract concepts or ideas
references to abstract concepts or ideas
Negative Logits
risome      -0.84
ICAN        -0.76
odder       -0.74
UNCH        -0.72
omore       -0.70
attering    -0.68
unker       -0.67
ogi         -0.67
asts        -0.66
artney      -0.66
Positive Logits
ions        1.14
stract      0.93
edly        0.88
matter      0.85
algebra     0.84
edIn        0.82
abstract    0.80
syntax      0.77
urally      0.76
painter     0.76
Activations Density 0.018%