Explanations
phrases that involve a division or categorization into two groups or types
structures that categorize information or concepts
Negative Logits
board    -0.69
atown    -0.68
idth     -0.67
ritic    -0.66
¶        -0.66
zman     -0.64
Ł        -0.64
enger    -0.63
eez      -0.62
nell     -0.61
Positive Logits
halves   1.01
Firstly  0.89
viz      0.84
namely   0.80
sexes    0.80
hemat    0.79
Firstly  0.76
sides    0.73
\'       0.72
thirds   0.70
Activations Density 0.231%