INDEX
Explanations
prepositions and location-related terms
prepositions and expressions of relationships in text
Negative Logits
hement      -0.85
ategory     -0.80
antine      -0.79
hower       -0.69
heastern    -0.67
icularly    -0.66
nodd        -0.64
icut        -0.64
nor         -0.64
ering       -0.63
Positive Logits
Humanity    0.94
Noise       0.87
Hate        0.83
Geek        0.83
Represent   0.80
Extrem      0.78
Difference  0.77
Computing   0.76
Anarch      0.75
Blind       0.75
Activations Density 0.279%