INDEX
Explanations
news-related terms and names of specific individuals involved in political actions or events
Negative Logits
hetti        -0.85
Defin        -0.67
regor        -0.66
asse         -0.66
pired        -0.66
habitable    -0.65
comma        -0.65
ength        -0.64
inho         -0.62
asus         -0.61
Positive Logits
worthy        1.07
reader        1.05
flash         1.03
agency        1.01
room          0.99
groups        0.98
worthiness    0.97
agent         0.96
group         0.95
conference    0.95
Activations Density: 0.031%