Explanations
adjectives or nouns related to controversial, suspected, or challenging situations
Negative Logits
lang       -0.91
wright     -0.90
ertodd     -0.89
someone    -0.86
ometers    -0.84
speak      -0.83
rollers    -0.82
rea        -0.82
aido       -0.81
Feet       -0.80
Positive Logits
inability      1.06
absence        1.05
contribution   1.00
injunction     0.98
tendency       0.96
animosity      0.95
influx         0.95
arrival        0.95
threat         0.94
lack           0.94
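The page does not say how the two logit lists are produced. Assuming they are the most negative and most positive entries of a single per-token logit-effect vector for this feature, the minimal Python sketch below ranks the twenty tokens shown here with the standard-library `heapq` module. The dictionary holds only the listed values, not the full vocabulary, and `top_and_bottom` is a hypothetical helper.

```python
import heapq

# Assumption: both lists are the extremes of one token -> logit-effect mapping.
# Only the 20 tokens shown on this page are included, not the full vocabulary.
logit_effects = {
    "lang": -0.91, "wright": -0.90, "ertodd": -0.89, "someone": -0.86,
    "ometers": -0.84, "speak": -0.83, "rollers": -0.82, "rea": -0.82,
    "aido": -0.81, "Feet": -0.80,
    "inability": 1.06, "absence": 1.05, "contribution": 1.00,
    "injunction": 0.98, "tendency": 0.96, "animosity": 0.95,
    "influx": 0.95, "arrival": 0.95, "threat": 0.94, "lack": 0.94,
}

def top_and_bottom(effects, k):
    """Hypothetical helper: return the k most positive and k most negative tokens."""
    positive = heapq.nlargest(k, effects, key=effects.get)
    negative = heapq.nsmallest(k, effects, key=effects.get)
    return positive, negative

pos, neg = top_and_bottom(logit_effects, 3)
print(pos)  # ['inability', 'absence', 'contribution']
print(neg)  # ['lang', 'wright', 'ertodd']
```

With k = 10 the same call returns the two full lists above; tied values (for example the three tokens at 0.95) may come back in any order.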
Activation Density: 2.793%