Explanations
No Explanations Found
Negative Logits
agreed       0.62
agree        0.51
apoyo        0.48
opposition   0.46
AGRE         0.46
ace          0.44
agreement    0.44
isopropyl    0.43
fear         0.43
obey         0.43
Positive Logits
Ratings      0.63
ratings      0.59
Ratings      0.57
ratings      0.56
Spending     0.48
routing      0.48
Rating       0.47
Pending      0.47
Spending     0.47
rating       0.46
Activations Density: 0.002%