- Documents represented as vectors in a multi-dimensional Euclidean space
- Each axis = a term (token)
- Coordinate of document d in direction of term t determined by:
- Term frequency TF(d,t)
- Inverse document frequency IDF(t), which scales down the coordinates of terms that occur in many documents
Term frequency and inverse document frequency
- Given
- D is the document collection and D_t is the set of documents containing t
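- Formulae (one common form): TF(d,t) = n(d,t) / Σ_τ n(d,τ), where n(d,t) is the number of occurrences of t in d; IDF(t) = log(|D| / |D_t|)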
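As a rough illustration of the two quantities, the sketch below computes TF and IDF on a toy collection. It assumes a length-normalized TF and the plain log-ratio IDF, log(|D| / |D_t|); other variants (for example smoothed or dampened forms) are also used in practice. The function names and the three-document corpus are hypothetical, chosen only for this example.

```python
import math
from collections import Counter

def term_frequency(doc_tokens):
    """TF(d, t): occurrences of t in d divided by the length of d (one common choice)."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {t: n / total for t, n in counts.items()}

def inverse_document_frequency(corpus):
    """IDF(t) = log(|D| / |D_t|), where D_t is the set of documents containing t."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc_tokens in corpus:
        doc_freq.update(set(doc_tokens))  # count each document at most once per term
    return {t: math.log(n_docs / df) for t, df in doc_freq.items()}

# Toy collection (assumption: whitespace tokenization, no stemming).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
idf = inverse_document_frequency(corpus)
tf0 = term_frequency(corpus[0])
```

Here "the" appears in two of the three documents and so receives a smaller IDF than "cat", which appears in only one.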
Vector space model
- Coordinate of document d in axis t: d_t = TF(d,t) · IDF(t)
- Query q
- Interpreted as a document
- Transformed to a vector q in the same TFIDF-space as d
- Distance measure
- Magnitude of the vector difference, |d - q|
- Document vectors must be normalized to unit (L1 or L2) length
- Else shorter documents dominate (since queries are short)
- Cosine similarity
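The ranking step can be sketched as follows: documents and the query are mapped into the same TF-IDF space, scaled to unit L2 length, and compared by cosine similarity and by Euclidean distance. This is a minimal sketch, not a reference implementation; it assumes length-normalized TF, IDF(t) = log(|D| / |D_t|), and sparse dictionary vectors, and all function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Map a token list to a sparse TF-IDF vector {term: weight}."""
    counts = Counter(tokens)
    total = len(tokens)
    # Terms that never occur in the collection have no IDF and are dropped.
    return {t: (n / total) * idf[t] for t, n in counts.items() if t in idf}

def l2_normalize(vec):
    """Scale a sparse vector to unit L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def euclidean(u, v):
    """Magnitude of the vector difference |u - v|."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
doc_freq = Counter(t for d in docs for t in set(d))
idf = {t: math.log(len(docs) / df) for t, df in doc_freq.items()}

# The query is treated as a (short) document in the same space.
doc_vecs = [l2_normalize(tfidf_vector(d, idf)) for d in docs]
query_vec = l2_normalize(tfidf_vector("cat mat".split(), idf))

scores = [cosine(query_vec, d) for d in doc_vecs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

For unit-length vectors the two measures agree on ranking, since |d - q|² = 2 - 2·cos(d, q); without normalization, vector length alone can change the distance-based ordering.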
Relevance feedback
- Users learning how to modify queries
- Response list must have at least some relevant documents
- Relevance feedback
- Rocchio's method
- Folding-in user feedback
- To query vector
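One way to fold user feedback into the query vector is the textbook form of Rocchio's update, q' = α·q + β·mean(D+) - γ·mean(D-), where D+ and D- are the documents the user marked relevant and irrelevant. The sketch below assumes this averaged (centroid) variant, sparse dictionary vectors, and illustrative default weights; the function name, parameter values, and example vectors are assumptions, not taken from the slides.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*q + beta*mean(relevant) - gamma*mean(irrelevant).

    All vectors are sparse dicts {term: weight}. The default weights are
    illustrative assumptions and should be tuned for a real system.
    """
    new_query = {t: alpha * w for t, w in query.items()}
    for docs, weight in ((relevant, beta), (irrelevant, -gamma)):
        if not docs:
            continue  # no feedback of this kind was given
        for doc in docs:
            for t, w in doc.items():
                new_query[t] = new_query.get(t, 0.0) + weight * w / len(docs)
    # Terms whose weight is not positive are dropped (a common convention).
    return {t: w for t, w in new_query.items() if w > 0}

# Hypothetical example: the user wants the animal, not the car.
q = {"jaguar": 1.0}
relevant = [{"jaguar": 0.5, "cat": 0.5}]
irrelevant = [{"jaguar": 0.5, "car": 0.5}]
new_q = rocchio(q, relevant, irrelevant)
```

The updated query gains weight on "cat", a term from the relevant document, while "car" ends up with a negative weight and is removed.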