Vector field #33022
Conversation
1. Dense vector

```
PUT dindex
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_vector": { "type": "dense_vector" },
        "my_text": { "type": "keyword" }
      }
    }
  }
}

PUT dindex/_doc/1
{ "my_text": "text1", "my_vector": [ 0.5, 10, 6 ] }

PUT dindex/_doc/2
{ "my_text": "text2", "my_vector": [ 0.5, 10, 10 ] }

GET dindex/_search
{
  "query": {
    "vector": {
      "field": "my_vector",
      "query_vector": [ 0.5, 10, 10 ]
    }
  }
}
```

Result:

```
"hits": [
  { "_index": "dindex", "_type": "_doc", "_id": "2", "_score": 1.0000001,
    "_source": { "my_text": "text2", "my_vector": [ 0.5, 10, 10 ] } },
  { "_index": "dindex", "_type": "_doc", "_id": "1", "_score": 0.97016037,
    "_source": { "my_text": "text1", "my_vector": [ 0.5, 10, 6 ] } }
]
```

2. Sparse vector

```
PUT sindex
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_vector": { "type": "sparse_vector" },
        "my_text": { "type": "keyword" }
      }
    }
  }
}

PUT sindex/_doc/1
{ "my_text": "text1", "my_vector": { "1": 0.5, "99": -0.5, "5": 1 } }

PUT sindex/_doc/2
{ "my_text": "text2", "my_vector": { "103": 0.5, "4": -0.5, "5": 1 } }

GET sindex/_search
{
  "query": {
    "vector": {
      "field": "my_vector",
      "query_vector": { "99": -0.5, "1": 0.5, "5": 1 }
    }
  }
}
```

Result:

```
"hits": [
  { "_index": "sindex", "_type": "_doc", "_id": "1", "_score": 0.99999994,
    "_source": { "my_text": "text1", "my_vector": { "1": 0.5, "99": -0.5, "5": 1 } } },
  { "_index": "sindex", "_type": "_doc", "_id": "2", "_score": 0.6666666,
    "_source": { "my_text": "text2", "my_vector": { "103": 0.5, "4": -0.5, "5": 1 } } }
]
```

Search with a filter:

```
GET sindex/_search
{
  "query": {
    "bool": {
      "must": { "match": { "my_text": "text2" } },
      "should": {
        "vector": {
          "field": "my_vector",
          "query_vector": { "99": -0.5, "1": 0.5, "5": 1 }
        }
      }
    }
  }
}
```

Result:

```
"hits": [
  { "_index": "sindex", "_type": "_doc", "_id": "2", "_score": 0.6931472,
    "_source": { "my_text": "text2", "my_vector": { "103": 0.5, "4": -0.5, "5": 1 } } }
]
```

3. Implementation details

3.1 Dense Vector

- `BinaryDocValuesField` with a byte array laid out as:
  - integer (number of dimensions)
  - array of integers (encoded array of float values)

3.2 Sparse Vector

- `BinaryDocValuesField` with a byte array laid out as:
  - integer (number of dimensions)
  - array of integers (encoded array of float values)
  - array of integers (array of integer dimensions)

Relates to elastic#31615
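The dense layout above can be sketched in plain Java. This is a simplified, hypothetical model (the class name `DenseVectorCodec` is invented, and it packs 4 raw bytes per float via `Float.floatToIntBits` rather than reproducing the PR's exact `NumericUtils`-based encoding):

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a dense_vector doc-value payload:
// each float becomes its raw int bits, 4 bytes per dimension.
public final class DenseVectorCodec {

    // Encode a float[] into a byte[] of length 4 * values.length.
    public static byte[] encode(float[] values) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES * values.length);
        for (float v : values) {
            buf.putInt(Float.floatToIntBits(v));
        }
        return buf.array();
    }

    // Decode: the dimension count is simply bytes.length / 4,
    // so no explicit count needs to be stored.
    public static float[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        float[] values = new float[bytes.length / Integer.BYTES];
        for (int i = 0; i < values.length; i++) {
            values[i] = Float.intBitsToFloat(buf.getInt());
        }
        return values;
    }
}
```

Because the encoding is a bit-for-bit mapping, a round trip reproduces the original vector exactly, including negative values and non-integral floats.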
@jpountz I would appreciate your feedback on the API and implementation details.
Pinging @elastic/es-search-aggs
The field looks ok to me. The thing that I'm unsure about is exposing this
- Remove vector query
- Add documentation
- Add tests
@jpountz Hi Adrien, I tried to address your feedback with new commits. I have removed the vector query. Can you please continue to review when you have time available?
Thanks @mayya-sharipova. I did a first round and left some questions.
```
Number of dimensions in a vector can be arbitrary and can be
different across documents.

These vectors can be used for document scoring.
```
should we say _re_scoring?
I intended to use this field in scoring through script score functions. Do you think it qualifies as "scoring", or is it still "re-scoring"?
OK let's keep "scoring".
```
=== Dense vector datatype

A `dense_vector` field indexes dense vectors of float values.
Number of dimensions in a vector can be arbitrary and can be
```
Let's put an upper limit on the number of dimensions and document it?
Do we aim to support multi-valued fields?
```
[[dense-vector]]
=== Dense vector datatype

A `dense_vector` field indexes dense vectors of float values.
```
"indexes" might be a bit misleading since they are not searchable?
```java
@Override
public DocValueFormat docValueFormat(String format, DateTimeZone timeZone) {
    return DocValueFormat.BINARY;
```
let's build a custom format that returns a float[] and test docvalue_fields with such fields? Or alternatively raise an error for now saying this is not supported. I don't like using the BINARY format because this will return a base64 string of the internal representation of this vector. And that would be a breaking change if we ever decided to actually support docvalue_fields on such fields.
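To illustrate the concern: a generic BINARY format surfaces the internal byte representation as an opaque base64 string, while a vector-aware format would return readable floats. A standalone sketch (class and method names are invented for illustration, and the encoding is a simplified 4-bytes-per-float stand-in):

```java
import java.nio.ByteBuffer;
import java.util.Base64;

// Contrast of what docvalue_fields would return for a vector field
// with the generic BINARY format versus a float[]-aware format.
public final class DocValueFormatDemo {

    static byte[] encode(float[] values) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES * values.length);
        for (float v : values) {
            buf.putInt(Float.floatToIntBits(v));
        }
        return buf.array();
    }

    // BINARY format: an opaque base64 string of the internal bytes.
    public static String formatAsBinary(float[] vector) {
        return Base64.getEncoder().encodeToString(encode(vector));
    }

    // A vector-aware format would decode back to floats instead.
    public static float[] formatAsFloats(float[] vector) {
        byte[] bytes = encode(vector);
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        float[] out = new float[bytes.length / Integer.BYTES];
        for (int i = 0; i < out.length; i++) {
            out[i] = Float.intBitsToFloat(buf.getInt());
        }
        return out;
    }
}
```

The base64 string is only meaningful to code that knows the internal layout, which is why committing to it in responses would make later changes breaking.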
```java
if (buf.length < (offset + INT_BYTES)) {
    buf = ArrayUtil.grow(buf, (offset + INT_BYTES));
}
NumericUtils.intToSortableBytes(Float.floatToIntBits(value), buf, offset);
```
do we need sortability or are we using this existing method for convenience?
```java
        throw new IllegalArgumentException("[dense_vector] field takes an array of floats, but got unexpected token " + token);
    }
}
NumericUtils.intToSortableBytes(dim, buf, 0); // recording number of dimensions at the beginning
```
do we need this? we could figure out the number of dimensions by dividing the array length by 4?
```java
@Override
public void parse(ParseContext context) throws IOException {
    if (context.externalValueSet()) {
        throw new IllegalArgumentException("[sparse_vector] field can't be used in multi-fields");
```
👍
```java
        throw new IllegalArgumentException("[sparse_vector] field takes an object that maps a dimension number to a float, " +
            "but got unexpected token [" + token + "]");
    }
}
```
this performs in O(n^2) it seems. Should we collect values in the order in which they occur in the array and sort in the end?
However I agree that we should store dimensions in order
```java
}
NumericUtils.intToSortableBytes(dim, buf, 0); // recording number of dimensions at the beginning
BinaryDocValuesField field = new BinaryDocValuesField(fieldType().name(), new BytesRef(buf, 0, offset));
context.doc().addWithKey(fieldType().name(), field);
```
It seems like this shouldn't support multi-valued fields. You could do that by failing if `context.doc().getByKey` is not null.
```java
} catch (NumberFormatException e) {
    throw new IllegalArgumentException("[sparse_vector]'s dimensions should be integers represented as strings, but got ["
        + context.parser().currentName() + "]");
}
```
should we fail negative dimensions?
Also I'm wondering whether users would need more than e.g. 65k dimensions. If we decide that they don't, then we could encode the dimension on two bytes only.
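The two-byte dimension encoding suggested here could look like this (a hypothetical helper, assuming dimension numbers fit in an unsigned 16-bit range):

```java
// Hypothetical sketch: packing a dimension number into two bytes,
// assuming 0 <= dim < 65536 (the 1 << 16 limit discussed in this thread).
public final class DimCodec {

    public static void writeDim(byte[] buf, int offset, int dim) {
        if (dim < 0 || dim >= (1 << 16)) {
            throw new IllegalArgumentException("dimension out of range: " + dim);
        }
        buf[offset] = (byte) (dim >> 8); // high byte
        buf[offset + 1] = (byte) dim;    // low byte
    }

    public static int readDim(byte[] buf, int offset) {
        // Mask with 0xFF to undo Java's sign extension of bytes.
        return ((buf[offset] & 0xFF) << 8) | (buf[offset + 1] & 0xFF);
    }
}
```

The masking on read matters: without `& 0xFF`, any dimension above 32767 would decode as a negative number because Java bytes are signed.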
@jpountz Adrien, thanks for your feedback. I have tried to address all your comments, can you please continue to review when you have time available? Thank you
I was just catching up on this (very exciting) work and had a couple suggestions.
```java
        "[dense_vector] field has exceeded the maximum allowed number of dimensions of :[" + MAX_DIMS_COUNT + "]");
    }
} else {
    throw new IllegalArgumentException("[dense_vector] field takes an array of floats, but got unexpected token " + token);
```
Small comment: using `XContentParserUtils#ensureExpectedToken` could be nice here and in the check below, as it will include the location where parsing failed in the error message.
```java
// assert that after decoding the indexed value is equal to expected
BytesRef vectorBR = ((BinaryDocValuesField) fields[0]).binaryValue();
float[] decodedValues = DenseVectorFieldMapper.decodeVector(vectorBR);
```
I wonder if it'd make sense to have dedicated unit tests for vector encoding + decoding, as the logic is fairly subtle? With a unit test (as opposed to an integration test like this one), you could test many cases quickly, and perhaps even use randomized inputs.
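A randomized round-trip unit test along these lines could look like the sketch below (plain Java, with local `encode`/`decode` methods standing in for the mapper's actual encoding logic):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Random;

// Sketch of a randomized round-trip test for dense vector encoding.
// encode/decode are simplified stand-ins for the mapper's methods.
public final class VectorRoundTripTest {

    static byte[] encode(float[] values) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES * values.length);
        for (float v : values) buf.putInt(Float.floatToIntBits(v));
        return buf.array();
    }

    static float[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        float[] out = new float[bytes.length / Integer.BYTES];
        for (int i = 0; i < out.length; i++) out[i] = Float.intBitsToFloat(buf.getInt());
        return out;
    }

    // Encode then decode many random vectors and check exact equality.
    public static boolean runRandomizedRoundTrips(long seed, int iterations) {
        Random random = new Random(seed);
        for (int iter = 0; iter < iterations; iter++) {
            int dims = 1 + random.nextInt(500); // respect the 500-dimension cap
            float[] vector = new float[dims];
            for (int i = 0; i < dims; i++) {
                vector[i] = random.nextFloat() * 200 - 100; // mix of signs
            }
            if (!Arrays.equals(vector, decode(encode(vector)))) {
                return false;
            }
        }
        return true;
    }
}
```

A fixed seed keeps failures reproducible, which is the usual convention in Lucene/Elasticsearch randomized testing.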
```java
// insert dim into the sorted array of dims
// insert value in the array of values in the same position as the corresponding dim
public static void sortedInsertDimValue(float[] values, int[] dims, float value, int dim, final int dimIndex) {
```
Small comment: it could be nice to pull these methods into a dedicated class like `SparseVectorEncoder` to separate concerns a bit and allow for easy unit testing.
I left some more comments @mayya-sharipova.
```
A `dense_vector` field stores dense vectors of float values.
The maximum number of dimensions that can be in a vector should
not exceed 500. The number of dimensions can be
different across documents.
```
Let's mention it only supports single-valued fields? I.e. you can't do

```json
{
  "my_vector": [ [1, 2], [3, 4] ]
}
```

or even

```json
{
  "my_object": [
    { "my_vector": [1, 2] },
    { "my_vector": [3, 4] }
  ]
}
```
```
Internally, each document's dense vector will be encoded as a binary
doc value. Its size in bytes is equal to
`INT_BYTES * NUMBER_OF_DIMENSIONS`,
```
let's directly say `4 * NUMBER_OF_DIMENSIONS`?
```
Number of dimensions in a vector can be arbitrary and can be
different across documents.

These vectors can be used for document scoring.
```
ok let's keep scoring
```
Internally, each sparse document vector will be encoded as a binary
doc value with the size in bytes equal to
`(INT_BYTES * 2) * NUMBER_OF_DIMENSIONS`,
```
let's say directly 6 * NUMBER_OF_DIMENSIONS?
```java
dim = Integer.parseInt(context.parser().currentName());
if (dim < 0 || dim > MAX_DIMS_NUMBER) {
    throw new IllegalArgumentException("[sparse_vector]'s dimension number must be " +
        "a non-negative integer value not exceeding [" + MAX_DIMS_NUMBER + "]");
```
let's provide `dim` in the message?
```java
    }
} catch (NumberFormatException e) {
    throw new IllegalArgumentException("[sparse_vector]'s dimensions should be integers represented as strings, but got ["
        + context.parser().currentName() + "]");
```
can you add `e` as a cause of this exception?
```java
private static BytesRef encodeSparseVector(float[] values, int[] dims, int dimCount) {
    // 1. Sort dimensions in the ascending order and sort values in the same order as their corresponding dimensions
    // as we expect that data should be already provided as sorted by dim,
    // we expect just a single pass in this bubble sort algorithm
```
this expectation is maybe too strong given that hashes are not supposed to be ordered in JSON. FYI Lucene has utility classes to sort parallel arrays, see e.g. `InPlaceMergeSorter`, which also runs in linear time if the input is already sorted.
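A minimal version of the collect-then-sort idea, without the Lucene dependency (the real `InPlaceMergeSorter` lives in `org.apache.lucene.util`; this insertion-sort stand-in shares the useful property of being linear on already-sorted input):

```java
// Sketch: sort parallel dims/values arrays by ascending dimension,
// keeping each value aligned with its dimension. Insertion sort is
// O(n) when dims arrive already sorted, matching the "single pass"
// expectation in the comment above, and O(n^2) only in the worst case.
public final class ParallelSort {

    public static void sortByDim(int[] dims, float[] values, int count) {
        for (int i = 1; i < count; i++) {
            int d = dims[i];
            float v = values[i];
            int j = i - 1;
            // Shift larger dims right, moving values in lockstep.
            while (j >= 0 && dims[j] > d) {
                dims[j + 1] = dims[j];
                values[j + 1] = values[j];
                j--;
            }
            dims[j + 1] = d;
            values[j + 1] = v;
        }
    }
}
```

For example, the document `{"99": -0.5, "1": 0.5, "5": 1}` from the PR description would be reordered to dims `[1, 5, 99]` with values `[0.5, 1, -0.5]`.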
```java
public static final String CONTENT_TYPE = "sparse_vector";
public static int MAX_DIMS_COUNT = 500; // maximum allowed number of dimensions
public static int MAX_DIMS_NUMBER = 65000; // maximum allowed dimension's number
```
let's make it 65536? (1<<16)
```java
    buf[offset] = (byte) (dims[dim] >> 8);
    buf[offset+1] = (byte) dims[dim];
    offset += 2;
}
```
not a big deal but it feels a bit weird to me to encode values before dimensions, it's a bit like encoding a value before its key?
@jpountz @jtibshirani Thanks Adrien and Julie for the feedback. I have addressed it in the last commit. Can you please continue to review when you have time available?
Thanks @mayya-sharipova, I left some comments mostly around adding the field name to error messages, but it looks good to me overall.
modules/mapper-extras/build.gradle (outdated)

```
@@ -20,4 +20,4 @@
esplugin {
  description 'Adds advanced field mappers'
  classname 'org.elasticsearch.index.mapper.MapperExtrasPlugin'
}
```
let's undo changes to this file?
```java
/**
 * A {@link FieldMapper} for indexing a dense vector of floats.
 */

```
remove empty line?
```java
private DenseVectorFieldMapper(String simpleName, MappedFieldType fieldType, MappedFieldType defaultFieldType,
                               Settings indexSettings, MultiFields multiFields, CopyTo copyTo) {
    super(simpleName, fieldType, defaultFieldType, indexSettings, multiFields, copyTo);
    assert fieldType.indexOptions().compareTo(IndexOptions.DOCS_AND_FREQS) <= 0;
```
hmm we could actually assert that `fieldType.indexOptions() == IndexOptions.NONE` since this field is not indexed?
```java
@Override
public DocValueFormat docValueFormat(String format, DateTimeZone timeZone) {
    throw new UnsupportedOperationException("[dense_vector] field doesn't support doc values");
```
A better message might be to say that it doesn't support `docvalue_fields` and aggregations (which are the two features that use this API). Let's also include the field name in the message?
```java
public static final String CONTENT_TYPE = "dense_vector";
public static short MAX_DIMS_COUNT = 500; // maximum allowed number of dimensions
private static final byte INT_BYTES = 4;
```
let's remove this constant and use `Integer.BYTES` instead?
Thanks for the comment, Adrien! I was thinking, since we have already hardcoded the 4-byte and 2-byte calculations as below, wouldn't it be safer to keep these constants as `4` and `2`?

```java
buf[offset] = (byte) (intValue >> 24);
buf[offset+1] = (byte) (intValue >> 16);
```

But if it is fine, I can do as you suggested.
ok
```java
import org.apache.lucene.util.InPlaceMergeSorter;

// static utility functions for encoding and decoding dense_vector and sparse_vector fields
public final class VectorEncoderDecoder {
```
can it be made pkg-private?
```java
// static utility functions for encoding and decoding dense_vector and sparse_vector fields
public final class VectorEncoderDecoder {
    static final byte INT_BYTES = 4;
    static final byte SHORT_BYTES = 2;
```
let's use Integer.BYTES and Short.BYTES instead
```java
 * @param vectorBR - vector decoded in BytesRef
 */
static int[] decodeSparseVectorDims(BytesRef vectorBR) {
    int dimCount = (vectorBR.length - vectorBR.offset) / (INT_BYTES + SHORT_BYTES);
```
no need to subtract offset here, `length` is the number of bytes rather than the end offset
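The offset/length distinction can be demonstrated with a tiny slice type mirroring Lucene's `BytesRef` contract, where `length` counts the bytes in the slice rather than marking an end position (the class itself is an illustrative stand-in, not Lucene code):

```java
// Minimal model of Lucene's BytesRef: a byte array plus offset and length,
// where length is a byte count for the slice, not an end offset.
public final class ByteSlice {
    public final byte[] bytes;
    public final int offset;
    public final int length;

    public ByteSlice(byte[] bytes, int offset, int length) {
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    // Correct: the dimension count uses length directly
    // (4 bytes per value + 2 bytes per dimension = 6 bytes per entry).
    public int sparseDimCount() {
        return length / (Integer.BYTES + Short.BYTES);
    }

    // The bug the review points out: subtracting offset again
    // undercounts whenever the slice does not start at position 0.
    public int buggySparseDimCount() {
        return (length - offset) / (Integer.BYTES + Short.BYTES);
    }
}
```

For a slice of 12 bytes starting at offset 6, the correct count is 2 entries, while the buggy version computes (12 - 6) / 6 = 1.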
```java
 * @param vectorBR - vector decoded in BytesRef
 */
static float[] decodeSparseVector(BytesRef vectorBR) {
    int dimCount = (vectorBR.length - vectorBR.offset) / (INT_BYTES + SHORT_BYTES);
```
no need to subtract vectorBR.offset
```java
        0.001f
    );
}
}
```
let's test indexing illegal vectors and check error messages?
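Such a test could follow this shape (a sketch with a hypothetical `validateDense` helper that enforces the 500-dimension limit discussed in this PR; the real mapper does this during document parsing):

```java
// Hypothetical validation mirroring the dense_vector limit from this PR:
// at most 500 dimensions per vector.
public final class VectorValidator {
    public static final int MAX_DIMS_COUNT = 500;

    public static void validateDense(float[] vector) {
        if (vector.length > MAX_DIMS_COUNT) {
            throw new IllegalArgumentException(
                "[dense_vector] field has exceeded the maximum allowed number of dimensions of ["
                + MAX_DIMS_COUNT + "]; got [" + vector.length + "]");
        }
    }

    // Test helper: returns true only if validation fails AND the error
    // message contains the expected fragment, as the review suggests checking.
    public static boolean failsWithMessage(float[] vector, String fragment) {
        try {
            validateDense(vector);
            return false;
        } catch (IllegalArgumentException e) {
            return e.getMessage().contains(fragment);
        }
    }
}
```

Asserting on the message text (not just the exception type) is what catches regressions like a missing field name or limit value in the error.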
@jpountz Thanks for the feedback, Adrien. I have addressed all your comments (except `Integer.BYTES`, which I can address as well if necessary).
LGTM, thanks @mayya-sharipova for the iterations and sorry it took so long!
```java
public static final String CONTENT_TYPE = "dense_vector";
public static short MAX_DIMS_COUNT = 500; // maximum allowed number of dimensions
private static final byte INT_BYTES = 4;
```
ok
@elasticmachine run the gradle build tests 1
@jpountz thanks a lot for the review
Introduce dense and sparse vector fields that can be used for document scoring.
This PR only introduces indexing of these fields.
A later PR will add script functions that can calculate the distance between
a given query vector and an indexed document vector.
1. Dense vector
2. Sparse vector
3. Implementation details
3.1 Dense Vector
3.2 Sparse Vector
Relates to #31615