
Vector field #33022

Merged: 12 commits merged into elastic:master on Dec 13, 2018

Conversation

@mayya-sharipova (Contributor) commented Aug 21, 2018

Introduce dense and sparse vector fields that can be used for document scoring.
This PR only introduces indexing of these fields. A later PR will add script
functions that can calculate the distance between a given query vector and the
indexed document vector.

1. Dense vector

PUT dindex
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_vector": {
          "type": "dense_vector"
        },
        "my_text" : {
          "type" : "keyword"
        }
      }
    }
  }
}

PUT dindex/_doc/1
{
  "my_text" : "text1",
  "my_vector" : [ 0.5, 10, 6 ]
}

2. Sparse vector

PUT sindex
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_vector": {
          "type": "sparse_vector"
        },
        "my_text" : {
          "type" : "keyword"
        }
      }
    }
  }
}

PUT sindex/_doc/1
{
  "my_text" : "text1",
  "my_vector" : {"1": 0.5, "99": -0.5,  "5": 1}
}

3. Implementation details

3.1 Dense vector

  • BinaryDocValuesField
  • byte array ->
    • array of 4-byte integers (encoded array of float values; see the sketch below)

3.2 Sparse vector

  • BinaryDocValuesField
  • byte array ->
    • array of 4-byte integers (encoded array of float values)
    • array of 2-byte values (encoded array of integer dimensions)
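
As a rough illustration of the dense encoding (a minimal sketch of the scheme
described above, not the PR's exact code; the helper names are illustrative):
each float is converted to its raw int bits and written big-endian on 4 bytes,
and decoding reverses the process.

// Sketch: encode a dense float vector into the byte array stored in a
// BinaryDocValuesField; each float takes Integer.BYTES (4) bytes.
static byte[] encodeDenseVector(float[] values) {
    byte[] buf = new byte[Integer.BYTES * values.length];
    int offset = 0;
    for (float value : values) {
        int intValue = Float.floatToIntBits(value);
        buf[offset] = (byte) (intValue >> 24);
        buf[offset + 1] = (byte) (intValue >> 16);
        buf[offset + 2] = (byte) (intValue >> 8);
        buf[offset + 3] = (byte) intValue;
        offset += Integer.BYTES;
    }
    return buf;
}

// Sketch: decode the byte array back into the float vector; the number of
// dimensions is the byte length divided by Integer.BYTES.
static float[] decodeDenseVector(byte[] buf) {
    float[] values = new float[buf.length / Integer.BYTES];
    for (int i = 0, offset = 0; i < values.length; i++, offset += Integer.BYTES) {
        int intValue = ((buf[offset] & 0xFF) << 24)
            | ((buf[offset + 1] & 0xFF) << 16)
            | ((buf[offset + 2] & 0xFF) << 8)
            | (buf[offset + 3] & 0xFF);
        values[i] = Float.intBitsToFloat(intValue);
    }
    return values;
}

The sparse encoding follows the same scheme, with each entry's dimension
number additionally encoded on two bytes.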

Relates to #31615

@mayya-sharipova (Contributor, Author):

@jpountz I would appreciate your feedback on the API and implementation details.
This is a WIP: it doesn't contain tests or documentation yet; the goal is to get feedback.

@mayya-sharipova added the :Search/Ranking (Scoring, rescoring, rank evaluation) label on Aug 21, 2018
@elasticmachine (Collaborator): Pinging @elastic/es-search-aggs

@jpountz (Contributor) commented Aug 23, 2018

The field looks OK to me. The thing I'm unsure about is exposing this vector query: the query doesn't care much about matching, only about scoring. If you want to find the top documents sorted by cosine similarity across the whole index, Elasticsearch will run a linear scan. This is why I initially suggested exposing it as a rescorer instead. If the concern is that rescorers can't run on all hits, maybe there are other options we could think about, like having some companion utility of function_score (or its replacement(s)) that would make it easy to fold cosine similarity into the score?
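
One hypothetical shape for such a companion utility, sketched as a
function_score script (the cosineSimilarity script function here is
illustrative and not part of this PR; as noted later in the thread, such
script functions are planned for a later PR):

GET dindex/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "my_text": "text1" } },
      "script_score": {
        "script": {
          "source": "cosineSimilarity(params.query_vector, doc['my_vector'])",
          "params": { "query_vector": [0.5, 10, 6] }
        }
      }
    }
  }
}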

@mayya-sharipova (Contributor, Author):

@jpountz Hi Adrien, I tried to address your feedback with new commits. I have removed the vector query, and this PR is now only concerned with introducing two new field types: dense_vector and sparse_vector. Later I plan to introduce functions for the script score query that will do distance calculations based on these fields.

Could you please continue the review when you have time?

@jpountz (Contributor) left a comment:

Thanks @mayya-sharipova. I did a first round and left some questions.

Number of dimensions in a vector can be arbitrary and can be
different across documents.

These vectors can be used for document scoring.
@jpountz: should we say _re_scoring?

@mayya-sharipova: I intended to use this field in scoring through script score functions. Do you think that qualifies as "scoring", or is it still "re-scoring"?

@jpountz: OK, let's keep "scoring".

=== Dense vector datatype

A `dense_vector` field indexes dense vectors of float values.
Number of dimensions in a vector can be arbitrary and can be
@jpountz: Let's put an upper limit on the number of dimensions and document it?

@jpountz: Do we aim to support multi-valued fields?

[[dense-vector]]
=== Dense vector datatype

A `dense_vector` field indexes dense vectors of float values.
@jpountz: "indexes" might be a bit misleading, since they are not searchable?


@Override
public DocValueFormat docValueFormat(String format, DateTimeZone timeZone) {
    return DocValueFormat.BINARY;
@jpountz: Let's build a custom format that returns a float[] and test docvalue_fields with such fields? Or alternatively, raise an error for now saying this is not supported. I don't like using the BINARY format, because it will return a base64 string of the internal representation of this vector, and that would be a breaking change if we ever decided to actually support docvalue_fields on such fields.

if (buf.length < (offset + INT_BYTES)) {
    buf = ArrayUtil.grow(buf, (offset + INT_BYTES));
}
NumericUtils.intToSortableBytes(Float.floatToIntBits(value), buf, offset);
@jpountz: Do we need sortability, or are we using this existing method for convenience?

throw new IllegalArgumentException("[dense_vector] field takes an array of floats, but got unexpected token " + token);
}
}
NumericUtils.intToSortableBytes(dim, buf, 0); //recording number of dimensions at the beginning
@jpountz: Do we need this? We could figure out the number of dimensions by dividing the array length by 4.

@Override
public void parse(ParseContext context) throws IOException {
    if (context.externalValueSet()) {
        throw new IllegalArgumentException("[sparse_vector] field can't be used in multi-fields");
@jpountz: 👍

throw new IllegalArgumentException("[sparse_vector] field takes an object that maps a dimension number to a float, " +
"but got unexpected token [" + token + "]");
}
}
@jpountz: This performs in O(n^2), it seems. Should we collect values in the order in which they occur in the array and sort at the end?

@jpountz: However, I agree that we should store dimensions in order.

}
NumericUtils.intToSortableBytes(dim, buf, 0); //recording number of dimensions at the beginning
BinaryDocValuesField field = new BinaryDocValuesField(fieldType().name(), new BytesRef(buf, 0, offset));
context.doc().addWithKey(fieldType().name(), field);
@jpountz: It seems like this shouldn't support multi-valued fields. You could do that by failing if context.doc().getByKey is not null.
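
A minimal sketch of that guard (the placement and message are illustrative,
not the PR's final code):

// Fail if this document already has a value for this vector field.
if (context.doc().getByKey(fieldType().name()) != null) {
    throw new IllegalArgumentException("[sparse_vector] field [" + fieldType().name() +
        "] doesn't support indexing multiple values for the same field in one document");
}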

} catch (NumberFormatException e) {
    throw new IllegalArgumentException("[sparse_vector]'s dimensions should be integers represented as strings, but got ["
        + context.parser().currentName() + "]");
}
@jpountz: Should we fail negative dimensions?

@jpountz: Also, I'm wondering whether users would need more than e.g. 65k dimensions. If we decide they don't, then we could encode the dimension on two bytes only.
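
A sketch of that two-byte encoding (the helper names are illustrative): the
dimension number is written big-endian on two bytes, which works as long as
dimension numbers are capped below 1 << 16.

// Write a dimension number on 2 bytes (big-endian).
static void writeDim(int dim, byte[] buf, int offset) {
    buf[offset] = (byte) (dim >> 8);
    buf[offset + 1] = (byte) dim;
}

// Read a 2-byte dimension number back as a non-negative int.
static int readDim(byte[] buf, int offset) {
    return ((buf[offset] & 0xFF) << 8) | (buf[offset + 1] & 0xFF);
}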

@mayya-sharipova (Contributor, Author):

@jpountz Adrien, thanks for your feedback. I have tried to address all your comments; could you please continue the review when you have time? Thank you.

@jtibshirani (Contributor) left a comment:

I was just catching up on this (very exciting) work and had a couple suggestions.

"[dense_vector] field has exceeded the maximum allowed number of dimensions of :[" + MAX_DIMS_COUNT + "]");
}
} else {
throw new IllegalArgumentException("[dense_vector] field takes an array of floats, but got unexpected token " + token);
@jtibshirani: Small comment: using XContentParserUtils#ensureExpectedToken could be nice here and in the check below, as it will include the location where parsing failed in the error message.


// assert that after decoding the indexed value is equal to expected
BytesRef vectorBR = ((BinaryDocValuesField) fields[0]).binaryValue();
float[] decodedValues = DenseVectorFieldMapper.decodeVector(vectorBR);
@jtibshirani: I wonder if it'd make sense to have dedicated unit tests for vector encoding and decoding, as the logic is fairly subtle? With a unit test (as opposed to an integration test like this one), you could test many cases quickly, and perhaps even use randomized inputs.
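
A sketch of what such a randomized round-trip test could look like (the
encodeDenseVector helper is hypothetical; decodeVector and MAX_DIMS_COUNT
appear in the PR, and the random* helpers come from ESTestCase):

public void testDenseVectorEncodingRoundTrip() {
    // Random vector with a random, in-bounds number of dimensions.
    int dimCount = randomIntBetween(0, DenseVectorFieldMapper.MAX_DIMS_COUNT);
    float[] expected = new float[dimCount];
    for (int i = 0; i < dimCount; i++) {
        expected[i] = randomFloat();
    }
    // Encoding and then decoding should return the original values exactly,
    // since the float <-> int-bits conversion is lossless.
    BytesRef encoded = new BytesRef(encodeDenseVector(expected));
    float[] decoded = DenseVectorFieldMapper.decodeVector(encoded);
    assertArrayEquals(expected, decoded, 0.0f);
}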


// insert dim into the sorted array of dims
// insert value in the array of values in the same position as the corresponding dim
public static void sortedInsertDimValue(float[] values, int[] dims, float value, int dim, final int dimIndex) {
@jtibshirani: Small comment: it could be nice to pull these methods into a dedicated class like SparseVectorEncoder, to separate concerns a bit and allow for easy unit testing.

@jpountz (Contributor) left a comment:

I left some more comments @mayya-sharipova.

A `dense_vector` field stores dense vectors of float values.
The maximum number of dimensions that can be in a vector should
not exceed 500. The number of dimensions can be
different across documents.
@jpountz: Let's mention it only supports single-valued fields? I.e., you can't do

{
  "my_vector": [ [1, 2], [3, 4] ]
}

or even

{
  "my_object":  [
    {
      "my_vector": [1,2]
    },
    {
      "my_vector": [3,4]
    }
  ]
}


Internally, each document's dense vector will be encoded as a binary
doc value. Its size in bytes is equal to
`INT_BYTES * NUMBER_OF_DIMENSIONS`,
@jpountz: Let's directly say 4 * NUMBER_OF_DIMENSIONS?



Internally, each sparse document vector will be encoded as a binary
doc value with the size in bytes equal to
`(INT_BYTES * 2) * NUMBER_OF_DIMENSIONS`,
@jpountz: Let's say directly 6 * NUMBER_OF_DIMENSIONS?

dim = Integer.parseInt(context.parser().currentName());
if (dim < 0 || dim > MAX_DIMS_NUMBER) {
    throw new IllegalArgumentException("[sparse_vector]'s dimension number must be " +
        "a non-negative integer value not exceeding [" + MAX_DIMS_NUMBER + "]");
@jpountz: Let's provide dim in the message?

}
} catch (NumberFormatException e) {
    throw new IllegalArgumentException("[sparse_vector]'s dimensions should be integers represented as strings, but got ["
        + context.parser().currentName() + "]");
@jpountz: Can you add e as a cause of this exception?

private static BytesRef encodeSparseVector(float[] values, int[] dims, int dimCount) {
// 1. Sort dimensions in the ascending order and sort values in the same order as their corresponding dimensions
// as we expect that data should be already provided as sorted by dim,
// we expect just a single pass in this bubble sort algorithm
@jpountz: This expectation is maybe too strong, given that hashes are not supposed to be ordered in JSON. FYI, Lucene has utility classes to sort parallel arrays; see e.g. InPlaceMergeSorter, which also runs in linear time if the input is already sorted.
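
A minimal sketch of sorting the parallel dims/values arrays with Lucene's
InPlaceMergeSorter, as suggested (the array names follow the
encodeSparseVector signature quoted above):

// Sort dimensions in ascending order while keeping each value aligned
// with its dimension; runs in linear time if the input is already sorted.
new InPlaceMergeSorter() {
    @Override
    protected int compare(int i, int j) {
        return Integer.compare(dims[i], dims[j]);
    }

    @Override
    protected void swap(int i, int j) {
        int tmpDim = dims[i];
        dims[i] = dims[j];
        dims[j] = tmpDim;
        float tmpValue = values[i];
        values[i] = values[j];
        values[j] = tmpValue;
    }
}.sort(0, dimCount);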


public static final String CONTENT_TYPE = "sparse_vector";
public static int MAX_DIMS_COUNT = 500; //maximum allowed number of dimensions
public static int MAX_DIMS_NUMBER = 65000; //maximum allowed dimension's number
@jpountz: Let's make it 65536? (1 << 16)

buf[offset] = (byte) (dims[dim] >> 8);
buf[offset+1] = (byte) dims[dim];
offset += 2;
}
@jpountz: Not a big deal, but it feels a bit weird to me to encode values before dimensions; it's a bit like encoding a value before its key.

@mayya-sharipova (Contributor, Author) commented Nov 30, 2018

@jpountz @jtibshirani Thanks Adrien and Julie for the feedback. I have addressed it in the last commit. Could you please continue the review when you have time?

@jpountz (Contributor) left a comment:

Thanks @mayya-sharipova, I left some comments mostly around adding the field name to error messages, but it looks good to me overall.

@@ -20,4 +20,4 @@
esplugin {
description 'Adds advanced field mappers'
classname 'org.elasticsearch.index.mapper.MapperExtrasPlugin'
}
}
@jpountz: Let's undo changes to this file?

/**
* A {@link FieldMapper} for indexing a dense vector of floats.
*/

@jpountz: Remove empty line?

private DenseVectorFieldMapper(String simpleName, MappedFieldType fieldType, MappedFieldType defaultFieldType,
        Settings indexSettings, MultiFields multiFields, CopyTo copyTo) {
    super(simpleName, fieldType, defaultFieldType, indexSettings, multiFields, copyTo);
    assert fieldType.indexOptions().compareTo(IndexOptions.DOCS_AND_FREQS) <= 0;
@jpountz: Hmm, we could actually assert that fieldType.indexOptions() == IndexOptions.NONE, since this field is not indexed?


@Override
public DocValueFormat docValueFormat(String format, DateTimeZone timeZone) {
    throw new UnsupportedOperationException("[dense_vector] field doesn't support doc values");
@jpountz: A better message might be to say that it doesn't support docvalue_fields and aggregations (which are the two features that use this API). Let's also include the field name in the message?


public static final String CONTENT_TYPE = "dense_vector";
public static short MAX_DIMS_COUNT = 500; //maximum allowed number of dimensions
private static final byte INT_BYTES = 4;
@jpountz: Let's remove this constant and use Integer.BYTES instead?

@mayya-sharipova: Thanks for the comment, Adrien! I was thinking that since we have already hard-coded the 4- and 2-byte calculations as below, wouldn't it be safer to keep these constants as 4 and 2?

buf[offset] =  (byte) (intValue >> 24);
buf[offset+1] = (byte) (intValue >> 16);

But if it is fine, I can do as you suggested.

@jpountz: OK.

import org.apache.lucene.util.InPlaceMergeSorter;

// static utility functions for encoding and decoding dense_vector and sparse_vector fields
public final class VectorEncoderDecoder {
@jpountz: Can it be made pkg-private?

// static utility functions for encoding and decoding dense_vector and sparse_vector fields
public final class VectorEncoderDecoder {
static final byte INT_BYTES = 4;
static final byte SHORT_BYTES = 2;
@jpountz: Let's use Integer.BYTES and Short.BYTES instead.

 * @param vectorBR - vector decoded in BytesRef
 */
static int[] decodeSparseVectorDims(BytesRef vectorBR) {
    int dimCount = (vectorBR.length - vectorBR.offset) / (INT_BYTES + SHORT_BYTES);
@jpountz: No need to subtract offset here; length is the number of bytes rather than the end offset.

 * @param vectorBR - vector decoded in BytesRef
 */
static float[] decodeSparseVector(BytesRef vectorBR) {
    int dimCount = (vectorBR.length - vectorBR.offset) / (INT_BYTES + SHORT_BYTES);
@jpountz: No need to subtract vectorBR.offset.

0.001f
);
}
}
@jpountz: Let's test indexing illegal vectors and check error messages?

@mayya-sharipova (Contributor, Author):

@jpountz Thanks for the feedback, Adrien. I have addressed all your comments (except Integer.BYTES, which I can address as well if necessary).

@jpountz (Contributor) left a comment:

LGTM, thanks @mayya-sharipova for the iterations and sorry it took so long!



@mayya-sharipova (Contributor, Author):

@elasticmachine run the gradle build tests 1

@mayya-sharipova merged commit b5d532f into elastic:master on Dec 13, 2018
@mayya-sharipova (Contributor, Author):

@jpountz thanks a lot for the review
