diff --git a/docs/sphinx/source/query-builder.ipynb b/docs/sphinx/source/query-builder.ipynb new file mode 100644 index 000000000..e56e2c26e --- /dev/null +++ b/docs/sphinx/source/query-builder.ipynb @@ -0,0 +1,489 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "WORK_IN_PROGRESS\n", + "\n", + "\n", + " \n", + " \n", + " \"#Vespa\"\n", + "\n", + "\n", + "# pyvespa Query builder\n", + "\n", + "The Query Builder in pyvespa provides a more 'pythonic' way of building YQL query strings and this guide goes through constructing queries for the [Covid-19 Vespa instance](https://cord19.vespa.ai/) and the [vespa search instance](https://api.search.vespa.ai)\n", + "\n", + "While simple queries can be built with string formatting; building complex queries that contain a lot of dynamic parameters can get difficult and this is where the query builder comes in handy.\n", + "\n", + "The query builder is only responsible to build a YQL string which then can be used for querying using various methods (see query notebook).\n", + "\n", + "First lets install pyvespa that supports query builder:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install pyvespa>=0.52.0" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from vespa.application import Vespa\n", + "from vespa import querybuilder as qb\n", + "\n", + "app = Vespa(url=\"https://api.cord19.vespa.ai\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "At an abstract level a query can be divided into 2 parts: \n", + " - `QueryField`: the fields part of the schema and needs to be queried on. For example in the Covid-19 schema `title`, `abstract`, etc can be used as fields.\n", + " - `Condition`: Logic operators used to combine the queryfields (and, or, etc)\n", + "\n", + "First lets take a look at the schema of the documents we will be querying on. And the best way to do this, is to use the `True` condition:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "yql_query = qb.select(\"*\").from_(\"doc\").where(True)\n", + "print(f\"Built query: {yql_query}\")\n", + "resp = app.query(yql=yql_query, hits=1)\n", + "resp.hits[0][\"fields\"].keys()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Do note that retrival from multiple sources can also be done if a tuple is provied to `from_()`: example: `from_(sd1, sd2) `.\n", + "\n", + "Lets start with something simple: building a query for fetching documents with a specific keyword in the title.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "title_field = qb.QueryField(\"title\")\n", + "condition = title_field.contains(\"vaccine\")\n", + "yql_query = qb.select([\"title\", \"abstract\"]).from_(\"doc\").where(condition)\n", + "print(f\"Built query: {yql_query}\")\n", + "resp = app.query(yql=yql_query, hits=1)\n", + "resp.hits" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "While the `select()` accepts a list of QueryFields you can also give a string." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "yql_query = qb.select(\"title, abstract\").from_(\"doc\").where(condition)\n", + "print(f\"Built query: {yql_query}\")\n", + "resp = app.query(yql=yql_query, hits=1)\n", + "resp.hits" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The same query can be extented by chaining it with the `set_timeout()` to change the default timeout:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "yql_query = qb.select(\"title, abstract\").from_(\"doc\").where(condition).set_timeout(70)\n", + "print(f\"Built query: {yql_query}\")\n", + "resp = app.query(yql=yql_query, hits=1)\n", + "resp.hits" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Which can then further be chained with other parameters like `limit()`, `orderByAsc()` or just `add_parameter()`; And you don't have to worry about the order of the function chain because the query builder will safely handel the order and generate a valid YQL string. Some parameter options:\n", + " - `set_offset()`\n", + " - `set_limit()`\n", + " - `set_timeout()`\n", + " - `orderByDesc()`\n", + " - `orderByAsc()`\n", + " - `add_parameter()`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "yql_query = (\n", + " qb.select(\"title, abstract\")\n", + " .from_(\"doc\")\n", + " .where(condition)\n", + " .set_timeout(70)\n", + " .set_limit(2)\n", + " .orderByAsc(qb.QueryField(\"timestamp\"))\n", + ")\n", + "print(f\"Built query: {yql_query}\")\n", + "resp = app.query(yql=yql_query, hits=2)\n", + "resp.hits" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And when there is a need to add some annotations to a conditions or a field; you can do that with `annotate()`\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "condition = title_field.contains(\"vaccine\")\n", + "annotated_condtition = condition.annotate({\"highlight\": True})\n", + "yql_query = qb.select(\"title\").from_(\"doc\").where(annotated_condtition)\n", + "print(f\"Built query: {yql_query}\")\n", + "resp = app.query(yql=yql_query, hits=1)\n", + "resp.hits" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also invert conditions after building them by just adding the `~` symbol in front of the condition." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "condition = ~condition\n", + "yql_query = qb.select(\"title, abstract\").from_(\"doc\").where(condition)\n", + "print(f\"Built query: {yql_query}\")\n", + "resp = app.query(yql=yql_query, hits=1)\n", + "resp.hits" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now lets say you wanted to query over multiple fields and combine them in a logical way along with mixing, matching logic operators:\n", + " - `all()` for logical AND.\n", + " - `any()` for local OR." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "title = qb.QueryField(\"title\")\n", + "abstract = qb.QueryField(\"abstract\")\n", + "source = qb.QueryField(\"source\")\n", + "\n", + "condition_1 = title.contains(\"vaccine\")\n", + "condition_2 = abstract.contains(\"bad\")\n", + "condition_3 = source.contains(\"PMC\")\n", + "condition_4 = title.contains(\"anti\")\n", + "condition = qb.all(condition_1, ~condition_2, qb.any(condition_3, ~condition_4))\n", + "\n", + "yql_query = qb.select([\"title\", \"abstract\"]).from_(\"doc\").where(condition)\n", + "print(f\"Built query: {yql_query}\")\n", + "resp = app.query(yql=yql_query, hits=1)\n", + "resp.hits" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also build the same query using the logical operators without `any()` or `all()` but by using:\n", + " - `&` for AND\n", + " - `|` for OR" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "title = qb.QueryField(\"title\")\n", + "abstract = qb.QueryField(\"abstract\")\n", + "source = qb.QueryField(\"source\")\n", + "\n", + "condition_1 = title.contains(\"vaccine\")\n", + "condition_2 = abstract.contains(\"bad\")\n", + "condition_3 = source.contains(\"PMC\")\n", + "condition_4 = title.contains(\"anti\")\n", + "\n", + "condition = (condition_1 & ~condition_2) & (condition_3 | ~condition_4)\n", + "yql_query = qb.select([\"title\", \"abstract\"]).from_(\"doc\").where(condition)\n", + "print(f\"Built query: {yql_query}\")\n", + "resp = app.query(yql=yql_query, hits=1)\n", + "resp.hits" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When you do want to query over numeric fields. Query builder provides both: the logical operator method (`>`, `<` and `=`) or the comparison methods: `ge()`, `le()` or `eq()`\n", + "\n", + "Or just use `in_range()` which can be used to achieve similar results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "timestamp = qb.QueryField(\"timestamp\")\n", + "id_filed = qb.QueryField(\"id\")\n", + "\n", + "condition_1 = (timestamp > 1554501500) & (timestamp < 1554501700)\n", + "condition_2 = id_filed.in_range(11105, 11305)\n", + "condition_3 = timestamp.eq(0)\n", + "condition = qb.any(condition_1, condition_2) | condition_3\n", + "yql_query = qb.select([\"title\", \"abstract\"]).from_(\"doc\").where(condition)\n", + "print(f\"Built query: {yql_query}\")\n", + "resp = app.query(yql=yql_query, hits=2)\n", + "resp.hits" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And if you dont like either of these approches, you can just use regex/fuzzy matching using `matches()`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "condition = qb.QueryField(\"title\").matches(\"covid\")\n", + "yql_query = qb.select([\"title\", \"abstract\"]).from_(\"doc\").where(condition)\n", + "print(f\"Built query: {yql_query}\")\n", + "resp = app.query(yql=yql_query, hits=1)\n", + "resp.hits" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The true power comes when you build each condition separately, enhancing long-term maintainability and readability.\n", + "\n", + "QB also offers a `userQuery()` which can be used as a condition for when you want to keep the query itself independent." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "condition = qb.userQuery()\n", + "yql_query = qb.select(\"title, abstract\").from_(\"doc\").where(condition)\n", + "print(f\"Built query: {yql_query}\")\n", + "resp = app.query(yql=yql_query, query=\"guards adults too, for now\", hits=1)\n", + "resp.hits" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Query builder also provides a `in_()` which can be used to query multiple values in a single field." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "id_field = qb.QueryField(\"id\")\n", + "condition = id_field.in_(11205, 81788, 117939, 287995)\n", + "yql_query = qb.select(\"title, id\").from_(\"doc\").where(condition)\n", + "resp = app.query(yql=yql_query, hits=4)\n", + "resp.hits" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And the same method can be used if the query field was of a string data type:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "source_field = qb.QueryField(\"source\")\n", + "condition = source_field.in_(\"PMC\", \"Medline\")\n", + "yql_query = qb.select(\"title, source\").from_(\"doc\").where(condition)\n", + "resp = app.query(yql=yql_query, hits=2)\n", + "resp.hits" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The query builder provides the `wand()` interface along with adding appropriate annotations as well (refer [wand documentation](hhttps://docs.vespa.ai/en/using-wand-with-vespa.html#wand)):\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "condition = qb.wand(\n", + " \"title\",\n", + " weights={\"alphavirus\": 1, \"vaccine\": 2},\n", + " annotations={\"scoreThreshold\": 0.13, \"targetHits\": 2},\n", + ")\n", + "yql_query = qb.select(\"title, source\").from_(\"doc\").where(condition)\n", + "resp = app.query(yql=yql_query, hits=2)\n", + "\n", + "resp.hits\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now thats how you do the basic querying using the query builder. But obviously, the real world queries tend to be very complex with grouping, nearest Neighbour, dot products etc. And the query builder provides a clean interface for all of them as well. Again, it only generates a syntactically correct YQL string. All checks like using the right field, or the operator will have to be managed on the cleint side.\n", + "\n", + "For the usage of these methods, we move to the Vespa Search application which has a much more complex schema:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "search_app = Vespa(url=\"https://api.search.vespa.ai\")\n", + "\n", + "\n", + "resp = search_app.query(yql=\"select * from documentation where true limit 2\")\n", + "resp.hits\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "condition = qb.nearestNeighbor(\n", + " field=\"text_embedding\", query_vector=\"q\", annotations={\"targetHits\": 5}\n", + ")\n", + "\n", + "yql_query = qb.select(\"*\").from_(\"documentation\").where(condition)\n", + "print(f\"Built query: {yql_query}\")\n", + "\n", + "resp = search_app.query(\n", + " yql=yql_query,\n", + " ranking=\"hybrid2\",\n", + " query=\"How can the query builder be used to create a query?\",\n", + " body={\"input.query(q)\": \"embed(@query)\"},\n", + ")\n", + "resp.hits\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now all of these operations can also be combined with the `nearestNeighbor()`. The `annotations` can be used to modify the behaviour. And this one is always required because this is also how we set the `targetHits` which defaults to 10. The query builder is a perfect fit here because to achieve the same YQL query with python formatting you will have to escape the curly brackets which sometimes can become messy. Now all of that is replaced with a clean API." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Find messages where terms appear close to each other\n", + "text = qb.QueryField(\"text\")\n", + "condition = text.contains(qb.near(\"send\", \"qps\", distance=5))\n", + "yql_query = qb.select([\"message_id\", \"text\"]).from_(\"documentation\").where(condition)\n", + "print(f\"Built query: {yql_query}\")\n", + "\n", + "resp = search_app.query(yql=yql_query, hits=2)\n", + "resp.hits\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}