This example demonstrates how to build an index for a GitHub repository using CocoIndex.
- We will ingest a GitHub repository.
- For each file, we will chunk it (with Tree-sitter) and then embed each chunk.
- We will save the embeddings and the metadata in Postgres with PGVector.
- Create a .env file from .env.example, and fill in the configuration for your GitHub app.
Note: You need to configure the GitHub source with your repository details:
- repo_name: The GitHub repository name (e.g., "owner/repo-name")
- branch: The branch to index (e.g., "main")
- private_key_path: Path to your private key for authentication
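As a rough illustration of the three settings above, here is a minimal sketch that reads them from environment-style variables and checks their shape before any indexing starts. The variable names (GITHUB_REPO_NAME, GITHUB_BRANCH, GITHUB_PRIVATE_KEY_PATH) and the GitHubSourceConfig class are hypothetical and are not taken from .env.example or from the CocoIndex API; use whatever names your .env actually defines.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class GitHubSourceConfig:
    """Hypothetical container for the three settings described above."""
    repo_name: str          # "owner/repo-name"
    branch: str             # e.g. "main"
    private_key_path: str   # path to the private key used for authentication


def load_github_config(env=None) -> GitHubSourceConfig:
    """Read and validate the settings from a mapping (defaults to os.environ).

    The variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    repo_name = env.get("GITHUB_REPO_NAME", "").strip()
    branch = env.get("GITHUB_BRANCH", "main").strip()
    private_key_path = env.get("GITHUB_PRIVATE_KEY_PATH", "").strip()

    # repo_name must have exactly one "/" separating a non-empty owner and repo.
    owner, sep, repo = repo_name.partition("/")
    if not (sep and owner and repo) or "/" in repo:
        raise ValueError(
            f"repo_name must look like 'owner/repo-name', got {repo_name!r}"
        )
    if not branch:
        raise ValueError("branch must not be empty")
    if not private_key_path:
        raise ValueError("private_key_path must not be empty")
    return GitHubSourceConfig(repo_name, branch, private_key_path)
```

Failing early on a malformed repository name or a missing key path tends to give a clearer error than a failed authentication later in the pipeline.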
To search the index, we embed the user-provided text with the same embedding operation used in the indexing flow, then match it against the stored embeddings with a SQL query.
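The query side can be sketched roughly as follows. This is not the code in main.py: the table name code_embeddings and the column names filename, code, and embedding are hypothetical, and the query vector is assumed to have already been produced by the same embedding operation used for indexing. The only pgvector feature relied on is the `<=>` operator, which computes cosine distance.

```python
def to_pgvector_literal(vector):
    """Format a sequence of numbers as a pgvector text literal, e.g. '[0.1,2.0]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


def build_search_query(table_name: str) -> str:
    """Return a parameterized SQL string that ranks rows by cosine distance.

    Placeholders (%s) are left for the query vector and the row limit, so the
    database driver handles value quoting. The table name is interpolated
    directly, so it is restricted to letters, digits, and underscores.
    """
    if not table_name.replace("_", "").isalnum():
        raise ValueError(f"unsafe table name: {table_name!r}")
    return (
        "SELECT filename, code, embedding <=> %s::vector AS distance "
        f"FROM {table_name} "
        "ORDER BY distance "
        "LIMIT %s"
    )


# Example: the parameters a driver such as psycopg would receive.
sql = build_search_query("code_embeddings")      # hypothetical table name
params = (to_pgvector_literal([0.1, 0.2, 0.3]), 5)
```

A smaller distance means a closer match, so the rows come back most similar first; `1 - distance` gives the cosine similarity if a score is preferred.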
Install Postgres if you don't already have it.
- Install dependencies:
  pip install -e .
- Setup:
  cocoindex setup main.py
- Update index:
  cocoindex update main.py
- Run:
  python main.py
I used CocoInsight (currently in free beta) to troubleshoot the index generation and understand the data lineage of the pipeline. It connects to your local CocoIndex server and retains none of your pipeline data. Run the following command to start CocoInsight:
cocoindex server -ci main.py
Then open the CocoInsight UI at https://cocoindex.io/cocoinsight.