wikidist

Published: May 9, 2021 License: MIT

README

Demo: https://wikidist.rpelliard.com

About this project

How many links do you need to click to go from one Wikipedia article to another? Wikidist answers this question and tells you precisely on which links to click in order to achieve this.

You can use it as a guessing game. Or just to satisfy your curiosity :-)

[Example screenshot]

Architecture overview

Wikidist has four major components, presented in the following sections.

[Architecture diagram]

Graph database

The database stores a graph in which each vertex is a page from Wikipedia, and each (oriented) edge is a link from one page to another.

For this project, we chose dgraph, a relatively young but fast-growing graph database engine, because:

  • using a graph database (instead of a relational one) greatly simplifies computing the shortest path between two pages
  • dgraph is written in Go, and thus naturally comes with a first-class Go client API
  • dgraph's performance exceeds the project's requirements, even when run on a single node
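
In Go terms, each vertex can be pictured as a struct along these lines (a minimal sketch: the field names mirror the JSON returned by the API below, while the actual definition lives in the repository):

package db

// Article is an illustrative sketch of a vertex: a Wikipedia page together
// with its outgoing links. The crawler creates pages as "not visited" and
// flips the flag once their links have been fetched.
type Article struct {
    UID            string    `json:"uid,omitempty"`
    Title          string    `json:"title,omitempty"`
    Visited        bool      `json:"visited,omitempty"`
    LinkedArticles []Article `json:"linked_articles,omitempty"`
}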

Crawler

Written in Go, code located in cmd/crawler.

The crawler, well, crawls a given instance of Wikipedia, and stores the visited or to-be-visited pages in the database.

It takes a starting page as one of its arguments, and uses the Wikipedia API to fetch the links from that page. It then creates all the corresponding vertices in the database, marking them as 'not visited', and proceeds to visit one of them. It keeps going until no unvisited vertex is left.

The crawler uses a multi-worker architecture: a queue is filled with yet-to-visit articles from the database, and the workers consume articles from the queue and fetch their linked articles from the Wikimedia API. The linked articles are then written back to the database, creating new edges and new unvisited articles that will in turn be added to the queue. This architecture allows efficient crawling of the graph, with multiple articles queried at the same time; the bottleneck is actually the API's rate limiting.
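
A minimal sketch of this flow, assuming hypothetical ArticleStore and Fetcher interfaces in place of the real database and Wikipedia-API code from the repository:

package crawler

import "log"

// ArticleStore and Fetcher are hypothetical stand-ins used only to
// illustrate the queue/worker flow described above.
type ArticleStore interface {
    NextUnvisited(n int) []string                 // yet-to-visit titles read back from the DB
    SaveLinks(title string, links []string) error // mark visited, create unvisited link targets
}

type Fetcher interface {
    Links(title string) ([]string, error) // outgoing links, fetched via the Wikipedia API
}

// feed keeps the work queue topped up from the database.
// (The real crawler also handles waiting and shutdown when nothing is left.)
func feed(store ArticleStore, queue chan<- string) {
    for {
        for _, title := range store.NextUnvisited(100) {
            queue <- title
        }
    }
}

// worker consumes queued titles, fetches their links and writes them back.
func worker(fetch Fetcher, store ArticleStore, queue <-chan string) {
    for title := range queue {
        links, err := fetch.Links(title)
        if err != nil {
            log.Printf("fetch %q: %v", title, err) // the real crawler deals with API rate limiting
            continue
        }
        if err := store.SaveLinks(title, links); err != nil {
            log.Printf("save %q: %v", title, err)
        }
    }
}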

We assume that Wikipedia is a strongly connected graph, meaning that we will reach every page starting from any page. This assumption may not be true, but is reasonable given that the users will most likely search for "popular" pages with many links.

API server

Written in Go, code located in cmd/api.

The API server receives requests from the front-end, fetches the necessary information from the database, and responds with the corresponding data.

It has three endpoints.
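
A rough sketch of how such a server can wire its routes with the standard library (handler bodies omitted; the real implementation is in cmd/api):

package main

import (
    "log"
    "net/http"
)

func main() {
    mux := http.NewServeMux()
    // The three endpoints described below; the real handlers query dgraph.
    mux.HandleFunc("/search", func(w http.ResponseWriter, r *http.Request) { /* find by title */ })
    mux.HandleFunc("/search-uid", func(w http.ResponseWriter, r *http.Request) { /* find by uid */ })
    mux.HandleFunc("/shortest", func(w http.ResponseWriter, r *http.Request) { /* shortest path */ })

    // Port 8081, as noted in the Installation section.
    log.Fatal(http.ListenAndServe(":8081", mux))
}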

/search

Usage:

POST /search {"search": "<your_search>", "depth": <depth>}

Result:

[
    {
        "title": "Title",
        "uid": "0xbeef",
        "linked_articles":[
            {...}
        ]
    },
    {...}
]

The search endpoint is meant to be used to find an article in the database given its title. It returns its title, uid and linked_articles. You can select the depth of the result (in terms of linked_articles). By default, the depth is 0 (i.e. there won't be any linked_articles in the result).

Warning: This endpoint may return multiple results, as the back-end fetches all articles at a (Levenshtein) distance of 2 from the requested title. This feature can be useful to correct typing errors or to offer an autocomplete functionality.
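
For example, a depth-1 search for "Alan Turing" could be issued from Go roughly like this (a sketch; the host and port assume the local setup described under Installation):

package main

import (
    "fmt"
    "io"
    "net/http"
    "strings"
)

func main() {
    // POST /search with a title and the desired depth of linked_articles.
    body := strings.NewReader(`{"search": "Alan Turing", "depth": 1}`)
    resp, err := http.Post("http://localhost:8081/search", "application/json", body)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    out, _ := io.ReadAll(resp.Body)
    fmt.Println(string(out)) // JSON array of matching articles, as shown above
}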

/search-uid

Usage:

POST /search-uid {"search": "<uid>", "depth": <depth>}

Result:

[
    {
        "title": "Title",
        "uid": "0xbeef",
        "linked_articles":[
            {...}
        ]
    },
    {...}
]

The search-uid endpoint is meant to be used to find an article in the database given its uid. It returns its title, uid and linked_articles. You can select the depth of the result (in terms of linked_articles). By default, the depth is 0 (i.e. there won't be any linked_articles in the result).

This endpoint is mainly used to find the neighbors of an article.

/shortest

Usage:

GET /shortest?from=<source-uid>&to=<target-uid>

Result:

[
    {
        "title": "Source",
        "uid": "0xbeef"
    },
    {...},
    {
        "title": "Target",
        "uid": "0xdeaf"
    }
]

The shortest endpoint is meant to be used to find the shortest path between two articles of the database.

It returns the titles and uids of every article along the route, source and target included.
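
Under the hood, dgraph can compute such a path natively. A sketch of an equivalent query issued through the dgo client (import paths depend on the dgo version used; the uids and the linked_articles predicate name are illustrative, following the examples above):

package main

import (
    "context"
    "fmt"

    "github.com/dgraph-io/dgo/v200"
    "github.com/dgraph-io/dgo/v200/protos/api"
    "google.golang.org/grpc"
)

func main() {
    // Connect to dgraph alpha's gRPC port (9080, see the Installation section).
    conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
    if err != nil {
        panic(err)
    }
    defer conn.Close()
    dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

    // Shortest path between two uids, following linked_articles edges.
    query := `{
      path as shortest(from: 0xbeef, to: 0xdeaf) {
        linked_articles
      }
      path(func: uid(path)) {
        title
        uid
      }
    }`
    resp, err := dg.NewReadOnlyTxn().Query(context.Background(), query)
    if err != nil {
        panic(err)
    }
    fmt.Println(string(resp.Json)) // titles and uids along the route
}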

Front-end

Code located in frontend.

The front-end is implemented with the Vue.js framework. Graphs are rendered using d3.js. It also uses TypeScript, which brings static typing to JavaScript.

The codebase is quite simple: business logic is located in src/services/wikidist.service.ts, while rendering and UI reactions are managed by components. Type definitions live in src/types/articles.ts.

ESLint and Prettier run alongside Vue to ensure consistent code formatting and presentation.

Installation

Requirements

To run this project, you will need to have the following applications installed:

  • docker
  • docker-compose
  • node + yarn
  • go

Make sure that ports 5080, 6080, 8000, 8080, 8085 and 9080 are not currently in use on your machine.

The front-end will be accessible on your local port 8085. The database console will be accessible on local port 8000.

The easy way

The project is provided with a docker-compose file for an easy and fast installation. You can run the entire project minus the crawler by simply running:

docker-compose up -d

This will launch five containers:

  • dgraph's database driver
  • dgraph's API server
  • dgraph's frontend, Ratel
  • wikidist's API
  • wikidist's front-end

To launch the crawler, see the Crawler section below.

This is the easiest way to test the project. However, if you really want to go deeper into the project, you can also install the different components by hand.

The long way

Dgraph

Use docker-compose to deploy dgraph:

docker-compose up -d dgraph-server dgraph-ratel dgraph-zero

This will launch the three containers used by dgraph:

  • dgraph's database driver
  • dgraph's API server
  • dgraph's frontend, Ratel

If for some reason you want to install it without using docker-compose, follow the instructions in dgraph's official documentation.

API

The API server uses port 8081. To launch it:

cd cmd/api
go build
./api

Front-end

Configuration

By default, the front-end tries to contact a locally run API server (localhost:8081).

If you want your local front-end to contact a remote API server (with an already running and filled dgraph database), edit frontend/.env to:

VUE_APP_API_URL="https://wikidist.rpelliard.com/api"

Serving files

For development, you can serve the front-end with:

cd frontend
yarn install
yarn serve

Making a production build

For production, you should build static files using:

yarn build

Crawler

The crawler only needs to be executed once in order to fill the database.

To launch the crawler:

cd cmd/crawler
go build
./crawler <prefix> <start> <nb_worker>

  • prefix: prefix of the Wikipedia instance to be crawled (e.g. 'en', 'fr', etc.)
  • start: exact title of the page from which to start crawling (e.g. Alan Turing)
  • nb_worker: number of concurrent workers to use while crawling (e.g. 5) - the optimal value depends on the setup (you should use monitoring tools to help you choose an appropriate one)

For example:

./crawler fr "Alan Turing" 5

Monitoring

The crawling performance is critical for this project considering the number of articles to fetch (for example, the French version of Wikipedia has upwards of 2 million articles). It was therefore essential to identify and fix bottlenecks to be able to crawl the whole graph in a reasonable amount of time.

We used Datadog to analyze metrics like queue length, articles fetched per second and cache hit ratio. Displaying those metrics in a meaningful and useful way helped us a lot in improving the crawling rate.

[Monitoring dashboard screenshot]

Testing

Unit tests

The fetcher part of package crawler is unit tested.

100% of the statements in fetcher.go are covered.
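
To give an idea of what these tests look like, here is a self-contained sketch in the same spirit: it fakes the Wikipedia API with httptest and checks a hypothetical parseLinks helper (not the project's actual code) against a canned response:

package crawler

import (
    "encoding/json"
    "io"
    "net/http"
    "net/http/httptest"
    "testing"
)

// parseLinks is a hypothetical stand-in for the parsing done by the real
// fetcher: it extracts linked-article titles from a MediaWiki-style reply.
func parseLinks(body []byte) ([]string, error) {
    var reply struct {
        Query struct {
            Pages map[string]struct {
                Links []struct {
                    Title string `json:"title"`
                } `json:"links"`
            } `json:"pages"`
        } `json:"query"`
    }
    if err := json.Unmarshal(body, &reply); err != nil {
        return nil, err
    }
    var titles []string
    for _, page := range reply.Query.Pages {
        for _, link := range page.Links {
            titles = append(titles, link.Title)
        }
    }
    return titles, nil
}

func TestParseLinks(t *testing.T) {
    // Fake Wikipedia API serving a canned response.
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        io.WriteString(w, `{"query":{"pages":{"1":{"links":[{"title":"Enigma machine"}]}}}}`)
    }))
    defer srv.Close()

    resp, err := http.Get(srv.URL)
    if err != nil {
        t.Fatal(err)
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        t.Fatal(err)
    }

    titles, err := parseLinks(body)
    if err != nil {
        t.Fatal(err)
    }
    if len(titles) != 1 || titles[0] != "Enigma machine" {
        t.Fatalf("unexpected titles: %v", titles)
    }
}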

To run tests:

cd pkg/crawler
go test .

To show test coverage:

cd pkg/crawler
go test -coverprofile=coverage.out
go tool cover -html=coverage.out

CI

After each push, the CI (GitHub Actions) runs the tests and reports on the corresponding pull request whether they pass.

The actions executed by the CI are defined in .github/workflows/test.yml.

Trello / Scrum

For this project, we used GitHub pull requests instead of Trello. The pull requests can be browsed in the project's GitHub repository.
