WebCrawler

The objective is to implement a Web crawler with a web-based interface.

Site management

The application should allow a user to keep track of website records to crawl. For each website record the user can specify:

URL (where the crawler starts)
Boundary RegExp (a found link is followed only if it matches this expression)
Periodicity (how often the site should be crawled, e.g. every hour)
Label (a user-given label)
Active / Inactive (inactive records are not crawled periodically)
Tags (user-given strings)

The application should implement common CRUD operations.

The user can see website records in a paginated view. The view can be filtered by URL, Label, and/or Tags. The view can be sorted by URL or by the last time a site was crawled. The view must contain the Label, Periodicity, Tags, time of the last execution, and the status of the last execution.
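The filtering, sorting, and pagination described above can be sketched in memory as follows (the record fields and the function name are illustrative assumptions, not prescribed by the assignment):

```python
def list_records(records, url=None, label=None, tags=None,
                 sort_by="url", page=1, page_size=10):
    """Filter, sort, and paginate website records (in-memory sketch)."""
    result = records
    if url is not None:
        result = [r for r in result if url in r["url"]]
    if label is not None:
        result = [r for r in result if label in r["label"]]
    if tags:
        result = [r for r in result if set(tags) <= set(r["tags"])]
    # Sort by URL, or by the time of the last crawl (never-crawled first).
    if sort_by == "url":
        result = sorted(result, key=lambda r: r["url"])
    else:
        result = sorted(result, key=lambda r: r.get("last_crawl") or "")
    start = (page - 1) * page_size
    return result[start:start + page_size]
```

In a real application the same filtering and paging would be pushed down to the database query rather than done in memory.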

Execution management

Each active website record is crawled based on its Periodicity, and each run creates a new execution. For example, if the Periodicity is an hour, the executor tries to crawl the site every hour, i.e. roughly last execution time + 60 minutes. You may measure from either the start or the end of the last execution. While measuring from the start may allow executions to overlap, it does not matter here. If an active record has no execution yet, crawling is started as soon as possible; this should be implemented using some sort of a queue.
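The scheduling rule above can be sketched as a priority queue keyed by due time (the field names and minute-based periodicity are assumptions for illustration):

```python
import heapq
from datetime import datetime, timedelta

def next_run(last_start, periodicity_minutes, now):
    """Next due time: last execution start + periodicity; ASAP if never run."""
    if last_start is None:
        return now
    return last_start + timedelta(minutes=periodicity_minutes)

def build_queue(records, now):
    """Priority queue of (due_time, record_id); inactive records are skipped."""
    queue = []
    for r in records:
        if r["active"]:
            due = next_run(r.get("last_start"), r["periodicity"], now)
            heapq.heappush(queue, (due, r["id"]))
    return queue
```

A scheduler thread would then pop the earliest entry, wait until its due time, run the execution, and push the record back with its next due time.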

A user can list all executions, or filter the executions of a single website record. In both cases, the list must be paginated. The list must contain the website record's label, the execution status, the start/end time, and the number of sites crawled. A user can manually start an execution for a given website record. When a website record is deleted, all its executions and related data are removed as well.

Executor

The executor is responsible for executing, i.e. crawling, the selected websites. The crawler downloads a website and looks for all hyperlinks. For each detected hyperlink that matches the website record's Boundary RegExp, the crawler also crawls the given page. For each crawled website it creates a record with the following data:

URL
Crawl time
Title (the HTML title of the page)
Links (the list of outgoing hyperlinks)

Crawled data are stored as part of the website record, so the old data are lost once a new execution finishes successfully. It must be possible to run multiple executions at once.
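The crawling loop can be sketched as a breadth-first search bounded by the Boundary RegExp. To keep the sketch testable offline, the page download is injected as a `fetch(url)` function; a real executor would use an HTTP client and would also record the crawl time for each page:

```python
import re
from collections import deque
from urllib.parse import urljoin

HREF_RE = re.compile(r'href="([^"]+)"')

def crawl(start_url, boundary_regexp, fetch):
    """BFS crawl starting at start_url; fetch(url) returns the page HTML.
    Returns {url: record} with the data listed above (crawl time omitted)."""
    boundary = re.compile(boundary_regexp)
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        links = [urljoin(url, h) for h in HREF_RE.findall(html)]
        title_m = re.search(r"<title>(.*?)</title>", html)
        pages[url] = {"url": url,
                      "title": title_m.group(1) if title_m else None,
                      "links": links}
        for link in links:
            # Follow only links matching the Boundary RegExp.
            if boundary.match(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Note that all detected links are stored in the record even when they fall outside the boundary; only the decision to crawl further is restricted, which is what the visualisation below relies on.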

When crawling a website, do not query a host faster than one request per second. This rule may be applied only within the scope of one crawling thread; in fact, you can implement it so that a crawler thread executes at most one query per second.
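One way to meet the per-thread limit is to sleep between consecutive requests; a minimal sketch:

```python
import time

class ThreadRateLimiter:
    """At most one request per min_interval seconds within a single thread."""
    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only for the remainder of the interval since the last request.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

The crawler thread would call `wait()` immediately before each HTTP request.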

BONUS: Update the executor so that no host is queried more than once per second, regardless of the number of executor threads. At the same time, an executor thread can make multiple requests to different servers per second.
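This bonus can be sketched with a shared map of per-host time slots guarded by a lock; the sleep happens outside the lock so requests to other hosts are not blocked (a sketch of one possible design, not the required one):

```python
import threading
import time
from urllib.parse import urlparse

class HostRateLimiter:
    """Global limiter shared by all executor threads: each host gets at
    most one request per min_interval seconds, while different hosts can
    be queried concurrently."""
    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_free = {}  # host -> earliest time of the next allowed request

    def acquire(self, url):
        host = urlparse(url).netloc
        with self._lock:
            now = time.monotonic()
            # Reserve the next free one-second slot for this host.
            slot = max(now, self._next_free.get(host, now))
            self._next_free[host] = slot + self.min_interval
        # Sleep outside the lock so other hosts are not delayed.
        time.sleep(max(0.0, slot - time.monotonic()))
```

Every executor thread calls `acquire(url)` before each request; the lock only protects the bookkeeping, so the limiter scales with the number of threads.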

Visualisation

For the selected website records (the active selection), the user can view a map of crawled pages as a graph. Nodes are websites/domains. There is a directed edge (connection) from one node to another if a hyperlink connects them in the given direction. The graph should also contain nodes for websites/domains that were not crawled due to a Boundary RegExp restriction. Those nodes have different visuals so they can be easily identified.

A user can switch between website view and domain view. In the website view, every website is represented by a node. In the domain view, all nodes from a given domain (use the full domain name) are replaced by a single node.
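Collapsing the website view into the domain view amounts to mapping every node URL to its full domain name and de-duplicating the resulting edges; a sketch:

```python
from urllib.parse import urlparse

def to_domain_view(edges):
    """Collapse website nodes into domain nodes: all nodes with the same
    full domain name become one node; duplicate edges are merged."""
    domain = lambda url: urlparse(url).netloc
    return {(domain(a), domain(b)) for a, b in edges}
```

Edges between two pages of the same domain become a self-loop on that domain node, which the visualisation may choose to hide.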

By double-clicking a node, the user can open the node detail. For crawled nodes, the detail contains the URL, crawl time, and a list of the website records that crawled the given node. The user can start a new execution for one of the listed website records. For other nodes, the detail contains only the URL, and the user can create and execute a new website record. The newly created website record is automatically added to the active selection, and the mode is changed to live.

The visualisation can be in live or static mode. In static mode, data are not refreshed. In live mode, data are periodically updated based on new executions for the active selection.

If a single node is crawled by multiple executions from active selection, data from the latest execution are used for the detail.

Use the page title or URL, in that order of preference, as a node label. For domain nodes, use the URL.
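The labelling rule can be stated as a small function (the node structure is assumed for illustration):

```python
def node_label(node, is_domain=False):
    """Page title if available, else the URL; domain nodes always use the URL."""
    if is_domain or not node.get("title"):
        return node["url"]
    return node["title"]
```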

BONUS: Make sure that the position of the nodes in the visualisation does not change dramatically on update. If you utilize force-based layout, do not recompute all positions on every update.

API

The website record and execution CRUD must be exposed using an HTTP-based API documented using OpenAPI / Swagger.

Crawled data of all website records can be queried using GraphQL. The GraphQL model must "implement" the following schema:


type Query{
    websites: [WebPage!]!
    nodes(webPages: [ID!]): [Node!]!
}

type WebPage{
    identifier: ID!
    label: String!
    url: String!
    regexp: String!
    tags: [String!]!
    active: Boolean!
}

type Node{
    title: String
    url: String!
    crawlTime: String
    links: [Node!]!
    owner: [WebPage!]!
}
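For illustration, a query a client might send against this schema (the identifier value is hypothetical):

```graphql
query {
  websites {
    identifier
    label
    url
  }
  nodes(webPages: ["1"]) {
    title
    url
    crawlTime
    links { url }
    owner { identifier }
  }
}
```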
      

BONUS: You can replace GraphQL with MCP. Keep in mind that you will need to showcase the MCP functionality using your own computer.

Deployment

The whole application can be deployed using docker-compose with a proper .env file.


git clone ...
docker compose up
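A compose file for such a deployment might look roughly like this; the service names, images, folder layout, and variable names are assumptions for illustration, not part of the assignment:

```yaml
services:
  web:
    build: ./frontend          # hypothetical folder layout
    ports:
      - "3000:3000"
  api:
    build: ./backend
    env_file: .env             # e.g. database credentials
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}   # supplied via the .env file
```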
    

Presentation

As part of your project, you need to record a roughly 5-minute video presenting your project. You are free to use any format you see fit: you can focus on code, on functionality, or just create a marketing video. You must link the video from the README.md file.

Others