- Implemented a web crawler that fetches data about different cars from a car information website
- Stored the structured car data into `ElasticSearch`, which runs as a web service in a Docker container
- Implemented a simple webpage that reads data from `ElasticSearch` and displays it in a data grid
- Used RPC for data communication among the `engine` instance, the `ItemSaver` instance, and the distributed `Worker` instances
- Official website: https://golang.org/. Download the installation package from the website and install it.
- Confirm `go` has been successfully installed by typing in `go version`; it should print something like `go version go1.13.7 darwin/amd64`.
- Turn on `GO111MODULE` and install the `goimports` dependency by typing in `go env -w GO111MODULE=on` and then `go get -v golang.org/x/tools/cmd/goimports`.
- Install all `go`-related extensions in Visual Studio Code.
- Create a `go.mod` file under `./` by typing in `go mod init FundamentalGrammer`. Suppose we have a simple hello-world Go file called `basic.go` in the `FundamentalGrammer` directory; we can run the file under `./` (which contains `go.mod`) by typing `go run FundamentalGrammer/basic.go`, which prints `Hello World`.
- Set up a proxy in Mainland China. Go to https://github.com/goproxy/goproxy.cn/blob/master/README.zh-CN.md. The default proxy is `GOPROXY="https://proxy.golang.org,direct"`. Type in `go env -w GOPROXY=https://goproxy.cn,direct`.
- Install the gin framework by entering the command `go get -u github.com/gin-gonic/gin`.
- Install the zap library by entering the command `go get -u go.uber.org/zap`.
- Docker is written in the Go programming language. Go to https://hub.docker.com/editions/community/docker-ce-desktop-mac to download a stable binary of the Docker software.
- In Mainland China, we need to use a registry mirror to pull Docker images. In Docker Desktop -> Preferences, go to the Docker Engine tab and add `"registry-mirrors": ["http://f1361db2.m.daocloud.io"]` as a new item in the JSON object. When running `docker info`, the new registry mirror should be listed. Please go to https://www.daocloud.io/mirror#accelarator-doc for more details.
- Install Elastic Search in Docker by typing the following command in Terminal: `docker pull docker.elastic.co/elasticsearch/elasticsearch:7.6.1`
- Run Elastic Search in single-node mode by typing the following command in Terminal: `docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.6.1`
- Delete an index from the Elastic Search Docker container by executing `curl -XDELETE 'localhost:9200/car_profile'`.
- In Elastic Search, an `index` acts as a DB name plus a table, and an `id` acts like the key of an entry in the table. The idea of a `type` was removed in Elastic Search 7.
- Create a `/SingleThreadCrawler` directory to store the source code.
- Diagrams of the system
- Fetcher implementation details
- Install the Go Text library by entering `go get -u golang.org/x/text` in the command line.
- The Kanji characters are decoded with the wrong encoding, so we need to do a conversion. Call `transform.NewReader` on the original response body to convert it from `GBK` to `UTF-8`.
- For scalability, we also need to install the Go Net library by entering `go get -u golang.org/x/net`. This library offers functionality to detect the encoding of an HTML text.
- Create a new `determineEncoding` function that takes in a response body `io.Reader` and returns an `encoding.Encoding` that includes the decoder format.
-
Parser Implementation Diagram
- Wrap the `parser` functionality as a struct in `engine/types`. Create the struct with a parse-function object and the name of the function. Expose a method to parse contents, which returns Items and more URLs from the parse function.
- Merge the functionality of the Fetcher and Parser into a `worker` function in `engine`.
- For the concurrent web crawler, we will implement a scheduler that schedules `worker`s in the `engine`'s `Run` function.
- The `engine` will send `Request`s to the scheduler, and the scheduler will coordinate `worker`s to send requests and parse information. Please refer to the diagram below.
- A simple scheduler creates a `goroutine` for each `Request` and uses a single shared worker channel that all the `goroutine`s write into. Please refer to the diagram below.
- A queued scheduler sets up two queues, one for workers and the other for requests. When a new worker or request comes in, it adds that worker or request to the back of its queue. When a worker should act on a request, we pop the front items from both the request queue and the worker queue and feed the request into the worker item, which is a channel of `Request`; in `engine/worker`, the worker function then fetches and parses the request. Please refer to the diagram below.
- The following diagram illustrates how `ItemSaver` works in the project architecture.
- To install the Elastic Search client library, go to https://github.com/olivere/elastic. Type in `go get github.com/olivere/elastic/v7` to install the Elastic Search 7 client library.
- Create a `save()` function in `persist/itemsaver.go` to save crawled items into the Elastic Search system.
- Architecture for UI Display
-
The following diagram shows the architecture of a distributed crawler.
- Currently, data flows through channels within a single instance of the crawler. Next, we will use an RPC client and an RPC server to split the `ItemSaver` logic into distributed services. Please refer to the architecture below.
- We need to put `ItemSaver` into a separate service and expose an RPC call (`ItemSaverService.Save`) to the main engine.
- We also need to put the `worker` into a separate service and expose an RPC call (`CrawlService.Process`) to the main engine. However, the data communicated between `CrawlService` and the engine needs to be serialized and deserialized. Please refer to the diagram below.
- To make the `worker` distributed, we implement a `createClientPool()` function that makes an array of `rpc.Client`s (each client holds a host corresponding to an existing worker instance). In a goroutine, we feed these clients one by one into a channel of `*rpc.Client`. On the worker client side, the `CreateProcessor` method listens to this channel, picks up a `*rpc.Client` whenever one is available, and passes data to one worker server instance through RPC.
- To run the distributed web crawler, first go to the `DistributedCrawler` directory by running `cd DistributedCrawler`.
- Start an `ItemSaver` server by running `go run persist/server/itemsaver.go --port=1234`.
- Start two `Worker` server instances by running `go run worker/server/worker.go --port=9000` and `go run worker/server/worker.go --port=9001`.
- Start the engine instance by running `go run main.go --itemsaver_host=":1234" --worker_hosts=":9000,:9001"`.
- Both `Worker` instances should be able to fetch data from the xcar website, and the engine can receive the data and pass the `CarDetail` information to `ItemSaver` to store into `ElasticSearch`.