Installation

 

RuiJi.Net is a distributed crawler framework written in C#. The ultimate goal of the project is to automatically monitor and crawl a large number of websites. Users set the check interval for each news source; when a source is updated, the updated addresses are sent to a download node, and an extraction node then extracts and cleans the data.

RuiJi.Net manages cookies itself and automatically maintains separate cookies for different browser identities, which means you can virtualize any number of cookies per independent IP. RuiJi.Net can also poll across the IP addresses available on the crawl server, and a proxy server can be used as well.

RuiJi.Net has its own extraction model called the RuiJi expression. You can use RuiJi expressions to define the extraction model and precisely clean the metadata to be extracted. RuiJi expressions can be stored in text documents, databases, or caches.

Start

RuiJi.Net has three modes of operation: local mode, pseudo-distributed mode, and fully distributed mode.

You need to get the source code from GitHub or Gitee and compile the RuiJi.Net project.

You can also download a compiled build from the network disk at https://pan.baidu.com/s/1xZFIGT1FF_toXzs42qPLUw (access password: ef8d).

 

If you compile the project yourself, follow these steps:

  1. Download the ZIP package from https://github.com/zhupingqi/RuiJi.Net or https://gitee.com/zhupingqi/RuiJi.Net and extract it to your destination folder.
  2. Open the project with Visual Studio 2017, restore the dependencies, and compile the project.
  3. Go to the RuiJi.Net.Cmd directory and publish RuiJi.Net.Cmd

dotnet publish RuiJi.Net.Cmd.csproj

  4. Go to the release folder and run the following command to start the project.

dotnet RuiJi.Net.Cmd.dll

Local mode

The local mode of RuiJi.Net is typically used for demonstrations and testing.

  1. First make sure the ruiji.json configuration file is in the run directory.
  2. Configure ruiji.json as follows (these are the defaults if you have not modified them):

{
  "setting": {
    "ruiJiServer": "localhost:36000",
    "docServer": "localhost:80"
  },
  "nodes": []
}

  3. Run dotnet RuiJi.Net.Cmd.dll

If you see the following message, congratulations, the startup is successful.

Hosting environment: Production
......
Now listening on: http://192.168.31.32:80
Application started. Press Ctrl+C to shut down.
Hosting environment: Production
......
Now listening on: http://192.168.31.32:36000
Application started. Press Ctrl+C to shut down.
2018-07-29 11:19:52,015 [1] INFO  - 192.168.31.32:36000 feed scheduler starting
2018-07-29 11:19:52,030 [1] INFO  - 192.168.31.32:36000 feed scheduler started
2018-07-29 11:19:52,033 [1] INFO  - Start WebApiServer At http://192.168.31.32:36000 with STANDALONE node
......

Enter the address shown in the console into your browser. For example, if you enter http://192.168.31.32:36000, you will see a management page similar to the following:

http://www.ruijihg.com/wp-content/uploads/2018/06/4-3.png

Pseudo distribution mode

The pseudo-distributed mode requires ZooKeeper. Install ZooKeeper as follows:

  1. Visit  http://mirrors.hust.edu.cn/apache/zookeeper/zookeeper-3.4.12/
  2. Download the ZooKeeper archive and extract it to the running directory of RuiJi.Net

If you need to change ZooKeeper's data directory, modify dataDir in conf/zoo.cfg under the ZooKeeper directory.

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=e:\\zookeeper\\data
#dataLogDir=e:\\zookeeper\\log
# the port at which the clients will connect
clientPort=2181
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1

  3. Add zkPath to ruiji.json as follows, where the value is the folder name of ZooKeeper inside the RuiJi.Net running directory. If you previously configured local mode, you can comment out ruiJiServer.
  4. Add the following configuration under the nodes node.

{
  "setting": {
    "zkPath": "zookeeper-3.4.12",
    "zkServer": "localhost:2181"
  },
  "nodes": [
    { "baseUrl": "localhost:36000", "type": "cp" },
    { "baseUrl": "localhost:37000", "type": "ep" },
    { "baseUrl": "localhost:38000", "type": "fp" },
    { "baseUrl": "localhost:36001", "type": "c", "proxy": "localhost:36000" },
    { "baseUrl": "localhost:36002", "type": "c", "proxy": "localhost:36000" },
    { "baseUrl": "localhost:37001", "type": "e", "proxy": "localhost:37000" },
    { "baseUrl": "localhost:37002", "type": "e", "proxy": "localhost:37000" },
    { "baseUrl": "localhost:38001", "type": "f", "proxy": "localhost:38000" },
    { "baseUrl": "localhost:38002", "type": "f", "proxy": "localhost:38000" }
  ]
}

  5. Start RuiJi.Net. You will see a lot of log output: the program starts ZooKeeper first, and then starts each node in turn according to the configuration file.

xxxx Start WebApiServer At http://192.168.31.196:37000 with ExtractorPROXY node
......
xxxx Start WebApiServer At http://192.168.31.196:36001 with CRAWLER node
......
please press e to exit!

If you see the last line above, you can open a browser and enter any node URL from the console output to access the RuiJi.Net management website.

In the pseudo-distributed and fully distributed modes, the management website's navigation menu gains Node and Cluster tabs.

http://www.ruijihg.com/wp-content/uploads/2018/06/3-3.png

Full distribution mode

The fully distributed mode builds on the pseudo-distributed configuration. You deploy different nodes to different servers, and specify the server where ZooKeeper runs in each node's settings. Suppose we deploy a RuiJi.Net cluster on four machines (A, B, C, and D): machine A runs all proxy nodes (cp, ep, fp) and the ZooKeeper service, while the other three machines run the crawl nodes (c), extract nodes (e), and feed update-detection nodes (f) respectively.

  1. The configuration of server A (IP 192.168.101.10) is as follows:

{
  "setting": {
    "zkPath": "zookeeper-3.4.12",
    "zkServer": "localhost:2181"
  },
  "nodes": [
    { "baseUrl": "localhost:36000", "type": "cp" },
    { "baseUrl": "localhost:37000", "type": "ep" },
    { "baseUrl": "localhost:38000", "type": "fp" }
  ]
}

RuiJi.Net does not have to use self-managed ZooKeeper: you can deploy ZooKeeper on any server and simply point each node at its location. Alternatively, you can have one node run self-managed ZooKeeper and configure the other nodes to use it. If your server has multiple IP addresses, it is recommended to set zkServer and all baseUrls under the node settings to one specific IP address, so that other nodes know exactly where the ZooKeeper server is.

  2. The configuration of server B is as follows:

{
  "setting": {
    "zkServer": "192.168.101.10:2181"
  },
  "nodes": [
    { "baseUrl": "localhost:36001", "type": "c", "proxy": "192.168.101.10:36000" },
    { "baseUrl": "localhost:36002", "type": "c", "proxy": "192.168.101.10:36000" }
  ]
}

  3. The configuration of server C is as follows:

{
  "setting": {
    "zkServer": "192.168.101.10:2181"
  },
  "nodes": [
    { "baseUrl": "localhost:37001", "type": "e", "proxy": "192.168.101.10:37000" },
    { "baseUrl": "localhost:37002", "type": "e", "proxy": "192.168.101.10:37000" }
  ]
}

  4. The configuration of server D is as follows:

{
  "setting": {
    "zkServer": "192.168.101.10:2181"
  },
  "nodes": [
    { "baseUrl": "localhost:38001", "type": "f", "proxy": "192.168.101.10:38000" },
    { "baseUrl": "localhost:38002", "type": "f", "proxy": "192.168.101.10:38000" }
  ]
}

  5. Make sure the firewalls on all servers allow the relevant ports.
  6. Start RuiJi.Net.Cmd.exe as administrator on all machines and enter the management URL from any console output. You will see the same management page as in pseudo-distributed mode. http://www.ruijihg.com/wp-content/uploads/2018/06/3-3.png

Node type and responsibilities of RuiJi.Net

Start

The nodes of RuiJi.Net are divided into six types: the crawl node, the crawl proxy node, the extract node, the extract proxy node, the feed monitoring node, and the feed proxy node.

The function of each node is as follows:

Crawl node (c): downloads the source file at the specified address.

Crawl proxy node (cp): maintains the list of available crawl servers and assigns crawl tasks.

Extract node (e): extracts content according to the rules.

Extract proxy node (ep): maintains the list of available extract servers and assigns extraction tasks.

Feed monitoring node (f): periodically checks feeds for updates, forwards updated addresses to the crawl nodes for download, and saves the final extraction results.

Feed proxy node (fp): maintains the available feed monitoring nodes, records and assigns feeds, and matches rules by address.

RuiJi.Net cluster

http://www.ruijihg.com/wp-content/uploads/2018/05/2-2.png

The configuration file of RuiJi.Net is as follows; you need to set this information in the config file.

{
  "setting": {
    "zkPath": "zookeeper-3.4.12",
    "zkServer": "localhost:2181",
    "ruiJiServer": "localhost:36000",
    "docServer": "localhost:80"
  },
  "nodes": [
    { "baseUrl": "localhost:36000", "type": "cp" },
    { "baseUrl": "localhost:37000", "type": "ep" },
    { "baseUrl": "localhost:38000", "type": "fp" },
    { "baseUrl": "localhost:36001", "type": "c", "proxy": "localhost:36000" },
    { "baseUrl": "localhost:36002", "type": "c", "proxy": "localhost:36000" },
    { "baseUrl": "localhost:37001", "type": "e", "proxy": "localhost:37000" },
    { "baseUrl": "localhost:37002", "type": "e", "proxy": "localhost:37000" },
    { "baseUrl": "localhost:38001", "type": "f", "proxy": "localhost:38000" },
    { "baseUrl": "localhost:38002", "type": "f", "proxy": "localhost:38000" }
  ]
}

Setting

zkPath: ZooKeeper path
zkServer: ZooKeeper address
ruiJiServer: local mode service address
docServer: document server address

zkPath is used only when RuiJi.Net starts ZooKeeper itself; it can be omitted if you do not use self-managed ZooKeeper.

AppSettings

RuiJi.Net can use self-managed ZooKeeper: if you specify zkPath, RuiJi.Net will start ZooKeeper automatically.

zkServer tells every node the location of the ZooKeeper server.

 

RuiJi.Net Administrator UI

 

Introduction

RuiJi.Net provides a web-based management interface; its address is the baseUrl of any node you configure in the config file. Through the management interface you can observe the server's running status, logs, and cluster state; enter the feed addresses to monitor and their extraction rules; and preview captured results. In the Settings tab you can set some of the parameters RuiJi.Net requires.

http://www.ruijihg.com/wp-content/uploads/2018/07/1.png

The tabs in RuiJi.Net's management interface fall into two kinds. Status and Log display information about the current node; Cluster, Feeds, and crawl results are the same from any node, because RuiJi.Net forwards such requests to the relevant node through node routing.

Status

The Status tab displays the node type and status of the current node. Here you can observe the following:

  1. Node type, one of: STANDALONE, CRAWLER, CRAWLERPROXY, EXTRACTOR, EXTRACTORPROXY, FEED, FEEDPROXY

In stand-alone mode, the node type is STANDALONE.

  2. Node startup time
  3. Node runtime framework and server hardware environment
  4. The version of the RuiJi.Net class library used by the node
  5. Node resource consumption, including memory, CPU usage, and network card
  6. The author's updates to the project over the past month

http://www.ruijihg.com/wp-content/uploads/2018/07/2-2.png

Log

The Log tab displays the logs of the current node, including node startup logs, task scheduling information, and crawl and extraction logs. Only the most recent 1000 entries are shown; to view more, check the logs folder in the corresponding node's running directory.

http://www.ruijihg.com/wp-content/uploads/2018/07/4-1.png

Node

The Nodes tab displays the contents of ZooKeeper, including its tree structure and the information saved at each path. You can also view the IP range assigned to each crawl node and the range of feeds each feed node checks; both can be configured under the node settings in the Settings tab.

This tab is visible in pseudo-distributed and fully distributed modes.

http://www.ruijihg.com/wp-content/uploads/2018/07/26.png

Cluster

The Cluster tab shows the operational status of each node in the RuiJi.Net cluster. Running nodes are shown in bold; nodes that are not started or are down appear in a normal font. Clicking the circle in front of a node jumps to that node's management interface.

This tab is visible in pseudo-distributed and fully distributed modes.

http://www.ruijihg.com/wp-content/uploads/2018/07/3-2.png

RuiJi.Net Administrator UI – Feeds

Feed

With the Feeds tab, you can add, modify, and query the feeds you need to crawl, and you can perform rule tests and crawl simulations directly.

In pseudo-distributed and fully distributed modes, you can manage feeds from any node; these operations are routed to the feed proxy node for processing.

http://www.ruijihg.com/wp-content/uploads/2018/07/5.png
http://www.ruijihg.com/wp-content/uploads/2018/07/6.png

Edit

Click the Add button or the edit button of the feed to enter the feed detailed editing interface.

http://www.ruijihg.com/wp-content/uploads/2018/07/27.png

In Address, you can use address functions wrapped in {# #}. Address functions are introduced in a later chapter. The function shown in the figure makes RuiJi.Net scan the first two pages of the feed on each check.

The Content-Type and Data parameters need to be set when the request method is POST.

The field attributes in the dialog are as follows:

Site name: the feed's site name. Supports keyword search.
Remark: remarks. Supports keyword search.
Address: the address link. Supports keyword search.
Method: request method, GET or POST.
Content-Type: content type, application/x-www-form-urlencoded or application/json.
Data: request parameters, formatted according to the content type.
UA: browser User-Agent. If empty, a UA configured in Settings is used at random; if none is configured there, RuiJi's default is used.
Headers: request headers, one per line.
Genre: feed category, for future source classification.
Type: format of the address's return value. Recorded only; no practical use.
RuiJi Exp: RuiJi expression. See the introduction to RuiJi expressions for details.
Delay: how long to wait after a feed update before downloading the target links.
Scheduling: scan interval as a cron expression; it can be generated automatically or chosen from the presets.
Status: whether the feed is enabled.
RunJs: whether to execute the page's JavaScript. If ON, RuiJi.Net uses a headless browser to access the target page.

Test

The Test button in the dialog tests the extraction result of your RuiJi expression, as shown below.

If it is pseudo-distributed or fully distributed, please go to Setting > Node to set the available IP of the crawl node.

http://www.ruijihg.com/wp-content/uploads/2018/07/9.png

Download Target can download the target address and save it on the server in the corresponding format, as shown in the figure below.

http://www.ruijihg.com/wp-content/uploads/2018/07/10.png

RuiJi.Net accesses each feed at the configured interval and detects the feed's updated link addresses with an algorithm. The feed's selection result must be link addresses; otherwise RuiJi.Net will download the feed but do nothing with it.

 

RuiJi.Net Administrator UI – Rules

Rule

The Rules tab allows you to add, modify, and query crawl page rules.

http://www.ruijihg.com/wp-content/uploads/2018/07/11.png

Edit

Click the Add button or the rule's edit button to enter the rule detailed editing interface.

http://www.ruijihg.com/wp-content/uploads/2018/07/12.png

The field attributes are as follows:

Url: the original address the rule was extracted from. Supports keyword search for future reference.
Expression: address match expression. Supports keyword search; the wildcard * matches any run of characters and ? matches a single character.
Method: request method, GET or POST.
UA: browser User-Agent, chosen at random when used.
Headers: request headers, one per line.
Feature: interface characteristics. When multiple rules exist for an interface, rules are selected according to these characteristics.
Type: format of the address's return value. Recorded only; no practical use.
RuiJi Exp: RuiJi expression. See the introduction to RuiJi expressions for details.
Status: whether the rule is enabled.
RunJs: whether to execute the page's JavaScript. If ON, RuiJi.Net uses a headless browser to access the target page.

Test

Click the Test button to test the extraction results, as shown below

If it is pseudo-distributed or fully distributed, please go to Setting > Node to set the available IP of the crawl node.

http://www.ruijihg.com/wp-content/uploads/2018/07/13.png
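The wildcard matching used by the Expression field (* for any run of characters, ? for exactly one character) works like shell-style globbing. A minimal Python sketch of this behavior (illustrative only, not RuiJi.Net's actual C# matcher; the rule expressions below are hypothetical):

```python
import fnmatch

# Hypothetical rule expressions using the wildcard syntax described above.
rules = [
    "http://news.example.com/2018/*.html",   # any article under /2018/
    "http://news.example.com/p?.html",       # p1.html, p2.html, ...
]

def match_rule(url, expression):
    """Return True if the URL matches the wildcard expression."""
    return fnmatch.fnmatch(url, expression)

print(match_rule("http://news.example.com/2018/abc.html", rules[0]))  # True
print(match_rule("http://news.example.com/p1.html", rules[1]))        # True
print(match_rule("http://news.example.com/p12.html", rules[1]))       # False: ? is one char
```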

RuiJi.Net Administrator UI – Node Settings

 

The node settings in the Settings tab define the range of feeds each feed node checks and the range of IP addresses each crawl node can use.

This tab is visible in pseudo-distributed and fully distributed modes.

Feed scope

Set the range of feeds a feed node checks, in ID order, 50 feeds per page. For example, if node Feed1 is set to 1, 3, then when feeds are checked, Feed1 will take records 1-50 and 101-150, so 100 feeds in total are checked.
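The page-to-record mapping described above can be sketched in Python (illustrative only; RuiJi.Net itself is C#). Assuming 50 feeds per page, a node setting of "1, 3" selects these ID ranges:

```python
PAGE_SIZE = 50

def feed_ranges(pages):
    """Map a node's page numbers to (first, last) feed record ranges, 1-based."""
    ranges = []
    for page in pages:
        first = (page - 1) * PAGE_SIZE + 1
        last = page * PAGE_SIZE
        ranges.append((first, last))
    return ranges

# Node "Feed1" configured with pages 1 and 3:
print(feed_ranges([1, 3]))  # [(1, 50), (101, 150)]
```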

Multiple nodes cannot repeatedly detect the same page number

Crawl node IP range

Set the IP address range the crawl node can use when crawling pages. When crawling data, the node polls through the specified IPs.

Be sure to set at least one available IP for each crawl node when starting up for the first time; without it, no content can be downloaded.

http://www.ruijihg.com/wp-content/uploads/2018/07/15.png

RuiJi.Net Administrator UI – Function Settings

Function

Functions in RuiJi.Net can be used to process URL addresses or selector results. There are two types of functions: URLFUNCTION and SELECTORPROCESSOR.

http://www.ruijihg.com/wp-content/uploads/2018/07/16.png

Add

Click the Add button or the function's edit button to enter the function editing interface.

http://www.ruijihg.com/wp-content/uploads/2018/07/17.png

The field attributes in the dialog are as follows:

Name: the function name, used when calling it.
Code: the code to execute. Currently only C# is supported.
Type: the function type; see URLFUNCTION and SELECTORPROCESSOR for details.
Sample: a usage example.

 

Test

Click the Test button to test the function's results based on your usage example, as shown below.

http://www.ruijihg.com/wp-content/uploads/2018/07/19.png

RuiJi.Net Administrator UI – Function Type

 

URL FUNCTION

URLFUNCTION is used to process the URL address. You can use functions in a URL address like this:

http://xxx.xxx.com.cn/roll.php?do=query&callback=jsonp1475197217819&_={# ticks() #}&date={# now("yyyy-MM-dd") #}&size=20&page={# page(1,2) #}

A function used in an address must be wrapped in {# #}.

ticks generates a timestamp; now formats the current date according to the format passed in; page generates page turns. Running these functions might yield the following link addresses:

http://xxx.xxx.com.cn/roll.php?do=query&callback=jsonp1475197217819&_=1475197217&date=20180708&size=20&page=1

http://xxx.xxx.com.cn/roll.php?do=query&callback=jsonp1475197217819&_=1475197217&date=20180708&size=20&page=2

The page function is defined as follows

for (int i = {0}; i <= {1}; i++){{results.Add(i);}}

results is the output of a function: it is an array, so a function can return multiple values. When an address contains multiple functions, RuiJi.Net evaluates their results in order, and each group of results serves as input for the next function's calculation. If you use two address functions and each returns two results, the final number of calculated addresses is 2*2=4.
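The chained evaluation described above can be sketched in Python (illustrative; RuiJi.Net evaluates its functions in C#). Each function's result set multiplies the number of generated addresses:

```python
from itertools import product
import re

def expand(template, value_sets):
    """Replace each {# ... #} placeholder with every value its function
    returned, producing the cartesian product of all result sets."""
    placeholders = re.findall(r"\{#.*?#\}", template)
    urls = []
    for combo in product(*value_sets):
        url = template
        for placeholder, value in zip(placeholders, combo):
            url = url.replace(placeholder, str(value), 1)
        urls.append(url)
    return urls

# Two hypothetical functions, each returning two results -> 2*2 = 4 addresses.
urls = expand("http://example.com/roll.php?date={# d #}&page={# p #}",
              [["2018-07-08", "2018-07-09"], [1, 2]])
print(len(urls))  # 4
```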

RuiJi.Net has several built-in functions; their source is as follows.

ticks

results.Add(DateTime.Now.Ticks);

Example: ticks() Purpose: generate a unique timestamp value

page

for (int i = {0}; i <= {1}; i++){{results.Add(i);}}

Example: page(1,10) Purpose: Generate an address from 1 to 10 pages

limit

for (int i = {0}; i <= {1}; i++){{results.Add((i-1)*{2});}}

Example: limit(1,10,20) Purpose: generate offsets for pages 1 to 10, with a span of 20 per page

now

results.Add(DateTime.Now.ToString("{0}"));

Example: now("yyyy-MM-dd") Purpose: Format the current date according to the format passed in
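The behavior of these built-in functions can be mimicked in Python to show what each generates (a sketch of the behavior, not RuiJi.Net's actual C# source; the .NET date format is translated to Python's):

```python
from datetime import datetime

def ticks():
    # Like C#'s DateTime.Now.Ticks: a growing timestamp value.
    return [int(datetime.now().timestamp() * 10_000_000)]

def page(start, end):
    # page(1, 10) -> page numbers 1..10
    return list(range(start, end + 1))

def limit(start, end, span):
    # limit(1, 10, 20) -> offsets 0, 20, ..., 180 (span of 20 per page)
    return [(i - 1) * span for i in range(start, end + 1)]

def now(fmt):
    # now("yyyy-MM-dd") in .NET corresponds to "%Y-%m-%d" in Python
    return [datetime.now().strftime(fmt)]

print(page(1, 3))       # [1, 2, 3]
print(limit(1, 3, 20))  # [0, 20, 40]
```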

Selector function

The function selector is used when the standard selectors provided by RuiJi.Net cannot meet the extraction requirements. For example, when extracting a date, the result may be a timestamp, an incomplete date, or a relative value such as "a few minutes ago" or "a few days ago". When the returned results are not what we expect, we can use a SELECTORPROCESSOR to handle them.

For example, we can define a function named abc with the following content:

if (content.EndsWith("小时前"))  // "hours ago"
{
    var hours = Convert.ToInt32(Regex.Match(content, @"[\d]*").Value);
    results.Add(DateTime.Now.AddHours(-hours));
}

if (content.EndsWith("天前"))  // "days ago"
{
    var days = Convert.ToInt32(Regex.Match(content, @"[\d]*").Value);
    results.Add(DateTime.Now.AddDays(-days));
}

if (content.EndsWith("分钟前"))  // "minutes ago"
{
    var minutes = Convert.ToInt32(Regex.Match(content, @"[\d]*").Value);
    results.Add(DateTime.Now.AddMinutes(-minutes));
}

The code is written in C#. If you only use the RuiJi.Net.Core class library, store this code in the funcs folder of the executable directory as a .pro file named after the function. If you manage the whole project with RuiJi.Net, you can also manage function selectors in the management interface.

The content variable in the code is the result of the selector preceding the function selector. With custom function selectors you can clean inexact data precisely, according to your extraction needs.

You can use this in a RuiJi expression.

[block]
#BlockName
css #topsOfRecommend:ohtml

[tile]
#titlename
css .box-aw

[meta]
......

#postdate
css .blog-footer-box > span:eq(2):text
proc abc

We will also add a JavaScript-based function selector in the future.

 

RuiJi.Net Administrator UI – Proxy

Proxy

When RuiJi.Net requests a page for download, it simulates a browser User-Agent (UA) and uses the cookie manager to generate multiple cookies for the download according to the UA settings. The UA settings are divided into UA groups and UAs: a UA group represents a kind of PC or mobile browser, and a UA is a specific User-Agent within a group. Through this interface you can add, delete, and modify UA groups and their UAs. Note that deleting a UA group also deletes all of its UAs.

http://www.ruijihg.com/wp-content/uploads/2018/07/20.png

Add

Click the Add button or select a UA group and click the Update button to enter the UA group editing interface.

http://www.ruijihg.com/wp-content/uploads/2018/07/21.png

Click the Add button or click the edit button of a UA to enter the UA editing interface.

http://www.ruijihg.com/wp-content/uploads/2018/07/22.png

The field attributes in the dialog are as follows:

Group: the UA group name.
Name: the UA name.
Value: the UA string itself.
Count: the number of cookies to generate. The cookie manager generates this many cookie values for cookie polling when this UA is used.

Test

While crawling, RuiJi.Net polls across multiple IPs to avoid being blocked when requesting pages. The IP proxy settings provide proxies for this IP polling. Proxies can also be tested on this interface at any time; because proxies are unstable, test results may differ from run to run.

http://www.ruijihg.com/wp-content/uploads/2018/07/23.png
http://www.ruijihg.com/wp-content/uploads/2018/07/24.png

Edit

Click the Add button or a proxy's edit button to enter the proxy editing interface.

http://www.ruijihg.com/wp-content/uploads/2018/07/25.png

The field attributes in the dialog are as follows:

Ip: the proxy IP.
Port: the proxy port.
UserName: the proxy login account.
Password: the proxy login password.
Type: the proxy type, HTTP or HTTPS.
Status: whether the proxy is enabled.

 

 

RuiJi.Net extraction model

Structure

RuiJi.Net structures the targets to be extracted. Each target page is divided into the following structures: Block, Tile, and Meta. These are called extractors in RuiJi.Net.

http://www.ruijihg.com/wp-content/uploads/2018/06/1-3.png

Selector

Each RuiJi.Net extractor contains Selectors, which define what the extractor needs to extract. Selectors is an ordered list of selectors: each selector operates on the result of the previous one, so each successive selector extracts finer content from its predecessor's output.

If an extractor defines no selectors, its extracted content defaults to the entire document content, or to the parent extractor's extraction result.

http://www.ruijihg.com/wp-content/uploads/2018/07/2.png

Block extractor

Block is the most basic unit in the RuiJi.Net extraction model. A Block locates the extraction area; the Tiles and Metas under a Block extract from the Block's extraction result. The Selectors under a Block are the Block extractor's selectors.

http://www.ruijihg.com/wp-content/uploads/2018/07/bb647a1f149e9887c5858c1f90725945a53.png

A Block extractor can also contain multiple Blocks, represented by Blocks within the Block.

The reason for using a Block to locate a region is that the source page may contain multiple repeated regions and we are interested in only some of them; the Block lets us extract only the regions of interest and ignore the rest. In the figure above, perhaps we only want the latest recommendations, and need not extract today's hotspots, this week's hotspots, and so on.

Tile extractor

A Tile is a repeated block under a Block, usually used to extract list-style source pages. The Selectors under a Tile describe the content block to be extracted repeatedly; a Tile extractor usually selects multiple results.

http://www.ruijihg.com/wp-content/uploads/2018/07/3.png

Meta extractor

The Meta extractor can be used under both Tile and Block. Under a Tile, Meta extracts the metadata from each of the Tile's repeated results, which is typically used for list information. Under a Block, Meta extracts the metadata from the Block itself, which is typically used for the metadata of a detail page.

When Tile has Meta, the result of Meta extraction is usually multiple groups.

http://www.ruijihg.com/wp-content/uploads/2018/07/3-1.png

When Block has Meta, the result of Meta extraction is usually a group

http://www.ruijihg.com/wp-content/uploads/2018/07/4.png

RuiJi.Net selector type

RuiJi.Net's selectors extract content from the structure; an extractor usually has one or more of them. Each selector processes the result of the previous one, and by refining layer by layer we can narrow the extracted results down to exactly what we want.
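The layer-by-layer refinement can be sketched as a simple pipeline in Python (illustrative only; the selector steps below are stand-ins for RuiJi.Net's selector types):

```python
import re

def chain(content, selectors):
    """Feed each selector the previous selector's result."""
    for selector in selectors:
        content = selector(content)
    return content

# Hypothetical two-step chain: a TextRange-like step, then a Regex-like step.
html = "<div id='t'>Published: 2018-07-29 11:19</div>"
result = chain(html, [
    lambda c: c[c.index("Published:"):],                    # narrow to the region
    lambda c: re.search(r"\d{4}-\d{2}-\d{2}", c).group(0),  # extract the date
])
print(result)  # 2018-07-29
```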

RuiJi.Net's selectors have the following types

CSS: style selector, similar to jQuery
REGEX: regular-expression selector
REGEXSPLIT: split selector supporting regular-expression splitting
TEXTRANGE: text range selector
EXCLUDE: exclusion selector
REGEXREPLACE: replacement selector
JPATH: JSON selector
XPATH: XPath selector for handling XML documents
CLEAR: clear selector that cleans up HTML tags
EXPRESSION: expression selector for matching addresses
SELECTORPROCESSOR: function selector that processes selector results through custom functions

Css selector

The CSS selector is implemented with the CsQuery library, which provides a jQuery-like API for processing HTML pages via CSS selectors. In RuiJi.Net, the CSS selector is usually the first selector in Selectors, used to locate the selection area.

Regex selector

The regex selector uses regular expressions to extract content; in RuiJi.Net you can choose by configuration whether to extract the whole match or a group result.

RegexSplit selector

The RegexSplit selector splits the string with a regular expression and returns the parts at the specified indexes, so it can produce multiple results.

TextRange selector

The TextRange selector extracts the content between a defined starting string and ending string.
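A TextRange-style extraction can be sketched in Python (illustrative only, not RuiJi.Net's implementation): everything between the begin string and the end string is returned.

```python
def text_range(content, begin, end):
    """Return the text between the first `begin` and the following `end`."""
    start = content.index(begin) + len(begin)
    stop = content.index(end, start)
    return content[start:stop]

print(text_range("<title>RuiJi.Net</title>", "<title>", "</title>"))  # RuiJi.Net
```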

Exclude selector

The exclude selector is used to exclude the specified text content. What needs to be excluded is defined by a regular expression.

RegexReplace selector

The regular replacement selector is used to replace the matched result with the target result.

JPath selector

The JsonPath selector is used to process documents in Json format.

XPath selector

The XPath selector is used to process XML documents.

Clear selector

The Clear selector will automatically clear the tags of some Html source files, including: script, style, iframe, input, textarea, select, form, and comments.

Expression selector

The Expression selector typically uses wildcards to extract the required link addresses.

SELECTORPROCESSOR selector

The SELECTORPROCESSOR selector lets the user call an externally defined function to handle special extraction results, for example a time extracted as "xx minutes ago".
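As an illustration, a custom processor for the "xx minutes ago" case might look like this in Python (the function name and signature are hypothetical; RuiJi.Net's processor interface is in C#):

```python
import re
from datetime import datetime, timedelta

def minutes_ago_processor(value, now):
    """Convert a relative time such as '5 minutes ago' into an absolute
    datetime; leave other values untouched. Hypothetical example."""
    m = re.match(r"(\d+)\s*minutes? ago", value)
    if m:
        return now - timedelta(minutes=int(m.group(1)))
    return value

now = datetime(2019, 1, 1, 12, 0)
print(minutes_ago_processor("5 minutes ago", now))  # prints 2019-01-01 11:55:00
```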

RuiJi.Net RuiJi expression introduction

A RuiJi expression is a way to add extraction rules quickly while keeping the rules separate from the running program. RuiJi expressions are simple, easy to understand, and flexible to configure.

A RuiJi expression follows RuiJi.Net's extraction model: the extraction structure it describes maps directly onto that model.

RuiJi expressions can be stored in text files, databases, or caches and read only when extraction is needed, which means you can change an expression at any time without restarting the program.

In the future we plan to let each extraction node cache RuiJi expressions and receive change notifications to update its rules.

As mentioned above, the RuiJi expression follows the extraction structure of RuiJi.Net. The extractor in RuiJi.Net contains Block, Tile, and Meta, and uses [block], [tile], and [meta] in the RuiJi expression.

One of the simplest extraction expressions is defined as follows

[block]
#recommend
css #topsOfRecommend

This expression defines a block extractor named recommend. The extractor contains one CSS selector, and the outerHtml of the element with id topsOfRecommend is the final result of the block extractor.

An extractor can have a name, but a name is optional for blocks and tiles. If you define one, it must appear on the line following the extractor declaration and start with #.

Block with tiles

[block]
#recommend
css #topsOfRecommend:ohtml

[tile]
#tile
css .box-aw

Here the suffix :ohtml is appended to the block's CSS selector. As in the previous example, it selects the outerHtml of the element with id topsOfRecommend as the final result of the block extractor; the first example simply used the abbreviated form. When the selection result is a DOM node and no suffix is specified, outerHtml is the default. The other available suffixes are :html and :text.

The input of the tile extractor is the result of the block extractor, i.e. the outerHtml of the element with id topsOfRecommend. Within that result, every DOM node matching .box-aw is selected, so the tile selector yields one or more results.

Continue to extract metadata in the tile results

[block]
#BlockName
css #topsOfRecommend:ohtml

[tile]
#titlename
css .box-aw

	[meta]
	#title
	css .blog-title-link[title]

	#author
	css .blog-footer-box > span:first:text

	#postdate
	css .blog-footer-box > span:eq(2):text

	#reads_i
	css .blog-footer-box > span:last:text
	regS / / 1

The meta section here is indented one level to indicate that the extractor is a child of the tile; without the indentation it is a child of the block. One level of indentation is a tab.

A meta section can select multiple groups of data, each of which must have a name, and a blank line is required between the elements to be extracted. In this example the meta extractors run on each tile result, and each meta result is a dictionary.
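The block → tile → meta flow can be sketched in Python, with plain regexes standing in for RuiJi.Net's CSS selectors (an illustration of the data flow only, not the real engine):

```python
import re

def extract(html, block_sel, tile_sel, meta_sels):
    """Sketch of the layered extraction model: the block yields one string,
    the tile yields one or more strings, and each tile yields a meta dictionary."""
    block = re.search(block_sel, html).group(0)          # single block result
    tiles = re.findall(tile_sel, block)                  # one or more tile results
    return [{name: re.search(sel, t).group(1) for name, sel in meta_sels.items()}
            for t in tiles]

html = '<ul><li><b>T1</b><i>A1</i></li><li><b>T2</b><i>A2</i></li></ul>'
rows = extract(html, r"<ul>.*</ul>", r"<li>.*?</li>",
               {"title": r"<b>(.*?)</b>", "author": r"<i>(.*?)</i>"})
print(rows)  # prints [{'title': 'T1', 'author': 'A1'}, {'title': 'T2', 'author': 'A2'}]
```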

In the future we may provide a compiler for RuiJi expressions so that everyone can write them more easily.

 

RuiJi.Net RuiJi expression advanced

Multiple selectors

Here we take extracting a column name as an example. Suppose the source of a page is as follows:

<td>
    <div style="width:616px; float:left;" class="f12 black">
        <ul style="margin:0; padding:0;">
            <li style="float:left; width:120px; text-align:right;">法制网首页&gt;&gt;</li>
            <li style="float:left; width:350px; text-align:left;">
                <span style="padding:5px 0px 5px 15px;">
                    <a href="../../../node_34228.htm" target="_blank" class="f12 black">评论频道</a>
                    <font class="f12 black">&gt;&gt;</font>
                    <a href="../../../node_34252.htm" target="_blank" class="f12 black">法治时评</a>
                </span>
            </li>
        </ul>
    </div>
</td>

We use the following RuiJi expression to extract the column name. The text 法制网首页>> needs to be removed, so the selectors can be defined as follows:

[meta]
#railling
css div.f12:text
ex /\s+法制网首页>>/ -b
regR />>/ >

The first selector takes the text of div.f12; the result is:

 法制网首页>> 评论频道>>法治时评

Then the exclusion selector removes the leading text 法制网首页>> (-b means exclude at the beginning of the text):

 评论频道>>法治时评

Finally >> is replaced with >, giving the final result:

评论频道>法治时评

Of course, the column can also be extracted like this:

[meta]
#railling
css div.f12 span:text
regR />>/ >

This gives the same result as above.
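The three-step selector chain above can be reproduced in Python (an illustration of the chain's semantics, not RuiJi.Net's code; whitespace is trimmed at the end for display):

```python
import re

text = " 法制网首页>> 评论频道>>法治时评"           # result of `css div.f12:text`
step2 = re.sub(r"\A\s+法制网首页>>", "", text)     # ex /\s+法制网首页>>/ -b
step3 = step2.replace(">>", ">").strip()          # regR />>/ >
print(step3)  # prints "评论频道>法治时评"
```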

Type conversion

RuiJi.Net's extractor can convert a selector's result to a target data type. The conversion is requested by adding a suffix to the extractor name; the following table lists each suffix and its target type.

suffix  Target conversion type
*_i     int
*_s     string
*_l     long
*_b     bool
*_f     float
*_d     double
*_dt    datetime

If no suffix is specified, the extractor returns a string by default. If the conversion throws an exception, the extractor also falls back to returning a string.
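The suffix-driven conversion with string fallback can be sketched as follows (Python types approximate the C# ones; datetime and bool handling are omitted for brevity):

```python
def convert_by_suffix(name, value):
    """Convert `value` according to the extractor-name suffix; on any
    conversion error, fall back to the original string as described above."""
    casts = {"_i": int, "_l": int, "_f": float, "_d": float}
    for suffix, cast in casts.items():
        if name.endswith(suffix):
            try:
                return cast(value)
            except (ValueError, TypeError):
                return value
    return value

print(convert_by_suffix("reads_i", "1024"))  # prints 1024 (an int)
print(convert_by_suffix("reads_i", "n/a"))   # conversion fails, prints the string n/a
```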

The following is a simple example

[meta]
#title
css .blog-title-link[title]

#author
css .blog-footer-box > span:first:text

#postdate_dt
css .blog-footer-box > span:eq(2):text

#reads_i
css .blog-footer-box > span:last:text
regS / / 1

Paging extractor

The paging extractor is a special extractor used to extract a page's pagination links; its result must be link addresses. The paging extractor automatically fetches each page in the order the links appear and merges the content fields defined in [meta]. It is typically used for detail pages that span multiple pages.

[block]
......

[meta]
......

#content
css .a-con:ohtml

[paging]
css .a-page
css a[href]

You need to ensure that the page you start extracting from is the first page of the series; otherwise the merged result may be incorrect.
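A sketch of the merge semantics described above, with page fetching omitted (this is not RuiJi.Net's implementation; each dict stands for one page's meta result):

```python
def merge_paged_content(pages, content_field="content"):
    """Merge the content field from each page's meta result, in link order;
    the other meta fields are taken from the first page."""
    merged = dict(pages[0])
    merged[content_field] = "".join(p[content_field] for p in pages)
    return merged

pages = [{"title": "Post", "content": "part1 "},
         {"title": "Post", "content": "part2"}]
print(merge_paged_content(pages))  # prints {'title': 'Post', 'content': 'part1 part2'}
```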

RuiJi.Net RuiJi expression selector

 

Css selector

expression                                Description
css tag[xxx]                              Select the attribute xxx
css tag:text                              Select text
css tag:ohtml                             Select outerHtml
css tag:html                              Select innerHtml
css dd[class='f12 balck02 yh'] + dd:text  Select the text of the dd that immediately follows a dd with class 'f12 balck02 yh'

Exclude exclusion selector

expression   Description
ex /abc/ -b  Exclude at the beginning of the text
ex /abc/ -a  Exclude anywhere in the text
ex /abc/ -e  Exclude at the end of the text

/abc/ is a regular expression; a string starting with / and ending with / denotes a regex.

Expression wildcard selector (for URL extraction only)

expression                      Description
exp http://www.ruijihg.com/*    Match any URL starting with http://www.ruijihg.com/
exp http://www.ruijihg.com/???  Match URLs starting with http://www.ruijihg.com/ followed by exactly three characters
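These wildcard semantics resemble shell globbing, which Python's fnmatch module implements: * matches any sequence and each ? matches exactly one character (an analogy for illustration; RuiJi.Net's matcher may differ in details):

```python
from fnmatch import fnmatch

# * matches any suffix, including further path segments
print(fnmatch("http://www.ruijihg.com/news/1.htm", "http://www.ruijihg.com/*"))  # True
# ??? requires exactly three trailing characters
print(fnmatch("http://www.ruijihg.com/abc", "http://www.ruijihg.com/???"))   # True
print(fnmatch("http://www.ruijihg.com/abcd", "http://www.ruijihg.com/???"))  # False
```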

Regex regular selector

expression       Description
reg /abc/        Return the whole match
reg /abc(.*)/ 1  Return the nth capture-group result (here group 1)

RegexSplit split selector

expression    Description
regS /abc/ 3  Split the string with the regex and take the nth part (here index 3)

RegexReplace replacement selector

expression      Description
regR /abc/ 123  Replace regex matches with the target text

TextRange text area selector

expression        Description
text /abc/ /edf/  Take the string between the start text /abc/ and the end text /edf/

XPath selector

expression                Description
xpath /bookstore/book[1]  XPath query

JsonPath selector

expression    Description
jpath $..url  JsonPath query

SELECTORPROCESSOR selector

expression  Description
proc name   Execute the custom function named name