Source types for indexing

Several types of data can serve as source data for indexing a website. You can choose any one of the sources described below or a combination of them.
For example, a common practice is to index a whole site by downloading its HTML pages and then keep the index up to date through its RSS feed.
We recommend that you read the sections below to pick the most suitable combination and to learn the pros and cons of each data source.

 

Downloading HTML pages

Your website URL is in the Data Source box.

This is the simplest data source for indexing. The Quintura downloader first reads your robots.txt, if any. Then, observing the restrictions in robots.txt, it scans the starting page for links, then the newly found pages for further links, and so on. The downloader never downloads folders above the folder specified in the Data Source box, nor does it download pages from other sites.

After the found pages are downloaded, the Quintura indexer extracts their textual data and compiles an index used for searching. Currently we restrict the number of downloaded pages to 5,000. Quintura can lift the restriction at your written request sent to support@quintura.com.
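The crawl described above can be sketched roughly as follows. This is only an illustration, not Quintura's actual downloader: the names crawl, in_scope and LinkParser are hypothetical, the fetch function is supplied by the caller, and the scope rule (same host, same folder or deeper) is an assumption about what "never above the folder in the Data Source box" means.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.robotparser import RobotFileParser


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def in_scope(url, start_url):
    """True if url is on the start host and not above the start folder."""
    start, candidate = urlsplit(start_url), urlsplit(url)
    # Folder of the start URL, e.g. "/docs/" for "/docs/index.html".
    folder = start.path.rsplit("/", 1)[0] + "/"
    return candidate.netloc == start.netloc and candidate.path.startswith(folder)


def crawl(start_url, fetch, robots_lines=(), max_pages=5000):
    """Breadth-first crawl. fetch(url) must return the page HTML or None."""
    robots = RobotFileParser()
    robots.parse(robots_lines)  # lines of robots.txt, already downloaded
    queue, seen, pages = deque([start_url]), {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # excluded by robots.txt
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if link not in seen and in_scope(link, start_url):
                seen.add(link)
                queue.append(link)
    return pages
```

Passing fetch as a parameter keeps the sketch independent of any HTTP library. The max_pages default mirrors the 5,000-page limit mentioned above.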

Important!

If your sitemap is registered in your robots.txt (for more info on sitemaps, see below), the Quintura downloader will refer to it and start downloading your site by using the sitemap. Other restrictions from robots.txt are also observed.

Optimizing

To enhance the cloud, you can use special meta tags on your web pages. The most useful meta tags are those unique to a page or to a set of pages from the same section. They give your users a chance to find just what you want them to find for a given search request. Meta tags get higher priority than common words from the page text. More details…

Advantages

The advantage of this data source is its simplicity for users: they enter the URL of their site and wait until the cloud is built. However, this data source also has some disadvantages.

Disadvantages

  1. All pages detected by following links creep into your index, including uninformative or low-information “service” pages and undesirable advertising pages. Although the Quintura indexer is an intelligent tool and gets rid of superfluous information automatically, we recommend that you use more informative data sources (see below).
  2. HTML updating is far less frequent than RSS updating (weekly versus hourly). Hence, this type of data is not well suited to sites whose content is updated often.

 

Sitemaps

The Data Source box contains a site or site-section URL, and one or more sitemaps are mentioned in robots.txt.

Sitemaps are used by the majority of search engines. A sitemap is an XML file containing information about the pages that are important for a given site and will be of interest for searching. In this case, the Quintura crawler still downloads HTML pages but filters them according to the sitemap.
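As a rough illustration of filtering by sitemap, the sketch below reads the <loc> entries of a sitemap and keeps only the found URLs that appear there. The function names are hypothetical, and exact string matching of URLs is an assumption; how Quintura compares URLs is not described here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return the set of page URLs listed in the <loc> elements of a sitemap."""
    root = ET.fromstring(sitemap_xml)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def filter_by_sitemap(found_urls, sitemap_xml):
    """Keep only the found URLs that the sitemap lists, preserving their order."""
    allowed = sitemap_urls(sitemap_xml)
    return [url for url in found_urls if url in allowed]
```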

Where to mention

You can make crawlers aware of your sitemap by listing it in your …/robots.txt file.
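For illustration, a sitemap is usually listed in robots.txt with a "Sitemap:" line. The sketch below shows such a file and reads the sitemap URLs back with Python's standard robots.txt parser (site_maps() needs Python 3.8 or later). The example.com URLs and the /admin/ rule are made-up sample values, and sitemaps_from_robots is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content; all values are invented for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: http://example.com/sitemap.xml
"""


def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs listed in a robots.txt text (possibly empty)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []
```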

How to create

If your site has no sitemap, you can create one by following the instructions at http://www.sitemaps.org/protocol.php.
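A minimal sitemap needs only a <urlset> root element with one <url>/<loc> pair per page. The sketch below generates one from a list of page URLs. The build_sitemap name is hypothetical, and optional elements of the protocol such as <lastmod> are left out.

```python
import xml.etree.ElementTree as ET

SITEMAP_XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(page_urls):
    """Return a minimal sitemap XML document listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_XMLNS)
    for page_url in page_urls:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" when serializing.
        ET.SubElement(url, "loc").text = page_url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Save the returned text as a file on your site (for example sitemap.xml, an assumed name) and mention that file in robots.txt.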

 

RSS (or ATOM) feeds

Your RSS feed is in the Data Source box.

An RSS feed usually contains only the latest content added to your site. The information is stored in a compact XML file and is structured into URLs, titles, and annotations. The format is well suited to updating your site index (and hence your cloud). RSS feeds on blogs can also be used for compiling initial clouds. Some blog platforms support an almost unlimited number of entries in an RSS feed. For example, yoursite.blogspot.com blogs accept the max-results parameter: http://yoursite.blogspot.com/feeds/posts/default?max-results=1000.
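To illustrate the structure described above, the sketch below reads the URL, title, and annotation of each item from an RSS 2.0 document and builds the max-results feed URL from the example. The function names are hypothetical, and only plain RSS 2.0 (<channel>/<item>) is handled. ATOM feeds use different element names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def blogspot_feed_url(blog_host, max_results):
    """Build the feed URL pattern shown in the text for a blogspot.com blog."""
    query = urlencode({"max-results": max_results})
    return f"http://{blog_host}/feeds/posts/default?{query}"


def rss_items(rss_xml):
    """Return one dict per <item>: its URL, title and annotation (description)."""
    channel = ET.fromstring(rss_xml).find("channel")
    if channel is None:
        return []
    items = []
    for item in channel.findall("item"):
        items.append({
            "url": item.findtext("link", default="").strip(),
            "title": item.findtext("title", default="").strip(),
            "annotation": item.findtext("description", default="").strip(),
        })
    return items
```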

Advantages

  1. Your index is updated on an hourly basis, so every hour your latest content is added. For more frequent updates, contact us at support@quintura.com.
  2. Updating your index from RSS takes far less time than updating it from your HTML pages.
  3. RSS usually doesn’t contain any superfluous data such as advertising, recurring page blocks, or clutter. Therefore a cloud based on your RSS feed is more helpful than a typical HTML cloud.

Disadvantages

RSS annotations, which serve as the source of textual information, usually contain only an abstract and not the details. Therefore many words that could get into the index from the full article are missing from your cloud. Here is a tip to remedy the problem: add your full texts to your RSS as a <content> tag, or create QXML for your site.
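As one possible illustration of the tip above, the sketch below builds an ATOM entry that carries both a short <summary> and the full text in a <content> element. This is an assumption about what a suitable feed entry could look like. The exact placement Quintura expects for <content> is not specified here, and the atom_entry name and all sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def atom_entry(url, title, summary, full_text, updated):
    """Return an ATOM <entry> holding both an abstract and the full text."""
    entry = ET.Element("entry", xmlns=ATOM_NS)
    ET.SubElement(entry, "id").text = url
    ET.SubElement(entry, "title").text = title
    ET.SubElement(entry, "updated").text = updated  # e.g. "2010-01-01T00:00:00Z"
    ET.SubElement(entry, "link", href=url)
    ET.SubElement(entry, "summary").text = summary
    # The full article text, so that all of its words can be indexed.
    ET.SubElement(entry, "content", type="text").text = full_text
    return ET.tostring(entry, encoding="unicode")
```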

Supported RSS protocols

In addition to the standard RSS protocol, we support several of the most popular RSS and ATOM variants. Although the Quintura RSS downloader tolerates some non-standard tag names and some deviation in their positioning, a substantial deviation from the standard protocol may cause processing problems. This is why we recommend that you stick to the standard.
You can also use QXML, a proprietary Quintura format, instead of RSS.

 

QXML (Quintura XML)

QXML is an XML format compatible with Quintura. You can convert your site content to QXML to build the highest-quality cloud.
Files of this format can be used both for initial indexing and for index updating. More details…