Vote #80155

robots.txt: disallow crawling dynamically generated PDF documents

Added by Admin Redmine over 3 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: SEO
Target version: 4.2.0
Start date: 2022/05/09
Due date:
% Done: 0%
Estimated time:
category_id: 48
version_id: 152
issue_org_id: 31617
author_id: 405544
assigned_to_id: 332
comments: 8
status_id: 5
tracker_id: 2
plus1: 0
affected_version:
closed_on:
affected_version_id:

Description

While the auto-generated robots.txt contains URLs for /issues (the HTML issue list), it doesn't contain the same URLs for the PDF version.

At osmocom.org (where we use Redmine), we're currently seeing lots of robot requests for /projects/*/issues.pdf?.... as well as /issues.pdf?....
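
For illustration, here is a sketch of the kind of file the stock template produces (assuming default routes and a single project named foo; the exact contents depend on the installation):

<pre>
User-agent: *
Disallow: /projects/foo/repository
Disallow: /projects/foo/issues
Disallow: /issues/gantt
Disallow: /issues/calendar
Disallow: /activity
Disallow: /search
# Nothing above matches the site-wide /issues.pdf?... URLs.
</pre>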


journals

I'm sorry, it seems the robots.txt standard uses prefix (sub-string) matching, so a rule for foo/issues should also cover foo/issues.pdf. The crawler we see seems to be ignoring that :(
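
For example, with prefix matching a single rule should already cover both the HTML list and its PDF variant (foo being a placeholder project identifier):

<pre>
# Matches any path beginning with this prefix, e.g.
# /projects/foo/issues, /projects/foo/issues.pdf,
# /projects/foo/issues?page=2
Disallow: /projects/foo/issues
</pre>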
--------------------------------------------------------------------------------
Thank you for the feedback. Closing.
--------------------------------------------------------------------------------
The robots.txt generated by Redmine 4.1 does not disallow crawlers from accessing "/issues/<id>.pdf" or "/projects/<project_identifier>/wiki/<page_name>.pdf".

I think the following line should be added to the robots.txt.

<pre>
Disallow: *.pdf
</pre>
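
Note that under the wildcard matching implemented by major crawlers, such a blanket rule would also cover the static PDF files attached to issues and wiki pages, something the patch further below deliberately avoids:

<pre>
Disallow: *.pdf
# would also match static attachments, e.g. /attachments/download/1/report.pdf
</pre>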
--------------------------------------------------------------------------------
Since dynamically generated PDFs contain no more information than the HTML pages and are useless for web surfers, the PDFs should not be indexed by search engines. In addition, generating a large number of PDFs in a short period of time puts too much of a burden on the server.

I suggest disallowing web crawlers from fetching dynamically generated PDFs such as /projects/*/wiki/*.pdf and /issues/*.pdf by applying the following patch. The patch still allows crawlers to fetch static PDF files attached to issues or wiki pages (/attachments/*.pdf).

<pre><code class="diff">
diff --git a/app/views/welcome/robots.text.erb b/app/views/welcome/robots.text.erb
index 6f66278ad..9cf7f39a6 100644
--- a/app/views/welcome/robots.text.erb
+++ b/app/views/welcome/robots.text.erb
@@ -10,3 +10,5 @@ Disallow: <%= url_for(issues_gantt_path) %>
 Disallow: <%= url_for(issues_calendar_path) %>
 Disallow: <%= url_for(activity_path) %>
 Disallow: <%= url_for(search_path) %>
+Disallow: <%= url_for(issues_path(:trailing_slash => true)) %>*.pdf$
+Disallow: <%= url_for(projects_path(:trailing_slash => true)) %>*.pdf$
</code></pre>
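
Assuming Redmine is served from the site root, the two added lines expand to something like:

<pre>
Disallow: /issues/*.pdf$
Disallow: /projects/*.pdf$
</pre>

The * and $ wildcards are extensions honored by the major crawlers rather than part of the original robots.txt standard. Attachment URLs such as /attachments/*.pdf start with neither prefix, so static PDFs stay crawlable.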

--------------------------------------------------------------------------------
Setting the target version to 4.2.0.
--------------------------------------------------------------------------------
Committed the patch.
--------------------------------------------------------------------------------


related_issues

relates - New - #3661: Configuration option to disable pdf creation of issues
relates - Closed - #6734: robots.txt: disallow crawling issues list with a query string

Updated by Admin Redmine over 3 years ago

  • Category set to SEO
  • Target version set to 4.2.0
