Issue #80155: robots.txt: disallow crawling dynamically generated PDF documents
Status: Closed (done ratio: 0%)

Description
While the auto-generated robots.txt contains Disallow URLs for /issues (the HTML issue list), it doesn't contain the same URLs for the PDF version.
At osmocom.org (where we use redmine), we're currently seeing lots of robot requests for /projects/*/issues.pdf?.... as well as /issues.pdf?....
--------------------------------------------------------------------------------
I'm sorry, it seems the robots.txt standard uses prefix matching, so a rule for foo/issues should already cover foo/issues.pdf. The crawler we see seems to be ignoring that :(
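As a side note, prefix matching means a compliant crawler compares the request path against each Disallow rule with a simple starts-with test. A minimal sketch (not Redmine code; the rule list is illustrative):

```ruby
# Minimal sketch of robots.txt prefix matching: a path is blocked when it
# starts with any Disallow rule. The rules here are illustrative only.
DISALLOWED = ["/issues", "/projects"].freeze

def blocked?(path)
  DISALLOWED.any? { |rule| path.start_with?(rule) }
end

puts blocked?("/issues.pdf") # true: "/issues" is a prefix of "/issues.pdf"
puts blocked?("/news")       # false: no rule is a prefix of this path
```

A crawler that honours plain prefix matching would therefore already skip /issues.pdf; the one hitting osmocom.org apparently does not.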
--------------------------------------------------------------------------------
Thank you for the feedback. Closing.
--------------------------------------------------------------------------------
The robots.txt generated by Redmine 4.1 does not disallow crawlers from accessing "/issues/<id>.pdf" or "/projects/<project_identifier>/wiki/<page_name>.pdf".
I think the following line should be added to robots.txt.
<pre>
Disallow: *.pdf
</pre>
--------------------------------------------------------------------------------
Since dynamically generated PDFs contain no more information than the corresponding HTML pages and are useless to web surfers, they should not be indexed by search engines. In addition, generating a large number of PDFs in a short period of time places an excessive burden on the server.
I suggest disallowing web crawlers from fetching dynamically generated PDFs such as /projects/*/wiki/*.pdf and /issues/*.pdf by applying the following patch. The patch still allows crawlers to fetch static PDF files attached to issues or wiki pages (/attachments/*.pdf).
<pre><code class="diff">
diff --git a/app/views/welcome/robots.text.erb b/app/views/welcome/robots.text.erb
index 6f66278ad..9cf7f39a6 100644
--- a/app/views/welcome/robots.text.erb
+++ b/app/views/welcome/robots.text.erb
@@ -10,3 +10,5 @@ Disallow: <%= url_for(issues_gantt_path) %>
 Disallow: <%= url_for(issues_calendar_path) %>
 Disallow: <%= url_for(activity_path) %>
 Disallow: <%= url_for(search_path) %>
+Disallow: <%= url_for(issues_path(:trailing_slash => true)) %>*.pdf$
+Disallow: <%= url_for(projects_path(:trailing_slash => true)) %>*.pdf$
</code></pre>
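Note that the `*` and `$` in the new rules rely on the wildcard extensions that major crawlers such as Googlebot support on top of the original standard. A hypothetical sketch of how such a crawler might compile one of the generated rules (assuming Redmine is served at the site root, so `issues_path` renders as `/issues/`):

```ruby
# Hypothetical sketch: translate a robots.txt rule using the "*" (match any
# characters) and "$" (end of URL) extensions into a regular expression.
def rule_to_regex(rule)
  parts = rule.split(/([*$])/).map do |token|
    case token
    when "*" then ".*"
    when "$" then "\\z"
    else Regexp.escape(token)
    end
  end
  Regexp.new("\\A" + parts.join)
end

re = rule_to_regex("/issues/*.pdf$")
puts re.match?("/issues/123.pdf") # true: the dynamically generated PDF is blocked
puts re.match?("/issues/123")     # false: the HTML issue page stays crawlable
```

Crawlers without these extensions simply treat the whole line as a literal prefix, which matches nothing here, so the rules degrade safely.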
--------------------------------------------------------------------------------
Setting the target version to 4.2.0.
--------------------------------------------------------------------------------
Committed the patch.
--------------------------------------------------------------------------------
Related issues:
- relates to #3661 (New): Configuration option to disable pdf creation of issues
- relates to #6734 (Closed): robots.txt: disallow crawling issues list with a query string