Issue #80155: robots.txt: disallow crawling dynamically generated PDF documents
Status: Closed (done ratio: 0%)

Description
While the auto-generated robots.txt contains Disallow URLs for /issues (the HTML issue list), it doesn't contain the same URLs for the PDF version.
At osmocom.org (where we use redmine), we're currently seeing lots of robot requests for /projects/*/issues.pdf?.... as well as /issues.pdf?....
--------------------------------------------------------------------------------
I'm sorry, it seems the robots.txt standard uses prefix matching, so a rule for foo/issues should already cover foo/issues.pdf. The crawler we see seems to be ignoring that :(
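As a side note, prefix matching means a compliant crawler compares the request path against each Disallow rule with a simple starts-with test. A minimal sketch (not Redmine code; the rule list is illustrative):

```ruby
# Minimal sketch of robots.txt prefix matching: a path is blocked when it
# starts with any Disallow rule. The rules here are illustrative only.
DISALLOWED = ["/issues", "/projects"].freeze

def blocked?(path)
  DISALLOWED.any? { |rule| path.start_with?(rule) }
end

puts blocked?("/issues.pdf") # true: "/issues" is a prefix of "/issues.pdf"
puts blocked?("/news")       # false: no rule is a prefix of this path
```

A crawler that honours plain prefix matching would therefore already skip /issues.pdf; the one hitting osmocom.org apparently does not.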
--------------------------------------------------------------------------------
Thank you for the feedback. Closing.
--------------------------------------------------------------------------------
The robots.txt generated by Redmine 4.1 does not disallow crawlers from accessing "/issues/<id>.pdf" or "/projects/<project_identifier>/wiki/<page_name>.pdf".
I think the following line should be added to robots.txt.
<pre>
Disallow: *.pdf
</pre>
--------------------------------------------------------------------------------
Since dynamically generated PDFs contain no more information than the corresponding HTML pages and are useless to web surfers, they should not be indexed by search engines. In addition, generating a large number of PDFs in a short period of time places an excessive burden on the server.
I suggest disallowing web crawlers from fetching dynamically generated PDFs such as /projects/*/wiki/*.pdf and /issues/*.pdf by applying the following patch. The patch still allows crawlers to fetch static PDF files attached to issues or wiki pages (/attachments/*.pdf).
<pre><code class="diff">
diff --git a/app/views/welcome/robots.text.erb b/app/views/welcome/robots.text.erb
index 6f66278ad..9cf7f39a6 100644
--- a/app/views/welcome/robots.text.erb
+++ b/app/views/welcome/robots.text.erb
@@ -10,3 +10,5 @@ Disallow: <%= url_for(issues_gantt_path) %>
 Disallow: <%= url_for(issues_calendar_path) %>
 Disallow: <%= url_for(activity_path) %>
 Disallow: <%= url_for(search_path) %>
+Disallow: <%= url_for(issues_path(:trailing_slash => true)) %>*.pdf$
+Disallow: <%= url_for(projects_path(:trailing_slash => true)) %>*.pdf$
</code></pre>
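Note that the `*` and `$` in the new rules rely on the wildcard extensions that major crawlers such as Googlebot support on top of the original standard. A hypothetical sketch of how such a crawler might compile one of the generated rules (assuming Redmine is served at the site root, so `issues_path` renders as `/issues/`):

```ruby
# Hypothetical sketch: translate a robots.txt rule using the "*" (match any
# characters) and "$" (end of URL) extensions into a regular expression.
def rule_to_regex(rule)
  parts = rule.split(/([*$])/).map do |token|
    case token
    when "*" then ".*"
    when "$" then "\\z"
    else Regexp.escape(token)
    end
  end
  Regexp.new("\\A" + parts.join)
end

re = rule_to_regex("/issues/*.pdf$")
puts re.match?("/issues/123.pdf") # true: the dynamically generated PDF is blocked
puts re.match?("/issues/123")     # false: the HTML issue page stays crawlable
```

Crawlers without these extensions simply treat the whole line as a literal prefix, which matches nothing here, so the rules degrade safely.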
--------------------------------------------------------------------------------
Setting the target version to 4.2.0.
--------------------------------------------------------------------------------
Committed the patch.
--------------------------------------------------------------------------------
Related issues:
- relates to #3661 (New): Configuration option to disable pdf creation of issues
- relates to #6734 (Closed): robots.txt: disallow crawling issues list with a query string