This week I integrated Apache Tika into Moodle to support indexing of Rich Documents such as .PPT files. Solr's ExtractingRequestHandler uses Tika to let users upload binary files to Solr and have Solr extract the text from them and index it, which makes the files searchable.
One has to send the file to Solr via HTTP POST. The following cURL request does the work:
curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "firstname.lastname@example.org"
The ps.pdf file is sent to Solr to extract content from it.
literal.id=1: assigns id=1 to the Solr Document thus created.
commit=true: commits the changes to the Solr index.
@ps.pdf: the path given after the @ sign needs to be a valid relative or absolute path.
Refer to the Solr wiki for more options.
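To make the query string of that request easier to see, here is a minimal sketch that assembles the same URL. It is written in Python so it can run on its own and is not the code used in Moodle. The function name build_extract_url and its arguments are made up for illustration; only the /update/extract path and the literal.id and commit parameters come from the cURL request above.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, commit=False):
    """Build the ExtractingRequestHandler URL for one file.

    base_url: Solr base URL, e.g. "http://localhost:8983/solr"
    doc_id:   value assigned to the id field of the new Solr Document
    commit:   when True, ask Solr to commit right after this request
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    # urlencode escapes any characters that are unsafe in a query string
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


# Reproduces the URL used in the cURL request
print(build_extract_url("http://localhost:8983/solr", 1, commit=True))
```

Leaving commit at its default of False gives a URL that adds the document without committing it.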
Now, with the PECL PHP Solr client in Moodle, there isn't a way to get the extracted content back and add it to a field of an existing Solr Document. The cURL request creates an all-new Solr Document specifically for the file and adds the content to that Solr Document's fields.
In addition, Moodle's get_content_file_location() function, which gives the absolute file path of a file, is protected.
So, keeping the above in mind, I had to come up with the following logic for adding Rich Document indexing via ExtractingRequestHandler to Global Search.
The access rights will be checked by extracting the $id of the Solr Document and passing it to the forum's access check function.
And, here’s the code that I’ve written for the Forum Module.
The above code sends the external files to Solr, which extracts their content and creates new Solr Documents. I'm not committing the Documents after each cURL request, as that would take a lot of time. Instead, after all the documents have been added, I execute $client->commit once at the end.
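The "send everything first, commit once" pattern can be sketched as follows. This is an illustration in Python, not the Moodle code: index_files and the post callable are hypothetical names, and post stands in for whatever HTTP client actually uploads the file. The sketch also assumes the final commit is triggered through Solr's /update?commit=true URL, whereas the real code calls $client->commit.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # same host as the cURL example


def index_files(files, post, base_url=SOLR_BASE):
    """Send every file to the extract handler, then commit a single time.

    files: iterable of (doc_id, path) pairs
    post:  hypothetical callable post(url, path) that performs the HTTP POST;
           path is None for the final commit request
    Returns the number of files sent.
    """
    sent = 0
    for doc_id, path in files:
        # No commit parameter here: the document is added but not committed
        query = urlencode({"literal.id": doc_id})
        post(f"{base_url}/update/extract?{query}", path)
        sent += 1
    if sent:
        # One commit for the whole batch instead of one per file
        post(f"{base_url}/update?commit=true", None)
    return sent


# Dry run that records the requests instead of sending them
calls = []
index_files([(1, "ps.pdf"), (2, "slides.ppt")],
            lambda url, path: calls.append((url, path)))
for url, path in calls:
    print(url, path)
```

With two files, the dry run records three requests: two extract requests without commit, followed by a single commit request.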