Creating spdx-tv output sometimes fails with an exception #448

sschuberth · Jan 12, 2017

Not sure yet what's going on, but I'd like to document it here. For me running

scancode --diag --timeout 180 -n 2 -f spdx-tv report.spdx

might end up with

  File "/home/jenkins/jobs/scancode/workspace/tools/scancode/src/scancode/cli.py", line 335, in scancode
    save_results(files_count, results, format, input, output_file)
  File "/home/jenkins/jobs/scancode/workspace/tools/scancode/src/scancode/cli.py", line 700, in save_results
    file_entry.chk_sum = Algorithm('SHA1', file_entry.calc_chksum())
  File "/home/jenkins/jobs/scancode/workspace/tools/scancode/local/lib/python2.7/site-packages/spdx/file.py", line 151, in calc_chksum
    with open(self.name, 'rb') as file_handle:
IOError: [Errno 21] Is a directory: u'bin'

@pombredanne In Algorithm('SHA1', file_entry.calc_chksum()) the file_entry comes from here, so it looks like file_data['path'] might contain the path to a directory instead of to a file. Is that correct / by design?
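For context, the IOError above is what `open()` raises on Linux when it is handed a directory path. A minimal sketch of a guard that only hashes regular files might look like the following; `sha1_or_none` is a hypothetical helper for illustration and is not part of ScanCode or the spdx library.

```python
import hashlib
import os


def sha1_or_none(path):
    """Return the hex SHA1 of a regular file, or None for anything else.

    Hypothetical helper for illustration; not ScanCode or spdx library code.
    """
    if not os.path.isfile(path):
        # Directories (such as 'bin' in the traceback) cannot be opened for reading.
        return None
    digest = hashlib.sha1()
    with open(path, 'rb') as file_handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: file_handle.read(65536), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

A caller could then skip the checksum whenever the helper returns None instead of letting the exception abort the whole report.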

pombredanne · Jan 12, 2017

@sschuberth yes, by design directories are also returned in a scan. Rather than letting the spdx library compute the SHA1 with file_entry.calc_chksum(), you should IMHO always get the info scan and use the SHA1 already computed there. It will be empty for directories.
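As a rough sketch of that suggestion: reuse whatever SHA1 the scan already carries and treat an empty value as "no checksum". The layout of `file_data` here (a dict with `path` and `sha1` keys) is an assumption based on the names used in this thread, and `checksum_from_scan` is a hypothetical helper, not ScanCode code.

```python
def checksum_from_scan(file_data):
    """Return the SHA1 stored in a scan entry, or None if it is empty or missing.

    Assumes `file_data` is a dict with a 'path' key and, when available,
    a 'sha1' key; directories are expected to carry an empty SHA1.
    """
    return file_data.get('sha1') or None


# Made-up scan entries: one directory and one (empty) file.
scan_results = [
    {'path': 'bin', 'sha1': None},
    {'path': 'bin/tool.py', 'sha1': 'da39a3ee5e6b4b0d3255bfef95601890afd80709'},
]

# Keep only entries that actually have a checksum to report.
with_checksums = [
    (entry['path'], checksum_from_scan(entry))
    for entry in scan_results
    if checksum_from_scan(entry)
]
```

With this approach the directory entry never reaches a code path that tries to open it for hashing.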

sschuberth · Jan 12, 2017

I agree it makes sense to reuse an existing SHA1 for a file if it's known, but my understanding is that it's not known unless you pass -i to ScanCode, which we don't do currently to reduce the scanning time. Or am I mistaken?

pombredanne · Jan 12, 2017

Actually the file information is always collected, in particular because it is used for cache handling. Type and other related information are also always computed, as they are used by various scans too. So a sha1 is always computed whether or not you ask for it in the scan with -i, and it is only returned in the scan if you asked for the --info ... https://github.com/nexB/scancode-toolkit/blob/63c09f977bd74d3e6f2d402daf314823a0ffb3f1/src/scancode/cli.py#L465 is where this is always collected.

sschuberth · Jan 12, 2017

> So a sha1 is always computed whether or not you ask for it in the scan

Ah! That sort of explains why the scan is so slow ;-P

pombredanne · Jan 12, 2017

I am pretty sure the impact of a SHA1 on scanning times is pretty small. In particular, it allows caching and streaming results to support multiprocessing, and a side effect of caching is that a file that occurs multiple times in a codebase is scanned only once.

Though it could be worth measuring it of course.
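To illustrate the caching side effect described above, here is a minimal sketch of a content-addressed cache keyed by SHA1. It is not ScanCode's actual cache; `ContentCache` and its `scanner` callable are hypothetical names used only for this example.

```python
import hashlib


class ContentCache:
    """Toy cache that runs a scanner once per distinct file content.

    Hypothetical illustration; ScanCode's real cache is implemented differently.
    """

    def __init__(self, scanner):
        self._scanner = scanner  # callable taking bytes, returning a result
        self._results = {}       # SHA1 hex digest -> scan result
        self.scans_run = 0       # number of times the scanner actually ran

    def scan(self, content):
        key = hashlib.sha1(content).hexdigest()
        if key not in self._results:
            # First time this content is seen: run the (expensive) scan.
            self._results[key] = self._scanner(content)
            self.scans_run += 1
        return self._results[key]
```

Scanning the same bytes twice through such a cache runs the scanner only once, which is the "scanned only once" behavior mentioned above.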

sschuberth · Jan 12, 2017

So, I tried to read the SHA1 from the cache, but it fails for now, see PR #449. The file name is not found in the hash table, it seems:

  File "scancode-toolkit/src/scancode/cli.py", line 702, in save_results
    file_entry.chk_sum = Algorithm('SHA1', cache.get_info(file_entry.name).get('sha1'))
AttributeError: 'NoneType' object has no attribute 'get'

Would you mind having a look to point me in the right direction?
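The AttributeError suggests the lookup returned None for that name. A defensive sketch, assuming only that a cache miss yields None as the traceback implies, could look like this. `FakeCache` is a stand-in written for this example and `cached_sha1` is a hypothetical helper; neither reflects the real cache API beyond the `get_info` name seen above.

```python
class FakeCache:
    """Stand-in for the real cache: maps a path to an info dict, None on a miss."""

    def __init__(self, infos):
        self._infos = infos

    def get_info(self, path):
        return self._infos.get(path)


def cached_sha1(cache, path):
    """Return the cached SHA1 for `path`, or None if the path is not cached."""
    info = cache.get_info(path)
    if info is None:
        # Cache miss: avoid calling .get() on None.
        return None
    return info.get('sha1')
```

This only avoids the crash; it does not explain why the file name is missing from the cache in the first place.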

pombredanne added bug GUI and outputs labels Jan 12, 2017

sschuberth added a commit to sschuberth/scancode-toolkit that referenced this issue Jan 12, 2017

sschuberth cli: Read a file's SHA1 from the cache instead of recalculating it
Closes #448.
0cde194
