Creating spdx-tv output sometimes fails with an exception #448

Open
sschuberth opened this Issue Jan 12, 2017 · 6 comments

@sschuberth
Collaborator

Not sure yet what's going on, but I'd like to document it here. For me running

scancode --diag --timeout 180 -n 2 -f spdx-tv report.spdx

might end up with

  File "/home/jenkins/jobs/scancode/workspace/tools/scancode/src/scancode/cli.py", line 335, in scancode
    save_results(files_count, results, format, input, output_file)
  File "/home/jenkins/jobs/scancode/workspace/tools/scancode/src/scancode/cli.py", line 700, in save_results
    file_entry.chk_sum = Algorithm('SHA1', file_entry.calc_chksum())
  File "/home/jenkins/jobs/scancode/workspace/tools/scancode/local/lib/python2.7/site-packages/spdx/file.py", line 151, in calc_chksum
    with open(self.name, 'rb') as file_handle:
IOError: [Errno 21] Is a directory: u'bin'

@pombredanne In Algorithm('SHA1', file_entry.calc_chksum()) the file_entry comes from here, so it looks like file_data['path'] might contain the path to a directory instead of to a file. Is that correct / by design?

@pombredanne
Member

@sschuberth yes, by design directories are also returned in a scan. Rather than letting the spdx library compute the SHA1 with file_entry.calc_chksum(), you should IMHO always get the info scan and use the SHA1 that has already been computed there. It will be empty for directories.
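
A minimal sketch of the suggestion above, in Python 3 rather than the Python 2.7 shown in the traceback. It assumes each scanned entry is a dict with a 'path' key and an optional 'sha1' key; the function name sha1_for_entry and the exact dict shape are hypothetical and not taken from ScanCode. The idea is to reuse a known checksum, skip directories, and hash the file only as a last resort:

```python
import hashlib
import os


def sha1_for_entry(file_data):
    """Return a SHA1 hex digest for a scanned entry, or None for a directory.

    `file_data` is a hypothetical dict with a 'path' key and an optional
    'sha1' key; it only illustrates the idea and is not ScanCode's data model.
    """
    known = file_data.get('sha1')
    if known:
        # Reuse the checksum that the scan already computed.
        return known

    path = file_data['path']
    if os.path.isdir(path):
        # Directories have no checksum; opening one is what raised IOError.
        return None

    digest = hashlib.sha1()
    with open(path, 'rb') as file_handle:
        # Read in chunks so that large files are not loaded into memory at once.
        for chunk in iter(lambda: file_handle.read(65536), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

A caller would then set the SPDX checksum only when the returned value is not None.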

@sschuberth
Collaborator

I agree it makes sense to reuse an existing SHA1 for a file if it's known, but my understanding is that it's not known unless you pass -i to ScanCode, which we don't do currently to reduce the scanning time. Or am I mistaken?

@pombredanne
Member

Actually, the file information is always collected, in particular because it is used for cache handling. File type and related information are also always computed, as they are used by various scans too. So a sha1 is always computed whether or not you ask for it in the scan with -i, and it is only returned in the scan results if you asked for --info ... https://github.com/nexB/scancode-toolkit/blob/63c09f977bd74d3e6f2d402daf314823a0ffb3f1/src/scancode/cli.py#L465 is where this is always collected.

@sschuberth
Collaborator

So a sha1 is always computed whether or not you ask for it in the scan

Ah! That sort of explains why the scan is so slow ;-P

@pombredanne
Member

I am pretty sure the impact of a SHA1 on scanning time is small. In particular, it allows caching and streaming results to support multiprocessing, and a side effect of caching is that a file is scanned only once in a codebase that contains the same file multiple times.

Though it could be worth measuring it of course.
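
To illustrate the caching side effect described above, here is a rough Python sketch of a cache keyed by content SHA1, so that identical files are scanned only once. The names scan_all and scan_func are made up for this example and the in-memory dict stands in for whatever cache ScanCode actually uses:

```python
import hashlib


def scan_all(files, scan_func):
    """Scan each file, reusing results for files with identical content.

    `files` maps a path to its bytes content and `scan_func` is any callable
    that takes the content and returns a result. Both are illustrative
    stand-ins, not ScanCode APIs.
    Returns (results keyed by path, number of scans actually run).
    """
    cache = {}    # SHA1 hex digest -> scan result
    results = {}  # path -> scan result
    for path, content in files.items():
        key = hashlib.sha1(content).hexdigest()
        if key not in cache:
            # First time this content is seen: run the (expensive) scan.
            cache[key] = scan_func(content)
        results[path] = cache[key]
    return results, len(cache)
```

With two paths that hold the same bytes, scan_func runs once and both paths share the result. Whether hashing is cheap relative to the scan itself would still need to be measured.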

@sschuberth
Collaborator
sschuberth commented Jan 12, 2017 edited

So, I tried to read the SHA1 from the cache, but it fails for now, see PR #449. It seems the file name is not found in the hash table:

  File "scancode-toolkit/src/scancode/cli.py", line 702, in save_results
    file_entry.chk_sum = Algorithm('SHA1', cache.get_info(file_entry.name).get('sha1'))
AttributeError: 'NoneType' object has no attribute 'get'

Would you mind having a look to point me in the right direction?
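
The AttributeError above suggests that the lookup returned None for this name. A defensive sketch in Python follows; it assumes only that the lookup callable returns either None or a dict-like object with a 'sha1' entry, which is inferred from the traceback and not confirmed. The helper name cached_sha1 is hypothetical. This only avoids the crash and does not explain why the name is missing from the cache:

```python
def cached_sha1(get_info, name):
    """Return the cached SHA1 for `name`, or None if there is no cache entry.

    `get_info` is any callable that returns a dict-like object or None, for
    example a bound method such as cache.get_info from the traceback. That it
    returns None on a miss is an assumption based on the AttributeError.
    """
    info = get_info(name)
    if info is None:
        # No cache entry under this name: report "unknown" instead of crashing.
        return None
    return info.get('sha1')
```

The caller can then decide whether to skip the checksum or fall back to computing it when None comes back.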

sschuberth added a commit to sschuberth/scancode-toolkit that referenced this issue Jan 12, 2017
cli: Read a file's SHA1 from the cache instead of recalculating it
Closes #448.
0cde194