OMI Data Pipeline

Projects that follow the best practices below can voluntarily self-certify and show that they've achieved an Open Source Security Foundation (OpenSSF) best practices badge.

If this is your project, please show your badge status on your project page! The badge level for project 9549 is in_progress.

These are the Passing level criteria.

        

 Basics 11/13

  • Identification

    This project is part of the Open Model Initiative's efforts to train open AI/ML foundation models. The Data Pipeline repository serves as a full end to end application and pipeline for the collection and curation of data necessary for the training process.

    What programming language(s) are used to implement the project?
  • Basic project website content


    The project website MUST succinctly describe what the software does (what problem does it solve?). [description_good]

    This information is currently in the GitHub README, https://github.com/Open-Model-Initiative/OMI-Data-Pipeline/ - It will be available in the future at openmodel.foundation



    The project website MUST provide information on how to: obtain, provide feedback (as bug reports or enhancements), and contribute to the software. [interact]

    This information is in the Contributing file of the main repository. https://github.com/Open-Model-Initiative/OMI-Data-Pipeline/



    The information on how to contribute MUST explain the contribution process (e.g., are pull requests used?) (URL required) [contribution]

    Non-trivial contribution file in repository: https://github.com/Open-Model-Initiative/OMI-Data-Pipeline/blob/main/CONTRIBUTING.md - However, it can be expanded to describe the contribution process in more detail.



    The information on how to contribute SHOULD include the requirements for acceptable contributions (e.g., a reference to any required coding standard). (URL required) [contribution_requirements]

    The Contributing file does not currently include the requirements for acceptable contributions, but it will be updated to include them.


  • FLOSS license

    What license(s) is the project released under?



    The software produced by the project MUST be released as FLOSS. [floss_license]

    The Apache-2.0 license is approved by the Open Source Initiative (OSI).



    It is SUGGESTED that any required license(s) for the software produced by the project be approved by the Open Source Initiative (OSI). [floss_license_osi]

    The Apache-2.0 license is approved by the Open Source Initiative (OSI).



    The project MUST post the license(s) of its results in a standard location in their source repository. (URL required) [license_location]

    Non-trivial license location file in repository: https://github.com/Open-Model-Initiative/OMI-Data-Pipeline/blob/main/LICENSE.


  • Documentation


    The project MUST provide basic documentation for the software produced by the project. [documentation_basics]

    This is currently a WIP project without having shipped the first version for public consumption. Documentation will be added in the future.



    The project MUST provide reference documentation that describes the external interface (both input and output) of the software produced by the project. [documentation_interface]

    This is currently a WIP project without having shipped the first version for public consumption. Documentation will be added in the future.


  • Other


    The project sites (website, repository, and download URLs) MUST support HTTPS using TLS. [sites_https]

    Given only https: URLs.



    The project MUST have one or more mechanisms for discussion (including proposed changes and issues) that are searchable, allow messages and topics to be addressed by URL, enable new people to participate in some of the discussions, and do not require client-side installation of proprietary software. [discussion]

    GitHub supports discussions on issues and pull requests.



    The project SHOULD provide documentation in English and be able to accept bug reports and comments about code in English. [english]

    All documentation and operations are performed in English.



    The project MUST be maintained. [maintained]





  • Public version-controlled source repository


    The project MUST have a version-controlled source repository that is publicly readable and has a URL. [repo_public]

    Repository on GitHub, which provides public git repositories with URLs.



    The project's source repository MUST track what changes were made, who made the changes, and when the changes were made. [repo_track]

    Repository on GitHub, which uses git. git can track the changes, who made them, and when they were made.



    To enable collaborative review, the project's source repository MUST include interim versions for review between releases; it MUST NOT include only final releases. [repo_interim]

    No releases have been made yet, but interim releases (Release Candidates) are planned.



    It is SUGGESTED that common distributed version control software be used (e.g., git) for the project's source repository. [repo_distributed]

    Repository on GitHub, which uses git. git is distributed.


  • Unique version numbering


    The project results MUST have a unique version identifier for each release intended to be used by users. [version_unique]

    No releases have been made yet, but version identifiers are planned for consolidated releases.



    It is SUGGESTED that the Semantic Versioning (SemVer) or Calendar Versioning (CalVer) version numbering format be used for releases. It is SUGGESTED that those who use CalVer include a micro level value. [version_semver]


    It is SUGGESTED that projects identify each release within their version control system. For example, it is SUGGESTED that those using git identify each release using git tags. [version_tags]

    No releases have been made yet, but tagging releases with git tags is planned.
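
    The version numbering criteria above (version_unique and version_semver) can be illustrated with a small check that a release script might run before tagging. This is only a sketch under stated assumptions: the function names are hypothetical, the SemVer pattern is a simplified reading of the Semantic Versioning format, and the CalVer layout (YYYY.MM.MICRO) is just one possible scheme. None of it is taken from the project's repository.

```python
import re
from collections import Counter

# Simplified SemVer: MAJOR.MINOR.PATCH with optional pre-release and build parts.
# Assumption: this is an illustrative pattern, not the project's release tooling.
SEMVER_RE = re.compile(
    r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*)?"
    r"(?:\+[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*)?"
)

# One possible CalVer layout that includes a micro level: YYYY.MM.MICRO.
CALVER_RE = re.compile(r"\d{4}\.(0?[1-9]|1[0-2])\.\d+")


def is_semver(version: str) -> bool:
    """Return True if the whole string matches the simplified SemVer pattern."""
    return SEMVER_RE.fullmatch(version) is not None


def is_calver_with_micro(version: str) -> bool:
    """Return True if the whole string matches YYYY.MM.MICRO."""
    return CALVER_RE.fullmatch(version) is not None


def find_duplicates(versions: list) -> list:
    """Return the sorted identifiers that appear more than once."""
    counts = Counter(versions)
    return sorted(v for v, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical release identifiers, including a release candidate.
    planned = ["1.0.0-rc.1", "1.0.0", "1.0.1"]
    print(all(is_semver(v) for v in planned))
    print(find_duplicates(planned))
```

    A check like this could run in CI before a git tag is created, so that a malformed or reused identifier is caught before a release is published.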


  • Release notes


    The project MUST provide, in each release, release notes that are a human-readable summary of major changes in that release to help users determine if they should upgrade and what the upgrade impact will be. The release notes MUST NOT be the raw output of a version control log (e.g., the "git log" command results are not release notes). Projects whose results are not intended for reuse in multiple locations (such as the software for a single website or service) AND employ continuous delivery MAY select "N/A". (URL required) [release_notes]

    No releases have been made yet, but we will include release information on the GitHub releases page: https://github.com/Open-Model-Initiative/OMI-Data-Pipeline/releases



    The release notes MUST identify every publicly known run-time vulnerability fixed in this release that already had a CVE assignment or similar when the release was created. This criterion may be marked as not applicable (N/A) if users typically cannot practically update the software themselves (e.g., as is often true for kernel updates). This criterion applies only to the project results, not to its dependencies. If there are no release notes or there have been no publicly known vulnerabilities, choose N/A. [release_notes_vulns]

    https://github.com/Open-Model-Initiative/OMI-Data-Pipeline/releases - We have not yet made a release and do not currently have publicly known vulnerabilities.


  • Bug-reporting process


    The project MUST provide a process for users to submit bug reports (e.g., using an issue tracker or a mailing list). (URL required) [report_process]

    The project SHOULD use an issue tracker for tracking individual issues. [report_tracker]

    The project MUST acknowledge a majority of bug reports submitted in the last 2-12 months (inclusive); the response need not include a fix. [report_responses]

    No bug reports have been submitted yet, but we plan to respond to them in a timely manner.



    The project SHOULD respond to a majority (>50%) of enhancement requests in the last 2-12 months (inclusive). [enhancement_responses]

    No enhancement requests have been submitted yet, but we plan to respond to them in a timely manner.



    The project MUST have a publicly available archive for reports and responses for later searching. (URL required) [report_archive]
  • Vulnerability report process


    The project MUST publish the process for reporting vulnerabilities on the project site. (URL required) [vulnerability_report_process]

    Site to be published.



    If private vulnerability reports are supported, the project MUST include how to send the information in a way that is kept private. (URL required) [vulnerability_report_private]

    A formal vulnerability reporting process has not yet been defined; it will be defined once releases are made.



    The project's initial response time for any vulnerability report received in the last 6 months MUST be less than or equal to 14 days. [vulnerability_report_response]

    A formal vulnerability reporting process has not yet been defined and will be once releases are made, so response times cannot yet be demonstrated.


  • Working build system


    If the software produced by the project requires building for use, the project MUST provide a working build system that can automatically rebuild the software from source code. [build]

    The build system is being developed alongside the code.



    It is SUGGESTED that common tools be used for building the software. [build_common_tools]

    Common tools are used to build the software.



    The project SHOULD be buildable using only FLOSS tools. [build_floss_tools]

    The software is buildable using only FLOSS tools.


  • Automated test suite


    The project MUST use at least one automated test suite that is publicly released as FLOSS (this test suite may be maintained as a separate FLOSS project). The project MUST clearly show or document how to run the test suite(s) (e.g., via a continuous integration (CI) script or via documentation in files such as BUILD.md, README.md, or CONTRIBUTING.md). [test]

    The project has multiple test suites. Some, such as Playwright, run in CI/CD; others are run via task files. GETTING_STARTED.md covers running "task test-all" however



    A test suite SHOULD be invocable in a standard way for that language. [test_invocation]

    Warning: Requires lengthier justification.



    It is SUGGESTED that the test suite cover most (or ideally all) the code branches, input fields, and functionality. [test_most]

    We do not currently calculate any code coverage metrics. Will add an issue.



    It is SUGGESTED that the project implement continuous integration (where new or changed code is frequently integrated into a central code repository and automated tests are run on the result). [test_continuous_integration]

    Done for Playwright but not for our other test sets. Will add an issue.


  • New functionality testing


    The project MUST have a general policy (formal or not) that as major new functionality is added to the software produced by the project, tests of that functionality should be added to an automated test suite. [test_policy]

    We will convey via word of mouth a standard policy for adding tests.



    The project MUST have evidence that the test_policy for adding tests has been adhered to in the most recent major changes to the software produced by the project. [tests_are_added]

    No releases yet.



    It is SUGGESTED that this policy on adding tests (see test_policy) be documented in the instructions for change proposals. [tests_documented_added]

    See above.


  • Warning flags


    The project MUST enable one or more compiler warning flags, a "safe" language mode, or use a separate "linter" tool to look for code quality errors or common simple mistakes, if there is at least one FLOSS tool that can implement this criterion in the selected language. [warnings]

    We have Flake8 set up for Python linting.



    The project MUST address warnings. [warnings_fixed]

    We address warnings where configured.



    It is SUGGESTED that projects be maximally strict with warnings in the software produced by the project, where practical. [warnings_strict]

  • Secure development knowledge


    The project MUST have at least one primary developer who knows how to design secure software. (See ‘details’ for the exact requirements.) [know_secure_design]


    At least one of the project's primary developers MUST know of common kinds of errors that lead to vulnerabilities in this kind of software, as well as at least one method to counter or mitigate each of them. [know_common_errors]

  • Use basic good cryptographic practices

    Note that some software does not need to use cryptographic mechanisms. If your project produces software that (1) includes, activates, or enables encryption functionality, and (2) might be released from the United States (US) to outside the US or to a non-US-citizen, you may be legally required to take a few extra steps. Typically this just involves sending an email. For more information, see the encryption section of Understanding Open Source Technology & US Export Controls.

    The software produced by the project MUST use, by default, only cryptographic protocols and algorithms that are publicly published and reviewed by experts (if cryptographic protocols and algorithms are used). [crypto_published]


    If the software produced by the project is an application or library, and its primary purpose is not to implement cryptography, then it SHOULD only call on software specifically designed to implement cryptographic functions; it SHOULD NOT re-implement its own. [crypto_call]


    All functionality in the software produced by the project that depends on cryptography MUST be implementable using FLOSS. [crypto_floss]


    The security mechanisms within the software produced by the project MUST use default keylengths that at least meet the NIST minimum requirements through the year 2030 (as stated in 2012). It MUST be possible to configure the software so that smaller keylengths are completely disabled. [crypto_keylength]


    The default security mechanisms within the software produced by the project MUST NOT depend on broken cryptographic algorithms (e.g., MD4, MD5, single DES, RC4, Dual_EC_DRBG), or use cipher modes that are inappropriate to the context, unless they are necessary to implement an interoperable protocol (where the protocol implemented is the most recent version of that standard broadly supported by the network ecosystem, that ecosystem requires the use of such an algorithm or mode, and that ecosystem does not offer any more secure alternative). The documentation MUST describe any relevant security risks and any known mitigations if these broken algorithms or modes are necessary for an interoperable protocol. [crypto_working]


    The default security mechanisms within the software produced by the project SHOULD NOT depend on cryptographic algorithms or modes with known serious weaknesses (e.g., the SHA-1 cryptographic hash algorithm or the CBC mode in SSH). [crypto_weaknesses]


    The security mechanisms within the software produced by the project SHOULD implement perfect forward secrecy for key agreement protocols so a session key derived from a set of long-term keys cannot be compromised if one of the long-term keys is compromised in the future. [crypto_pfs]


    If the software produced by the project causes the storing of passwords for authentication of external users, the passwords MUST be stored as iterated hashes with a per-user salt by using a key stretching (iterated) algorithm (e.g., Argon2id, Bcrypt, Scrypt, or PBKDF2). See also OWASP Password Storage Cheat Sheet. [crypto_password_storage]


    The security mechanisms within the software produced by the project MUST generate all cryptographic keys and nonces using a cryptographically secure random number generator, and MUST NOT do so using generators that are cryptographically insecure. [crypto_random]
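
    The password storage and random number criteria above (crypto_password_storage and crypto_random) can be sketched with the Python standard library alone. This is a minimal illustration, assuming PBKDF2-HMAC-SHA256 via hashlib and a salt drawn from the secrets module. The function names, salt length, and iteration count are hypothetical choices for the example, not settings confirmed by the project, and a real deployment might prefer Argon2id or another algorithm named in the criterion.

```python
import hashlib
import hmac
import secrets

# Illustrative parameters (assumptions, not project settings).
PBKDF2_ITERATIONS = 600_000
SALT_BYTES = 16


def hash_password(password: str, iterations: int = PBKDF2_ITERATIONS):
    """Return (salt, derived_key) for a new password.

    The salt is generated per call from the operating system's CSPRNG,
    so two users with the same password get different stored values.
    """
    salt = secrets.token_bytes(SALT_BYTES)
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, key


def verify_password(password: str, salt: bytes, expected_key: bytes,
                    iterations: int = PBKDF2_ITERATIONS) -> bool:
    """Re-derive the key with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(candidate, expected_key)


if __name__ == "__main__":
    salt, key = hash_password("example passphrase")
    print(verify_password("example passphrase", salt, key))
    print(verify_password("wrong passphrase", salt, key))
```

    The salt and iteration count would be stored next to the derived key so that verification can repeat the same derivation; hmac.compare_digest is used so the comparison does not leak timing information.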

  • Secured delivery against man-in-the-middle (MITM) attacks


    The project MUST use a delivery mechanism that counters MITM attacks. Using https or ssh+scp is acceptable. [delivery_mitm]

    Will use https on launch.



    A cryptographic hash (e.g., a sha1sum) MUST NOT be retrieved over http and used without checking for a cryptographic signature. [delivery_unsigned]

  • Publicly known vulnerabilities fixed


    There MUST be no unpatched vulnerabilities of medium or higher severity that have been publicly known for more than 60 days. [vulnerabilities_fixed_60_days]


    Projects SHOULD fix all critical vulnerabilities rapidly after they are reported. [vulnerabilities_critical_fixed]

  • Other security issues


    The public repositories MUST NOT leak a valid private credential (e.g., a working password or private key) that is intended to limit public access. [no_leaked_credentials]

  • Static code analysis


    At least one static code analysis tool (beyond compiler warnings and "safe" language modes) MUST be applied to any proposed major production release of the software before its release, if there is at least one FLOSS tool that implements this criterion in the selected language. [static_analysis]

    SonarCloud and Dependabot



    It is SUGGESTED that at least one of the static analysis tools used for the static_analysis criterion include rules or approaches to look for common vulnerabilities in the analyzed language or environment. [static_analysis_common_vulnerabilities]


    All medium and higher severity exploitable vulnerabilities discovered with static code analysis MUST be fixed in a timely way after they are confirmed. [static_analysis_fixed]


    It is SUGGESTED that static source code analysis occur on every commit or at least daily. [static_analysis_often]

  • Dynamic code analysis


    It is SUGGESTED that at least one dynamic analysis tool be applied to any proposed major production release of the software before its release. [dynamic_analysis]

    In the future, we can add an item to adopt OWASP ZAP as policy and run it as a GitHub Action: https://github.com/zaproxy/action-baseline



    It is SUGGESTED that if the software produced by the project includes software written using a memory-unsafe language (e.g., C or C++), then at least one dynamic tool (e.g., a fuzzer or web application scanner) be routinely used in combination with a mechanism to detect memory safety problems such as buffer overwrites. If the project does not produce software written in a memory-unsafe language, choose "not applicable" (N/A). [dynamic_analysis_unsafe]

    Memory safe languages



    It is SUGGESTED that the project use a configuration for at least some dynamic analysis (such as testing or fuzzing) which enables many assertions. In many cases these assertions should not be enabled in production builds. [dynamic_analysis_enable_assertions]


    All medium and higher severity exploitable vulnerabilities discovered with dynamic code analysis MUST be fixed in a timely way after they are confirmed. [dynamic_analysis_fixed]

    See above - not implemented yet.



This data is available under the Community Data License Agreement – Permissive, Version 2.0 (CDLA-Permissive-2.0). This means that a Data Recipient may share the Data, with or without modifications, so long as the Data Recipient makes available the text of this agreement with the shared Data. Please credit Kent Keirsey and the OpenSSF Best Practices badge contributors.

Project badge entry owned by: Kent Keirsey.
Entry created on 2024-10-15 11:50:10 UTC, last updated on 2024-11-08 01:35:25 UTC.
